Porting From SQL With NHibernate to RavenDB

In this post I give an overview of porting the persistence implementation of a Domain-Driven Design project from a solution based on SQL Server and NHibernate to one based on RavenDB. I describe the problem domain, the current stack, the project goals, the database selection process, as well as serialization, indexing and optimization patterns.

Domain

The domain is product pricing for an online retailer, and accordingly the primary functionality is the calculation of prices for products based on multiple variables. The central aggregate in this domain is a product pricing strategy, which consists of price calculations for each supplier of the product and is keyed by the ID of the product. Prices are updated for a variety of reasons and the trigger can be event based or manual. The events which trigger a price update include any events that may result in a price change, such as a supplier cost change. Manually triggered price updates are invoked by members of the ecommerce team and happen for a variety of reasons, such as seasonal experimentation. Manual updates are initiated by a selection query which defines the set of products to update. This selection query filters by attributes belonging to products, such as product category, as well as attributes belonging to the product’s pricing strategy, such as margin. Products themselves are aggregates from a different bounded context and can be entities or value objects in the pricing context. Currently, the system manages pricing for approximately 2 million products.

Stack

The system runs on the .NET platform, uses SQL Server for persistence and NHibernate as the ORM. Messaging is implemented with NServiceBus. Business logic is implemented using Domain-Driven Design. The functionality is exposed via an HTML interface delivered by ASP.NET MVC and ASP.NET WebAPI.

Goals

The port project had several goals, including a streamlined mapping with the persistence layer, more advanced querying capabilities and moving as much logic out of the database as possible. A performance improvement was to be a bonus. Additionally, severing the dependence on SQL Server was certainly a plus. Given these constraints and the fact that the existing system was implemented using Domain-Driven Design, a NoSQL document database was deemed an appropriate solution. The document database had to be able to both store and query serialized representations of domain entities. Additionally, some of the more elaborate querying requirements would call for Map/Reduce capabilities.

NoSQL database selection

There are numerous NoSQL databases available in the wild that could satisfy the goals of the project. Among the ones considered were RavenDB, MongoDB, CouchDB, SimpleDB, DynamoDB, Azure Table Storage, Cassandra, HBase and Redis. SimpleDB does provide decent querying capabilities, but tables are limited to 10GB and performance seemed questionable. DynamoDB, while highly performant, is a key-value store, which means that it doesn’t support secondary indexes. Azure Table Storage has similar drawbacks. Cassandra and HBase can both fulfill all stated requirements, however the configuration and calibration of all the moving parts seemed like a hassle. Overall, they are more focused on performance and immense scalability than ease of use. Redis is better suited for storing performance-sensitive, soft real-time data and it is also a key-value store. Map/Reduce is possible with most of these databases, but it often requires additional infrastructure and configuration. A good overview of the various NoSQL databases is provided by Kristof Kovacs.

It came down to a decision between RavenDB, MongoDB and CouchDB. All three store serialized representations of entities, support complex queries as well as Map/Reduce. MongoDB has amassed an impressive client list and its indexes are easy to grasp for SQL developers. In CouchDB, all querying is implemented with views which are generated via Map/Reduce. In RavenDB, queries are served by Lucene and the corresponding indexes are defined with LINQ which also supports Map/Reduce declarations. After a bit of exploration, I discovered that a significant advantage that RavenDB had over the others was that its Map/Reduce indexes were updated automatically based on changes in related documents. MongoDB would require an out of band scheduled process to update Map/Reduce collections and CouchDB, by default, only updates Map/Reduce views on demand, not immediately after a document update. While MongoDB indexes are updated in place, some of the more intricate querying requirements would require use of Map/Reduce, as stated. Another advantage RavenDB had over the others is its support of transactions across multiple documents. This is an important feature because many use cases, in addition to storing and updating an entity, store a domain event.
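To illustrate, here is a minimal sketch of that pattern using the RavenDB client, with hypothetical PricingStrategy and PriceChanged types standing in for the real domain model:

```csharp
// documentStore is an already-initialized IDocumentStore; the domain types are illustrative.
using (var session = documentStore.OpenSession())
{
    var strategy = session.Load<PricingStrategy>("pricingstrategies/1234");
    var priceChanged = strategy.RecalculatePrices(); // returns a domain event

    session.Store(priceChanged);

    // The tracked changes to the strategy and the new event document are
    // written to the server as a single transactional batch.
    session.SaveChanges();
}
```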

During data migration, MongoDB exhibited better write performance - almost an order of magnitude faster than RavenDB. In testing actual use cases, however, RavenDB was on par with MongoDB. Existing use cases involved loading a single aggregate, performing an operation and persisting the resulting changes and domain events. Since migration would only happen once, the advantage MongoDB had in terms of write performance was negligible overall.

Additionally, RavenDB runs on the .NET platform and is open-source. I’ve taken advantage of that on several occasions either by looking at the code to understand certain behavior, or writing extensions to existing functionality. Ultimately, RavenDB was selected as the NoSQL document database for this project.

Anatomy of RavenDB

RavenDB is built on the .NET platform and makes use of two major components, the Esent database engine for document storage and Lucene for indexes. Document storage is ACID while Lucene indexes are updated in the background in an eventually consistent manner. This is effectively CQRS right out of the box! The trade-offs behind the CAP theorem imply a continuum between ACID and BASE, and for this particular project RavenDB resides in a sweet spot between the two. For serialization, RavenDB uses JSON via the Json.NET library. The Unit of Work pattern is supported in a way similar to NHibernate through the IDocumentSession interface, which is yet another advantage RavenDB has over the competition. RavenDB can also enlist as a durable resource manager in System.Transactions transactions. Aside from enabling transactional semantics across multiple document sessions, this is helpful for database integration testing by allowing rollback of changes resulting from a test. The NHibernate solution already contained such integration tests, which ended up being completely portable. One caveat is that changes aren’t visible until the transaction commits, which is slightly different from the behavior of SQL Server.
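A minimal sketch of such a rollback-based integration test, assuming an initialized documentStore, a hypothetical PricingStrategy aggregate and an NUnit-style test method:

```csharp
// TransactionScope comes from System.Transactions; disposing the scope without
// calling Complete() rolls back everything the test wrote.
[Test]
public void Stores_a_pricing_strategy()
{
    using (new TransactionScope())
    using (var session = documentStore.OpenSession())
    {
        session.Store(new PricingStrategy { ProductId = "products/42" });
        session.SaveChanges();

        // ... assertions against the session go here ...
    } // no Complete() - the database is left untouched
}
```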

Serialization

RavenDB automatically serializes objects upon persistence and deserializes them upon reconstitution. This serialization can be customized in several ways. The simplest way is to use the DataMember attribute from the System.Runtime.Serialization namespace. For deeper control, serialization attributes from the Json.NET library can be used.
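For example, a hypothetical aggregate decorated with Json.NET attributes could look like this (depending on the RavenDB build, the attributes may live in the internalized Raven.Imports.Newtonsoft.Json namespace rather than Newtonsoft.Json):

```csharp
public class PricingStrategy // hypothetical aggregate, for illustration only
{
    public string Id { get; set; }

    [JsonProperty("ProductId")]   // persist under an explicit field name
    public string ProductId { get; set; }

    public decimal Margin { get; set; }

    [JsonIgnore]                  // derived value, never persisted
    public bool IsBelowMinimumMargin
    {
        get { return Margin < 0.05m; }
    }
}
```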

Beyond that, it is always possible to use an intermediate object which serves as a DTO between the object model and the database. In this way, the object model is completely decoupled from the persistence layer. However, the intermediate object approach has several drawbacks. One drawback is that it adds a layer of indirection which is prone to error and requires maintenance. Another is that it complicates use of the Unit of Work pattern, because RavenDB will track changes to the persisted DTO object, not the corresponding domain object.

The best approach to serialization is by convention, which can be done by customizing the JsonSerializer and the JsonContractResolver used by RavenDB as described here. The default conventions are mostly satisfactory, except that public properties without a setter are also serialized, which can throw an exception upon deserialization. It should also be noted that currently, RavenDB creates a new instance of the JsonSerializer when deserializing each individual document. This means that if a customized JsonContractResolver is provided, cache sharing must be enabled so that performance doesn’t suffer.
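A minimal sketch of such a convention, using Json.NET's DefaultContractResolver as the base; how the resolver is plugged into the document store conventions varies between client versions:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;
using Newtonsoft.Json.Serialization;

// Serialize only properties that can be written back during deserialization.
// shareCache: true keeps contract resolution fast even though RavenDB may
// create a fresh JsonSerializer per document.
public class WritablePropertiesOnlyResolver : DefaultContractResolver
{
    public WritablePropertiesOnlyResolver() : base(shareCache: true) { }

    protected override IList<JsonProperty> CreateProperties(Type type, MemberSerialization memberSerialization)
    {
        return base.CreateProperties(type, memberSerialization)
                   .Where(property => property.Writable)
                   .ToList();
    }
}
```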

Another applicable axis of customization is overriding the type names stored for polymorphic properties in the $type field. By default, Json.NET will generate a type discriminator value which is the namespace and assembly qualified name of the type. While this gets the job done, a cleaner, more portable and compact approach is to store a context specific discriminator value. This can be done by overriding the default implementation of SerializationBinder as described here.
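A sketch of such a binder, with hypothetical calculation types standing in for the real polymorphic hierarchy; Json.NET picks it up through the Binder setting on its serializer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.Serialization;

public class PricingTypeBinder : SerializationBinder
{
    // Map compact, context-specific discriminators to the hypothetical calculation types.
    private static readonly Dictionary<string, Type> Types = new Dictionary<string, Type>
    {
        { "fixed-price",  typeof(FixedPriceCalculation)  },
        { "margin-based", typeof(MarginPriceCalculation) },
    };

    public override void BindToName(Type serializedType, out string assemblyName, out string typeName)
    {
        assemblyName = null;
        typeName = Types.Single(pair => pair.Value == serializedType).Key;
    }

    public override Type BindToType(string assemblyName, string typeName)
    {
        return Types[typeName];
    }
}
```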

Refactoring

Serialization is also important in the context of refactoring. If domain entities are persisted directly, then changing the name of a member, or the name or namespace of a type, will break deserialization of existing documents. Changes to member names can be handled either by maintaining a serialization attribute with a hard-coded field name or by using the Patching API to propagate the changes to the entire document set. The former breaks persistence ignorance while the latter can require a lot of processing time. Despite these issues, document databases still afford a far greater degree of persistence ignorance than their relational counterparts.
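As an illustration, a rename could be pushed through the entire document set roughly as follows, assuming the 1.0-era patching commands and the built-in Raven/DocumentsByEntityName index (field names are hypothetical):

```csharp
// Rename the "Margin" field to "TargetMargin" on every document in the
// PricingStrategies collection. Runs server side; no documents are loaded by the client.
documentStore.DatabaseCommands.UpdateByIndex(
    "Raven/DocumentsByEntityName",
    new IndexQuery { Query = "Tag:PricingStrategies" },
    new[]
    {
        new PatchRequest
        {
            Type = PatchCommandType.Rename,
            Name = "Margin",
            Value = new RavenJValue("TargetMargin")
        }
    },
    allowStale: false);
```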

Indexing

RavenDB indexes are defined through the IndexDefinition class. This class allows the declaration of one or more mapping functions and an optional reduce function using LINQ. It also allows the specification of a post-query result transformation function as well as field sorting and indexing options. The map functions select properties from existing documents for indexing, and only selected properties will be indexed and made queryable. The optional reduce function can be used to perform aggregation on the properties selected by the map functions, providing functionality roughly equivalent to SQL's GROUP BY clause.
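A sketch of a weakly typed definition, with an illustrative map that selects the attributes used for querying and an optional reduce that counts strategies per category:

```csharp
documentStore.DatabaseCommands.PutIndex(
    "PricingStrategies/CountByCategory",
    new IndexDefinition
    {
        // select the properties to index and make queryable
        Map = @"from strategy in docs.PricingStrategies
                select new { strategy.Category, Count = 1 }",

        // optional aggregation, roughly equivalent to GROUP BY
        Reduce = @"from result in results
                   group result by result.Category into g
                   select new { Category = g.Key, Count = g.Sum(x => x.Count) }"
    });
```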

Indexes can also be defined in a more strongly typed fashion by inheriting from AbstractIndexCreationTask as seen here. In this way, the map, reduce and transform functions are declared as compiled LINQ expressions in the IDE instead of as raw strings. For this project however, the strongly typed approach fell short because there was a requirement to index properties of polymorphic types as described here. As a result, I had to resort to the weakly typed IndexDefinition route.
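For comparison, the strongly typed flavor of a similar index, again using the hypothetical PricingStrategy type:

```csharp
public class PricingStrategies_ByCategoryAndMargin : AbstractIndexCreationTask<PricingStrategy>
{
    public PricingStrategies_ByCategoryAndMargin()
    {
        // compiled LINQ expression instead of a raw string
        Map = strategies => from strategy in strategies
                            select new { strategy.ProductId, strategy.Category, strategy.Margin };
    }
}

// typically registered once at startup:
// IndexCreation.CreateIndexes(typeof(PricingStrategies_ByCategoryAndMargin).Assembly, documentStore);
```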

The multi-map/reduce indexing functionality allows selection of properties from documents in multiple collections. This came in handy for this particular project because, as stated, the selection query needs to be able to filter by attributes of both products and pricing strategies. One way to achieve this is by storing the product and pricing strategy in a single document. In this way, the product is a value object of the corresponding pricing strategy, which is not problematic since products are already sourced from a different bounded context. However, given that products have a different life-cycle than the associated pricing strategies, and for purely exploratory purposes, a similar outcome was achieved by using a multi-map index. This type of index contains multiple map functions which can select documents from multiple collections, products and pricing strategies in particular. The results are effectively “joined” in the reduce function. The declaration of a multi-map index in this type of scenario can be a bit awkward. All of the map functions as well as the reduce function have to produce output of an identical shape. The product and pricing strategy documents contain entirely distinct properties, and thus the corresponding map functions had to declare properties that are not part of the source document as nulls. Moreover, the reduce function, after joining the documents on a common key, the product ID, had to select the non-null properties from the group for the final output. Perhaps a better API for this particular scenario would be something similar to the live projections API:
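```csharp
// Hypothetical API shape only - not something RavenDB exposes - borrowing the
// live projections signature to express the join at indexing time rather than
// null-padding both map outputs (document types are illustrative):
TransformResults = (database, strategies) =>
    from strategy in strategies
    let product = database.Load<Product>(strategy.ProductId)
    select new
    {
        strategy.ProductId,
        product.Category,
        strategy.Margin
    };
```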

By comparison, the live projections transform function is invoked at query time, not indexing time.

A caveat of RavenDB indexing is indexing speed. Indexes are updated in the background and the system remains responsive during the entire process, but generating the index for the selection query takes several hours and I’ve even experienced it taking over a day on occasion. I found that it is more performant to load all of the data before creating the indexes. This can affect the development, exploration and iteration workflow, because changing an index definition triggers a complete rebuild and therefore cannot be done casually. Generally, this is symptomatic of the shift from relational databases to document databases - no more ad hoc queries.

Optimization

RavenDB performs well out of the box but also provides several options for tweaking. For this project like many others, the biggest gain in performance was achieved with batching. As described, one of the use cases involved a selection query which was used to select the products for a price update. The query may select very large sets of products and strategies which would be queued for processing with NServiceBus. NServiceBus allows the batching of several messages into a single transport message. The transport message, which may contain multiple domain messages, is dequeued in a single unit of work and the individual domain messages are dispatched to the appropriate handlers. Andreas Ohlund described a unit of work adapter between NServiceBus and RavenDB which allows RavenDB to batch multiple update requests.
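A minimal sketch of such an adapter in the same spirit, assuming the IMessageModule extension point (details differ between NServiceBus versions): one session is opened per transport message so that every handler in the batch shares a single SaveChanges.

```csharp
public class RavenUnitOfWorkModule : IMessageModule
{
    private readonly IDocumentStore store;

    [ThreadStatic]
    private static IDocumentSession currentSession;

    public RavenUnitOfWorkModule(IDocumentStore store) { this.store = store; }

    // handlers resolve the ambient session from here
    public static IDocumentSession Current { get { return currentSession; } }

    public void HandleBeginMessage()
    {
        currentSession = store.OpenSession();
    }

    public void HandleEndMessage()
    {
        try
        {
            // one batched, transactional write for every handler in the transport message
            currentSession.SaveChanges();
        }
        finally
        {
            currentSession.Dispose();
            currentSession = null;
        }
    }

    public void HandleError()
    {
        if (currentSession == null) return;
        currentSession.Dispose();
        currentSession = null;
    }
}
```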

RavenDB includes were used in this project to batch requests to retrieve a product and the corresponding pricing strategy. Includes instruct RavenDB to retrieve a document together with an associated document which will be accessible through the same IDocumentSession. This is a great optimization because it happens behind the scenes and calling code doesn’t have to change.
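For example, with hypothetical document types where the pricing strategy stores its product's document ID:

```csharp
using (var session = documentStore.OpenSession())
{
    // one request fetches the strategy and pre-loads the referenced product
    var strategy = session.Include<PricingStrategy>(s => s.ProductId)
                          .Load("pricingstrategies/1234");

    // served from the session - no second round trip to the server
    var product = session.Load<Product>(strategy.ProductId);
}
```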

Another performance gain was achieved by altering the transaction mode with the Raven/TransactionMode setting. Setting the mode to “Lazy” is faster but can result in data loss in case of a server crash. The RavenDB settings documentation also recommends that indexes should be stored on a drive different from the data, though the performance benefit was negligible in this particular project.

Conclusion

All project goals were met, including a sizable performance gain. The mapping between the database and the code changes from an object-relational mapping problem to a serialization problem, which is much easier to manage. Furthermore, given Domain-Driven Design, the mapping between aggregate and document is one-to-one, which also addresses some of the transactional consistency concerns associated with document databases. The querying capabilities of RavenDB via Lucene are beyond those of a relational database. Lack of support for anything resembling stored procedures ensures that no business logic resides in the database - only data. Severing the dependence on SQL Server is beneficial for cost, resource and maintenance reasons, which are especially pronounced in a cloud environment. The overall performance gain was three-fold for this particular project, although the testing methods weren’t meant to be precise.

The central “drawback” of document databases in general is the paradigm shift in terms of both data modeling and tooling and management. There is greater pressure to design proper aggregate boundaries and to understand querying requirements up front. This, of course, isn’t a bad thing.

As a noteworthy piece of trivia, the entire RavenDB client API, including 3rd party libraries, can almost fit on an old-school floppy disk at 1.44 MB as of build 960. For comparison, the NHibernate v3.2 DLL weighs in at 3.9 MB. Together with all required libraries, an NHibernate stack can easily surpass the size of the RavenDB server DLLs which amount to 5.86 MB. Simplicity prevails!
