Skip to main content

Versioning Domain Entities with Ordinary SQL

In any system with mutable entities saved in an object store, it's inevitable that you'll want to audit the changes made to a particular entity. Explicitly versioning your entities solves this problem, and it's often pretty useful to end users of a sophisticated application, say for undo purposes. However, it's not always so clear how to transform a mutable entity into an versioned entity.

There are also a number of complications to versioning mutable objects:

  1. Handling parent-child relations such that a child change does not propagate to every parent
  2. Supporting efficient queries on history
  3. Supporting efficient queries for the newest version
  4. Space complexity of data representation
  5. Concurrent changes

Some of the above properties also have a natural tension, ie. supporting efficient queries often require storing more data so their space complexity is worse. I'm going to provide an overview here of a very simple schema for versioned entities with the following properties:

  1. Queries on history are just as efficient as queries on current data
  2. Queries on history have the same form as queries on current data, ie. no separate history table that must be handled specially -- in fact, you can use the exact same queries for both history and current data, thus eliminating all query duplication!
  3. Parents and children can evolve independently, ie. data update is "strictly disjoint access parallel" in transactional concurrency parlance: as long as two updates don't update the same object, they will both commit successfully

Schema

The schema changes required are quite straightforward for any entity with a primary key (which should be all of them!):

  1. Add a version number column of type int or long: this is a monotonically increasing version number, where version=0 is used when the object is first created, version=1 is after the first change is committed, etc.
  2. Ensure the PK index orders the version number in DESCENDING order, because the highest version number is the most current version
  3. Change the primary key to be composite over the existing PK and the new version number -- the original PK is now called the "logical id"
  4. Add a revisionDate column storing the date the version was committed
  5. Add an index over the logical id and the revisionDate, again with revisionDate in DESCENDING order

That's it! The original entity's PK is now a logical id rather than an actual instance id, so accessing a particular version of an entity now requires both the logical id and the version to access. The latest version is simply the entry with the largest version number.

Queries

Supposing we have an entity Foo whose PK is called "id", and :columnName is the syntax for a named query parameter. Ordinary queries on entities now become the following queries on versioned entities:

  • Insert a new entity:
    INSERT INTO Foo(id, ..., version, revDate)
    VALUES (:id, ..., 0, GETDATE())
  • Get the most current version:
    SELECT TOP 1 * FROM Foo
    WHERE id=:id
    ORDER BY version DESC
  • Get a version before a specific date/time:
    SELECT TOP 1 * FROM Foo
    WHERE id=:id AND revisionDate <= :rev
    ORDER BY version DESC
  • Update an entity:
    INSERT INTO Foo(id, ..., version, revDate)
    VALUES (:id, ..., :cur_version_plus_1, GETDATE())

Updates on entities are now inserts, so the entities at the data store level are now effectively immutable and the entity table is an append-only log of all changes to those entity types. Since the indexes are ordered so that the latest version is always first, the impact on accessing the most current version should be minimal. Accessing older versions are similarly easy, consisting of a query on the revisionDate column.

We use the revisionDate and not the version number when querying over history to easily support queries over more than one entity. Since each entity evolves its version independent of all others, the version numbers of two connected entities, say Foo[id=1] and Foo[id=2], are completely unrelated. The former might have version=999, and the latter version=1. The change that caused Foo[id=1] to increment its version from 998 to 999, could have also been the one to change Foo[id=2] from version 0 to 1. So if you look at Foo[id=1] at version=998, the only way to see the correct version of Foo[id=2] is to return the first version whose revisionDate is <= the Foo[id=1].revisionDate for version=998.

Fortunately, this makes the queries extremely simple, since a complicated query for a whole set of interconnected entities needs only a single date parameter. In other words, you can compose two independent queries using the same revisionDate parameter name via simple concatenation into a multiquery, and it just works.

Furthermore, your UI then only needs to pass around an optional revisionDate parameter to switch between queries on the latest data and queries on history. In fact, switching the query isn't even needed, because you can use the same query for both!

SELECT TOP 1 * FROM Foo
WHERE id=:id AND (:rev IS NULL OR revisionDate <= :rev)
ORDER BY version DESC

As promised, this query unifies the queries for the most current version, and any historical version of an entity. If the :rev parameter is set to NULL, then the most recent version is returned. The query optimizer can easily eliminate the redundant WHERE clause, so there should be no performance impact. If :rev is present, the redundant NULL check should be eliminated by the query optimizer and the last change before the given date/time would be returned. Alternately, you can simply always use the query with the revisionDate and simply pass in DateTime.Now, which will always return the latest result.

Concurrency

Being strictly disjoint access parallel, this schema on mutable entities is pretty much as good as it gets concurrency-wise. The entity table is essentially an append-only log of changes, and the new composite primary key ensures that any entity can undergo only one change event at a time. This ensures that any invariants in the application's domain model are preserved once committed to the entity store.

This is also the ideal optimistic concurrency model for any CRUD application.

Space Complexity

The space complexity is obviously sub-optimal since every version is stored. However, managing this space is a breeze, since you can easily write a query to delete all entity history before the current version, or keep the N latest versions, or basically any storage policy that makes sense for your business.

One serious concern are "large" columns in an entity, say an unbounded text column, or binary data. In this case, you should perform an additional refactoring of your schema to model git's way of doing things: add a Blob table that holds the data indexed by its hash, then just store the hash in the entity table. This keeps it compact, and ensures you only ever store one version of a binary blob, which you should probably be doing anyway.

Variations

There are a number of variations on this schema, like normalizing the revision date into a separate Revisions table to which you can add a userId to blame for the change, etc. But then the Revision table becomes a central bottleneck to insert performance and updates are no longer strictly disjoint access parallel. Furthermore, all select queries must now join on this Revisions table to find the revision date. I think the schema I've laid out provides the optimal performance tradeoff with little enough space overhead.

Versioned ORM

Standardizing this schema as the default operating mode for an ORM would be ideal. This sort of ubiquitous schema management should be automatic, and now that ORMs are moving to code-first designs where they manage the whole schema anyway, this should be a snap right?

The API interacting with the ORM remains largely unchanged, but all change submissions transparently become inserts, and all select queries automatically have the universal query form described above. A query on history would then simply be a query that's started with a date or version parameter, ie. instead of context.Foo.Where(...), it would be something like context.Foo.Before(DateTime?).Where(...).

Summary

I'm not the first to describe this schema for versioning, but I haven't seen a description quite so comprehensive, detailing the queries and its concurrency properties.

These days most of the buzz is around CQRS and event sourcing. This schema has some connections to event sourcing, since entity tables are now append-only logs with the entire history remaining intact. However, the schema of the entity change events aren't domain events as they are in event sourcing.

CRUD applications don't get much attention anymore, but I agree with Martin Fowler that most simple domains are still better served by a CRUD model than by CQRS, at least given our current tools. Domain entities usually define proper concurrency boundaries (and if they don't, your domain model is probably wrong!), so this schema is a natural fit that still ensures a domain model can naively enforce its invariants in ordinary code, and if those changes succeed then those invariants are preserved in the entity store.

Comments

Popular posts from this blog

async.h - asynchronous, stackless subroutines in C

The async/await idiom is becoming increasingly popular. The first widely used language to include it was C#, and it has now spread into JavaScript and Rust. Now C/C++ programmers don't have to feel left out, because async.h is a header-only library that brings async/await to C! Features: It's 100% portable C. It requires very little state (2 bytes). It's not dependent on an OS. It's a bit simpler to understand than protothreads because the async state is caller-saved rather than callee-saved. #include "async.h" struct async pt; struct timer timer; async example(struct async *pt) { async_begin(pt); while(1) { if(initiate_io()) { timer_start(&timer); await(io_completed() || timer_expired(&timer)); read_data(); } } async_end; } This library is basically a modified version of the idioms found in the Protothreads library by Adam Dunkels, so it's not truly ground bre

Building a Query DSL in C#

I recently built a REST API prototype where one of the endpoints accepted a string representing a filter to apply to a set of results. For instance, for entities with named properties "Foo" and "Bar", a string like "(Foo = 'some string') or (Bar > 99)" would filter out the results where either Bar is less than or equal to 99, or Foo is not "some string". This would translate pretty straightforwardly into a SQL query, but as a masochist I was set on using Google Datastore as the backend, which unfortunately has a limited filtering API : It does not support disjunctions, ie. "OR" clauses. It does not support filtering using inequalities on more than one property. It does not support a not-equal operation. So in this post, I will describe the design which achieves the following goals: A backend-agnostic querying API supporting arbitrary clauses, conjunctions ("AND"), and disjunctions ("OR"). Implemen

Easy Automatic Differentiation in C#

I've recently been researching optimization and automatic differentiation (AD) , and decided to take a crack at distilling its essence in C#. Note that automatic differentiation (AD) is different than numerical differentiation . Math.NET already provides excellent support for numerical differentiation . C# doesn't seem to have many options for automatic differentiation, consisting mainly of an F# library with an interop layer, or paid libraries . Neither of these are suitable for learning how AD works. So here's a simple C# implementation of AD that relies on only two things: C#'s operator overloading, and arrays to represent the derivatives, which I think makes it pretty easy to understand. It's not particularly efficient, but it's simple! See the "Optimizations" section at the end if you want a very efficient specialization of this technique. What is Automatic Differentiation? Simply put, automatic differentiation is a technique for calcu