Immutable Versionable Objects (IVO) a.k.a. The Git Object Model

Git is a distributed, content-addressable and versionable file system that is primarily used as a revision control system for project source code. That’s great and all, but what interests us here is less what git does than how it does it: its internal object model. I started a simple C# 5 / .NET 4 project that implements git’s internal object model as a reusable library.

Git

As a slight change of pace, I will now redirect you to first read Git For Computer Scientists (if you haven’t already) as a basic yet effective crash course on the git object model. This understanding is critical to understanding this blog post and by extension the purpose of the project I created. Seriously, go read it and come back here. I’ll wait.

You’re back now? Good. Do you understand what git’s object model is all about? In summary, it’s composed of a very small set of primitive objects that link together to form a very complicated and powerful revision-controlled file system.

The key concept of git is content-addressable objects, surprisingly. Content-addressable means that a hash is used as a unique key to identify an object based solely on its contents. The hash must be generated by concatenating all content stored in the object together in a predictable and consistent manner, and then running a well-known cryptographically-secure hashing function (such as SHA-1) over that concatenation. “Cryptographically secure” may not be the best name for such a requirement; the desirable qualities are that the hash function produces a very good distribution and preserves the entropy of the data in its hash, so that two different objects are overwhelmingly unlikely to end up with the same key.
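To make content addressing concrete, here is a minimal sketch in Python (used only for brevity; it is not IVO's C# API). The `object_id` helper is a hypothetical name, and the `<type> <length>\0` header mirrors the way git frames an object before hashing it; IVO's exact serialization may differ.

```python
import hashlib


def object_id(kind: str, payload: bytes) -> str:
    """Return the hex SHA-1 of a git-style framed object.

    The "<kind> <length>" plus NUL-byte header follows git's convention;
    treat it as an assumption, not as IVO's actual format.
    """
    header = f"{kind} {len(payload)}\0".encode("ascii")
    return hashlib.sha1(header + payload).hexdigest()


a = object_id("blob", b"hello world\n")
b = object_id("blob", b"hello world\n")
c = object_id("blob", b"hello world!\n")

print(a == b)  # identical content always yields the identical key
print(a == c)  # any change in content yields a different key
```

Because the key is derived from nothing but the contents, two stores that have never talked to each other will still agree on the identity of the same object.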

IVO

The project I have created is named IVO (pronounced /ee’-vho/), short for Immutable Versionable Objects. It is hosted on github here. Sorry, I’m bad at naming projects. I want to call everything either Orion or some combination of the names that make up the project’s uniqueness.

As I stated above, the purpose of this project is to implement git’s internal object model in a completely reusable and implementation-flexible way. The current implementation as I write this is based on persisting the object data in a SQL Server 2008 database (currently using SQL Express as a development environment). The API is defined with interfaces so any back-end provider (including regular filesystem storage like how git does it) can be implemented with ease. In fact, implementations should be cross-compatible with one another so you can dump a SQL database containing a repository into a set of flat files or vice versa.
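The provider idea might look roughly like the following Python sketch. Everything here (`BlobStore`, `MemoryStore`, `FileStore`, `copy_all`) is a hypothetical name chosen for illustration, not IVO's actual interface, and the in-memory store merely stands in for a database-backed provider.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Iterator, Protocol


class BlobStore(Protocol):
    """Hypothetical provider contract: store bytes, get them back by hash."""

    def put(self, data: bytes) -> str: ...
    def get(self, key: str) -> bytes: ...
    def keys(self) -> Iterator[str]: ...


def _key(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


class MemoryStore:
    """Stands in for a database-backed provider."""

    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = _key(data)
        self._items.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._items[key]

    def keys(self) -> Iterator[str]:
        return iter(list(self._items))


class FileStore:
    """Flat-file provider: one file per object, named by its hash."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def put(self, data: bytes) -> str:
        key = _key(data)
        path = self._root / key
        if not path.exists():
            path.write_bytes(data)
        return key

    def get(self, key: str) -> bytes:
        return (self._root / key).read_bytes()

    def keys(self) -> Iterator[str]:
        return (p.name for p in self._root.iterdir())


def copy_all(source: BlobStore, target: BlobStore) -> int:
    """Dump every object from one provider into another, preserving keys."""
    count = 0
    for key in source.keys():
        if target.put(source.get(key)) != key:
            raise ValueError(f"providers disagree on the key for {key}")
        count += 1
    return count


db = MemoryStore()
k = db.put(b"some file contents")
files = FileStore(Path(tempfile.mkdtemp()))
copied = copy_all(db, files)
```

Cross-compatibility falls out of content addressing: as long as both providers hash the same bytes the same way, an object keeps its key wherever it is stored.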

Asynchrony and Immutability

Consistent with my latest kick for asynchronification, I’ve designed the IVO API to be entirely Task-based using the TPL (Task Parallel Library; `System.Threading.Tasks`) part of the .NET 4 framework. The SQL Server 2008 back-end implementation executes all its database queries asynchronously using SqlCommand’s `BeginExecuteReader` / `EndExecuteReader` standard async pattern.

Hopefully, the Task-based API is easily degradable to synchronous execution where that is desired, but why would you not want it? :) I’m fairly certain it’s easier to downgrade asynchronous to synchronous than the inverse.
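As an illustration of why asynchronous-first is the easier direction, here is a small Python `asyncio` sketch (an analogy only; in .NET the equivalent of the blocking wrapper would be waiting on the returned `Task`). The function names are hypothetical and the sleep stands in for a real database round trip.

```python
import asyncio


async def get_blob_async(key: str) -> bytes:
    """Hypothetical asynchronous lookup; the sleep stands in for real I/O."""
    await asyncio.sleep(0.01)
    return f"contents of {key}".encode("utf-8")


def get_blob(key: str) -> bytes:
    """Synchronous facade: block the caller until the async call completes."""
    return asyncio.run(get_blob_async(key))


async def get_many(keys: list[str]) -> list[bytes]:
    """Async callers can overlap several lookups instead of waiting in turn."""
    return await asyncio.gather(*(get_blob_async(k) for k in keys))


one = get_blob("abc123")
many = asyncio.run(get_many(["a", "b", "c"]))
```

The synchronous facade is a one-liner on top of the asynchronous function. Going the other way, from a blocking function to a truly non-blocking one, would require rewriting the I/O underneath it.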

Consistent with the asynchronous nature of the API, and regardless of object model logical immutability, all CLR objects that represent part of the object model are immutable data structures in order to be asynchrony-friendly. Nested Builder objects are used to construct a mutable object as a work-in-progress (e.g. when being constructed from a persistence store), which is then converted to its parent immutable object type.
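The nested-Builder pattern can be sketched as follows in Python. The `Blob` type and its members are hypothetical illustrations of the pattern, not IVO's actual classes.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Blob:
    """Immutable value: safe to share between concurrent tasks."""

    contents: bytes

    class Builder:
        """Mutable work-in-progress, e.g. while reading from a store."""

        def __init__(self) -> None:
            self.contents = bytearray()

        def append(self, chunk: bytes) -> "Blob.Builder":
            self.contents.extend(chunk)
            return self

        def build(self) -> "Blob":
            # Copy into immutable bytes so later builder changes cannot leak in.
            return Blob(bytes(self.contents))


builder = Blob.Builder().append(b"hello, ").append(b"world")
blob = builder.build()
builder.append(b"!")  # does not affect the already-built blob

try:
    blob.contents = b"changed"  # rejected: the dataclass is frozen
    mutated = True
except FrozenInstanceError:
    mutated = False
```

The builder is the only place where mutation happens, and it never escapes into code that expects a finished object.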

As for object model logical immutability, data is only ever added to the system in the form of new records (and records can be removed when proven orphaned). Data is never updated in-place except where that object model is considered logically mutable, e.g. a `ref`. As a general rule, all content-addressable objects are immutable and are never updated after their original creation.
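A toy model of these rules, again in Python with hypothetical names: records are only ever added, a ref is the one thing updated in place, and a record is removed only once no ref can reach it.

```python
import hashlib
import json


class Repository:
    """Toy repository; every name here is hypothetical."""

    def __init__(self) -> None:
        self.objects: dict[str, dict] = {}  # content-addressed, add-only
        self.refs: dict[str, str] = {}      # logically mutable pointers

    def add(self, contents: str, parents: tuple[str, ...] = ()) -> str:
        record = {"contents": contents, "parents": list(parents)}
        raw = json.dumps(record, sort_keys=True).encode("utf-8")
        key = hashlib.sha1(raw).hexdigest()
        self.objects.setdefault(key, record)  # never overwrite a record
        return key

    def set_ref(self, name: str, key: str) -> None:
        if key not in self.objects:
            raise KeyError(key)
        self.refs[name] = key  # the only in-place update in the system

    def prune(self) -> set[str]:
        """Remove records that are not reachable from any ref."""
        reachable: set[str] = set()
        pending = list(self.refs.values())
        while pending:
            key = pending.pop()
            if key not in reachable:
                reachable.add(key)
                pending.extend(self.objects[key]["parents"])
        orphans = set(self.objects) - reachable
        for key in orphans:
            del self.objects[key]
        return orphans


repo = Repository()
first = repo.add("v1")
second = repo.add("v2", parents=(first,))
stray = repo.add("abandoned draft", parents=(first,))
repo.set_ref("main", first)
repo.set_ref("main", second)  # moving the ref rewrites no object
removed = repo.prune()        # only the unreferenced draft is dropped
```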

Solutions

Once you fully understand the power of the object model implemented, you can solve all sorts of interesting problems that used to be considered hard, like implementing a content-management system, a remote filesystem synchronization system, a source control system, a document-tracking system, a historical record tracking system, and yes, even a revision controlled native file system driver (although probably not easily achievable via .NET)! The possibilities are nearly endless. What’s apparent to me is that this object model lends itself well to the design of systems that have to deal with revision control or synchronization, or a combination of the two. I’m sure there are many other problems that this can help solve that I’m not aware of.

You don’t even have to expose the gritty internal details of commits, trees, blobs and all that jazz to the end user. You can implement it transparently on their behalf. Then when it comes time to handle that complicated merge operation, you can present the user with “Oh look! I kept track of all your revisions! How would you like me to merge your changes for you?” It’s my opinion that the more a (savvy) user knows about how a system is implemented, the better he/she can take advantage of it.

Distributed Workflow

A distributed workflow model is a virtually “free” benefit of this object model as well. You can have multiple users of your system working simultaneously and independently of one another. The only difficulty is in the implementation of a semantic diff/merge utility specific to the kind of data you’re working with. You will quickly find that you need to provide users the ability to merge their “branches” together to produce one common main-line branch that represents the accepted state of the system. To do that, you’ll need that diff/merge utility to handle combining independent changes. The object model can easily track your merge commit with multiple parent commit IDs, but how you merge the actual content is entirely up to you.
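The split between tracking a merge and performing it can be sketched like this in Python. The `Commit` type, the `merge` function, and the deliberately naive `union_of_lines` strategy are all hypothetical; a real system would plug in a merge utility that understands its own data.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Commit:
    contents: str
    parents: tuple[str, ...] = ()

    @property
    def id(self) -> str:
        # The ID covers the parents too, so history is part of the identity.
        text = "\n".join(self.parents) + "\n\n" + self.contents
        return hashlib.sha1(text.encode("utf-8")).hexdigest()


def merge(ours: Commit, theirs: Commit,
          merge_contents: Callable[[str, str], str]) -> Commit:
    """Record a merge; merging the contents is delegated to the caller."""
    return Commit(merge_contents(ours.contents, theirs.contents),
                  parents=(ours.id, theirs.id))


def union_of_lines(a: str, b: str) -> str:
    """Naive stand-in for a domain-specific merge: keep each line once."""
    merged = list(dict.fromkeys(a.splitlines() + b.splitlines()))
    return "\n".join(merged)


base = Commit("title: Home")
alice = Commit("title: Home\nauthor: alice", parents=(base.id,))
bob = Commit("title: Home\ntags: news", parents=(base.id,))
merged = merge(alice, bob, union_of_lines)
```

The object model's part is trivial: the merge commit simply lists both parent IDs. All of the real difficulty lives in the `merge_contents` callback.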

Summary

In summary, IVO is a framework for solving a larger problem that wants to handle revision control or synchronization in a powerful and flexible manner. I, myself, am writing a web content management system based off of it named, quite naturally, IVO-CMS. That will be the subject of future blog posts. I felt I should blog about IVO first as a platform for introducing IVO-CMS. I’m implementing and improving the two in tandem so IVO will get more API usability benefits as a result of implementing IVO-CMS along the way.