IVO-CMS – Part 2 – The New

Any good system architecture is based on the concept of layering. A basic premise of layering is that one layer should not concern itself with the details of any other layer. With the proprietary CMS described in Part 1, my failure to realize that fact was the critical design and implementation flaw of the system. The implementation of the system is obsessed with the revision control part. However, with IVO-CMS, I’ve designed the content management system aspect to be ignorant of the revision control system. Think of it as a CMS wrapped in a revision control system. The entire CMS system itself can still function with a non-versionable implementation of the revision control system.

This is possible because the meat of the revision control system is simply a file system that stores blobs. Virtually any system can be designed with such an organization mechanism. The contents of the blobs and their relation to one another are what the design of the CMS is concerned with.

We’re no longer limited by the relational schema of a database. Our data structures are simply serialized to and from blobs stored in the revision controlled file system.

IVO-CMS uses XML as its serialization formation in its blobs. This is a natural choice because of the ability for an HTML5 document to be serialized as XML. HTML is our primary output for a web content management system, so a clean ability to manage and output it must be central to the design of the system.

IVO-CMS does not define any traditional CMS concepts at its core. Things like “page”, “content”, “navigation”, etc. are never mentioned. At its core, IVO-CMS is simply an HTML renderer with an extensible content processing engine.

The most basic concept at play is the blob, for lack of a better term. A blob is the recipe, written in XML, for rendering an HTML document fragment or even a complete document. IVO-CMS’s blob maps directly onto IVO’s blob.

The content rendering engine for IVO-CMS simply starts up a streaming XML reader on a blob and copies the XML elements read directly to the output, with a defined path for handling custom elements.

All custom processing elements of IVO-CMS start with ‘cms-’. The most basic processing elements built-in are:

  • cms-import
  • cms-import-template
    • cms-template
    • cms-template-area
    • cms-area
  • cms-scheduled
  • cms-conditional
  • cms-link

Any XML element that starts with ‘cms-’ is sent to a pluggable provider model to be parsed and handled.

Let’s start with cms-import. When a cms-import element is found in a blob, it should have the form <cms-import path=”/absolute/path/to/another/blob” /> or <cms-import path=”../relative/path/to/../../another/blob” />. Both absolute and relative paths are allowed to describe the location of the blob to import. The imported blob is sent through the content rendering engine and its output is directly injected into the output of the currently rendering blob. The relative path is relative to the currently rendering blob’s absolute path.

An imported blob must be a valid document fragment with one or many root elements that are fully closed. In other words, it cannot contain any unclosed elements which makes its usefulness in rendering partial HTML content limited. This is why cms-import-template was invented.

Think of cms-import-template as importing a template which has areas that can be overridden. This is analogous to the Page/Master Page concept of ASP.NET’s web forms. The page is the currently rendering blob and the master page is the imported template blob. Only certain blobs may be imported as templates – those that contain a single root element: cms-template. Unlike ASP.NET’s web forms, multiple templates may be imported into a single blob and templates may even import each other.

The cms-template blob may contain templateable areas with cms-template-area, uniquely identified with an ‘id’ attribute. The blob importing the template may override these template areas’ contents with a cms-area element and an ‘id’ attribute that matches the template. The order the cms-areas are defined in is important since all XML elements are processed in a streaming fashion and there is no back-filling of content.

Now we come to cms-scheduled. This is an element that allows part of a blob (or the entire thing, if so desired) to be rendered on a scheduled basis. It must first contain some <range from=”date” to=”date” /> elements that define the date ranges when the <content> element should be rendered. An <else> element may also be present to render content for when the current date/time does not fall into any of the date ranges.

Next up is the cms-conditional element which can primarily be used for selectively targeting content to specific audiences. It presents the content author with a simple system of if/else-if/else branching logic to determine which content to render for whichever audience. The inner elements are <if>, <elif>, and <else>. The attributes on the <if> and <elif> elements make up the conditional expressions.

The system evaluates the conditional expressions (a dictionary of key/value pairs pulled directly from the element attributes) to a single true/false value by using a “conditional provider” class. This class may be a custom implementation provided by the site implementer since it is best left up to him/her to define exactly how audiences may be defined and evaluated based on the user that the content should be rendered for.

However, that may be asking too much of the site implementer because it would potentially involve defining a domain-specific language for evaluating expressions. I may provide a default implementation that allows for defining complex boolean logic expressions, e.g. <if expr=”(role = ‘manager’) and (dept = ’23′)”>Hello, managers in dept 23!</if>. The values of variables ‘role’ and ‘dept’ would be provided by a provider model implementation that the site implementer could more easily develop.

Finally, the cms-link element is responsible for allowing the content author to simplistically create anchor links (i.e. the <a> tag) to other blobs without having to worry about the details of how the URL gets mapped to the referenced blob. This is primarily for SEO purposes so that you don’t have to force an implementation of your URL rewriting scheme into your site’s content. The site implementer can write a custom provider that takes the linked-to blob’s absolute path and rewrite it into a URL that should pull up that blob as its own page or as a wrapped article page or whatever other linking scheme he/she wishes to implement. This lets your content be internally consistent without worrying about URL details. Changing your SEO strategy for your content should be as simple as rewriting the link provider.

Now that we have the nitty-gritty details on how the content rendering engine and its basic processing elements work we can talk about how such a low-level engine can be integrated with an existing site, but that’s for next time!

IVO-CMS – Part 1 – Mistakes of the Past

For the first time, I’m starting a series of blog posts. This series will focus on a web content management system (IVO-CMS) that I’m currently designing and developing, partly out of curiosity to see if it can be done, partly out of the fun in implementing something new, and partly out of a need to correct my own mistakes of the past. This first post will explain the past mistakes in designing an existing proprietary CMS for my current employer; one you’ve never heard of, never seen, and never used due to its proprietary nature. I will tell you all this in order to give you context around how the design of the new system was influenced by the failures of the current one.

Let me start out by describing the general system architecture and then demonstrating the weaknesses of such a system. Ironically, this blog post is probably the most accurate documentation of this system’s architecture that currently exists.

Since the system allows each piece of content it tracks to be revision-controlled, I started out by designing a rather simplistic revision control system.

Each versionable object domain is represented in the database by a pair of tables: the “versioned root” table and the version table. What I have termed the “versioned root” is nothing more than a conceptually meaningless container that keeps track of the currently published version record by foreign key reference and is used to tie all the versioned records together by a common ancestor/container. Its basic schema is (VersionedRootID int PK, PublishedVersionID int FK, IsPublished bit).

The version table is what contains all of the versionable data that may change over time and have a history kept track of. Its basic schema is (VersionID int PK, VersionedRootID int FK, ProjectID int FK, all other data columns that are versionable).

For instance, let’s say we have XHTMLContentFragment (versioned root table) paired with XHTMLContentFragmentVersion (version table). The XHTMLContentFragmentVersion contains all of the useful system data and the versioned root table is there to tie all of those version records together and assign an identity to each versionable content fragment object.

A project is a container of versioned items; all versioned records point back to a specific ProjectID. Creating a project must be done first before creating any new versions. Think of it as a “changeset” that’s created ahead of time to associate a set of changes that are going to be applied simultaneously. A project is what gets published, and doing so simply bumps all of the PublishedVersionID FKs on the versioned root records up to the VersionedID that is contained in the to-be-published ProjectID. All of this is done in a transaction so as to be atomic. The idea being that the entire live site gets updated at once and that it all goes live or nothing goes live.

In my eagerness to implement this design, I neglected to account for a few implementation details.

Firstly, with this schema, a versionable item is not able to be deleted from the system in an atomic fashion by doing a publish. I did not implement an IsDeleted bit on the versioned root table to keep track of which objects are still alive. Furthermore, I failed to implement a companion table that records the publish history. There is no way to know what previous VersionID was published before the current PublishedVersionID.

Furthermore, the versioned root concept is flawed in that these containers are meaningless. They do not lend themselves to performing cross-system content merging. In fact, they actively hinder such a useful action. Imagine you have several database environments, like development, testing, staging, and production. Content changes in production would need to be merged down into development so that developers have the latest content to edit for system integration purposes (adding links to new features, etc.) while the production content maintainers can continue to make content changes to the live system.

The reason these containers are meaningless is primarily because of the choice to use auto-incrementing database identity values as primary keys. VersionedRootID 132 has no meaning to anyone except the actual versioned records that point back to it as their container. Its identity value 132 represents nothing semantically useful to the system. If I try to merge content from one system’s VersionedRootID 132 to another system’s VersionedRootID 132, that means nothing. I cannot know that those two containers are the same across the two independent systems.

Finally, the nail in the coffin is that I failed to track the parent VersionID when creating a new version. The system simply copies the content from the currently-published version record and makes a new version record with it then lets the user edit that. The lack of a recorded parent VersionID means that there is no way to tell what was published at the time of object creation. Projects may be created by any user of the system at any time and there may be multiple active projects being worked on simultaneously. Without an accurate history, there’s no way to figure out how to accurately merge this project’s change with another project’s change to the same versionable object. For instance, project B gets created based off published version A and project C gets created based off published version A as well. Let’s say that project B gets published first. This means project C is still based off published version A, and not version B. If project C gets published, it effectively overwrites changes that were made in project B. Without recording the parent VersionIDs, we can have, at best, a baseless merge.

In summary, with this CMS we have a weak history in that all the previous project versions are stored, but there’s no way to know in what order each project was published in. We also don’t know whether content from one project was accidentally overwritten with changes in a second project that was created before that first project was published but was then later published. There is no way to delete versioned items. There is no clean way to merge content across independent systems/environments because an accurate history is not kept and the container structure for the versioned items is meaningless.

All of these problems got me thinking about a better solution. That solution came, somewhat surprisingly, with git… but that’s for next time!

Immutable Versionable Objects (IVO) a.k.a. The Git Object Model

Git is a distributed, content-addressable and versionable file system that is primarily used as a revision control system for project source code. That’s great and all, but we’re not really interested in what git does, but more specifically how it does what it does; we’re interested in its internal object model. I started a simple C#(5) .NET 4 project that implements git’s internal object model as a reusable library.

Git

As a slight change of pace, I will now redirect you to first read Git For Computer Scientists (if you haven’t already) as a basic yet effective crash course on the git object model. This understanding is critical to understanding this blog post and by extension the purpose of the project I created. Seriously, go read it and come back here. I’ll wait.

You’re back now? Good. Do you understand what git’s object model is all about? In summary, it’s composed of a very small set of primitive objects that link together to form a very complicated and powerful revision-controlled file system.

The key concept of git is content-addressable objects, surprisingly. Content-addressable means that a hash is used as a unique key to identify an object based solely on its contents. The hash must be generated by concatenating all content stored in the object together in a predictable and consistent manner, and then running a well-known cryptographically-secure hashing function (such as SHA-1) over that concatenation. “Cryptographically secure” may not be the best name for such a requirement, but a hash function that produces a very good distribution and that maintains entropy of the data in its hash are the desirable qualities.

IVO

The project I have created is named IVO (pronounced /ee’-vho/), short for Immutable Versionable Objects. It is hosted on github here. Sorry, I’m bad at naming projects. I want to call everything either Orion or some combination of the names that make up the project’s uniqueness.

As I stated above, the purpose of this project is to implement git’s internal object model in a completely reusable and implementation-flexible way. The current implementation as I write this is based on persisting the object data in a SQL Server 2008 database (currently using SQL Express as a development environment). The API is defined with interfaces so any back-end provider (including regular filesystem storage like how git does it) can be implemented with ease. In fact, implementations should be cross-compatible with one another so you can dump a SQL database containing a repository into a set of flat files or vice versa.

Asynchrony and Immutability

Consistent with my latest kick for asynchronification, I’ve designed the IVO API do be entirely Task-based using the TPL (task parallel library; `System.Threading.Tasks`) part of the .NET 4 framework. The SQL Server 2008 back-end implementation executes all its database queries asynchronously using SqlConnection’s `BeginExecuteReader` / `EndExecuteReader` standard async pattern.

Hopefully, the Task-based API is easily degradable to synchronous execution where that is desired, but why would you not want it? :) I’m fairly certain it’s easier to downgrade asynchronous to synchronous than the inverse.

Consistent with the asynchronous nature of the API, and regardless of object model logical immutability, all CLR objects that represent part of the object model are immutable data structures in order to be asynchrony-friendly. Nested Builder objects are used to construct a mutable object as a work-in-progress (e.g. when being constructed from a persistence store) and is then converted to its parent immutable object type.

As for object model logical immutability, data is only ever added to the system in the form of new records (and records can be removed when proven orphaned). Data is never updated in-place except where that object model is considered logically mutable, e.g. a `ref`. As a general rule, all content-addressable objects are immutable and are never updated after their original creation.

Solutions

Once you fully understand the power of the object model implemented, you can solve all sorts of interesting problems that used to be considered hard, like implementing a content-management system, a remote filesystem synchronization system, a source control system, a document-tracking system, a historical record tracking system, and yes, even a revision controlled native file system driver (although probably not easily achievable via .NET)! The possibilities are nearly endless. What’s apparent to me is that this object model lends itself well to the design of systems that have to deal with revision control or synchronization, or a combination of the two. I’m sure there are many other problems that this can help solve that I’m not aware of.

You don’t even have to expose the gritty internal details of commits, trees, blobs and all that jazz to the end user. You can implement it transparently on their behalf. Then when it comes time to handle that complicated merge operation, you can present the user with “Oh look! I kept track of all your revisions! How would you like me to merge your changes for you?” It’s my opinion that the more a (savvy) user knows how a system is implemented the more he/she can be useful and take advantage of the system.

Distributed Workflow

A distributed workflow model is a virtually “free” benefit of this object model as well. You can have multiple users of your system working simultaneously and independently of one other. The only difficulty is in the implementation of a semantic diff/merge utility specific to the kind of data you’re working with. You will quickly find that you need to provide users the ability to merge their “branches” together to produce one common main-line branch that represents the accepted state of the system. To do that, you’ll need that diff/merge utility to handle combining independent changes. The object model can easily track your merge commit with multiple parent commitids, but how you handle merge the actual content is entirely up to you.

Summary

In summary, IVO is a framework for solving a larger problem that wants to handle revision control or synchronization in a powerful and flexible manner. I, myself, am writing a web content management system based off of it named, quite naturally, IVO-CMS. That will be the subject of future blog posts. I felt I should blog about IVO first as a platform for introducing IVO-CMS. I’m implementing and improving the two in tandem so IVO will get more API usability benefits as a result of implementing IVO-CMS along the way.

Feather: A new language based on asynchrony

I’ve spent most of my free time over the last few weeks in pursuit of designing a new programming language, one designed for asynchrony from the ground up. I call this language “Feather,” in the hope that it will be lightweight, simple, elegant, and just might possibly enable one to fly.

My core goals for this new language are:

  • asynchronous execution by default with explicit mechanisms to revert to synchronous execution.
  • immutable data and no ability to share mutable state between independent threads of execution.
  • static typing and complete type-safety.

Keeping this goal list very short will allow me to actually achieve all of these goals relatively easily with a final, working reference implementation of a compiler and runtime system.

Asynchronous execution:
This is the primary and most important goal of the language. Where possible, functions must be allowed to execute asynchronously. That is, the completion of one function need not depend on the completion of another function. This assumes execution independence of functions. Of course, this will not always be possible in every case, since one function may rely on the computed results of another function in order to complete, creating a dependency. There may also be times where execution of certain functions need to take place in a specific sequence, but in general I believe these to be special cases.

No shared, mutable state:
The main problem with allowing just any random set of functions to execute asynchronously with respect to one other has to do with shared mutable state. For instance, if just two functions can modify the same shared state and both are executing asynchronously without any synchronization between the two, the final effects on the shared state are undefined. The simplest solution to this problem is to disallow the sharing of mutable state between functions.

Combining these two goals as core to the language has never been done before in any programming language I’ve ever used. Sure, other languages support asynchrony and even fewer may support immutable data, but not to the extent of putting them at the core of the language’s design. The concept of defining data as immutable is not the same as forcing that mutable data cannot be shared across threads of execution.

I hope that defining these goals up front justifies the language’s existence enough for me to further purse its design and development. For once, though, I think I’ll end this post before it gets too lengthy. I have much more to talk about in regards to this language and its features than one post can contain. So, until next time!

RabbitMQ .NET client API is terrible (rant)

I’ve seen some pretty poorly-designed APIs in my time, but RabbitMQ’s .NET client API has got to be among the top worst that I’ve seen lately. Specifically, I’m talking about the RabbitMQ.Client library that appears to be officially sanctioned by the RabbitMQ project and “documented”.

The very first problem you’ll quickly see when you get into using this “API” is that everything (e.g. return values and parameters) is typed as a string. Strings are problem enough in and of themselves, so why exacerbate the problem by requiring them everywhere?

Let’s dive in to this API and look at a few examples in the IModel interface:

void ExchangeDeclare(string exchange, string type, bool durable, bool autoDelete, IDictionary arguments);

See that ‘type’ parameter? What does that mean? What is an exchange ‘type’ in this context? Can I create my own types? It turns out, no. This string value is expected to be one of (“direct”, “fanout”, “topic”). How do we know this from the method signature? We don’t! Not a clue. No, we have to pore over the user guide “documentation” to find that we’re expected to use one of the ExchangeType class’s const string members here.

Here would be a perfect candidate for an enum. This is essentially an untyped string “enum”, primarily used when calling ExchangeDeclare function. I haven’t done a full search of the API, but I’d want to say that this is probably the only place these values are used. Not to mention this ExchangeType “static” class isn’t even declared as static. Sure, you can new up an ExchangeType! What would that get you? Nothing. Not only did they do it wrong, they did it wrong the wrong way.

Continuing on, the ExchangeDeclare method returns `void`, like most methods in this API. This is also terrible in that it gives you absolutely no information back as to the success/failure status of the operation; that would be the bare minimum expectation for a reasonably dull API. What would be preferred is a more complex type system that propagates contextual information through its methods. Now there are some good situations to use void-returning methods in an API, but given this context, I don’t think this is one of them.

Why is the void-returning such a big deal? It gives you no information as to what operations should logically follow next after “declaring” an “exchange”. The API designer has expectations on how you, the user, should use this API and has documented elsewhere that the expected pattern of invocation is ExchangeDeclare, QueueDeclare, and QueueBind, but you would never be able to discover this yourself without significant trial-and-error at runtime.

The API itself should expose strong hints as to how it is designed to be used, not the documentation. Documentation should be used to clarify what and/or how a specific piece of functionality works/does. The API should scream at you (via its types) how it is to be used correctly. Documentation should scream at you how it should NEVER be used incorrectly.

What would have been nice is if this ExchangeDeclare method gave back some semblance of an Exchange type that has its own methods, maybe something like `QueueBinding BindQueue(Queue queue, …)`. That also implies there is a Queue type returned from QueueDeclare. All of this typing information should let you logically come to the conclusion that, “Oh, I need both a Queue and an Exchange declared first, and then the Exchange lets me bind queues to it!”

Instead, what we’re given is three seemingly independent methods defined on `IModel` (which is a terrible name for something that’s supposed to mean a channel, BTW) with void-returning semantics that accept only string-typed parameters and the order of operations is left undefined for you to guess at.

There is also absolutely no XML parameter documentation in this API either, which is more-or-less a de-facto standard in the C# world. Now, there are method <summary>s in the XML documentation, but the parameter documentation is entirely missing. Intellisense would have significantly helped out the discoverability of this API has there been any XML documentation on the parameters. Going back to ExchangeDeclare, at least a NOTE saying that this string ‘type’ parameter expects its values to come from the ExchangeType class would have been more useful than saying nothing whatsoever.

string QueueDeclare(string queue, bool durable, bool exclusive, bool autoDelete, IDictionary arguments);

Why would you use the pre-generics IDictionary type with absolutely no explanation as to what types it expects to see in the dictionary, what it is used for, or even if I can safely pass in a null reference if I don’t care right now about extra ‘arguments’? There are MUCH better strongly-typed ways of conveying optional information to an API programmatically than using an IDictionary. This should simply not be done, no matter if it be with the pre-generics framework types or with the generic ones (i.e. IDictionary<TKey, TValue>).

What does this method return? A string? Seriously? What do I do with that? What information is in this string? What other methods are looking for this particular string value? This return value is in its own domain, and I would hope obviously not to be confused with any other `string`-typed parameters in this terrible API. I should NOT have to run the program (in which I don’t even know if I’ve done anything right so far) in order to discover what different domains of string values there are and where each one is expected. This is the entire point of a typing system, whether it be dynamic or static. At least with a dynamic typing system there is some structure around your data. Here we just have strings, the lowest of the low.

void QueueBind(string queue, string exchange, string routingKey);

Now we have a new ‘routingKey’ string parameter. What exactly is the proper formatting of this key? Should I care? Are there any special characters reserved? Why not wrap this up in a RoutingKey struct, at the very least, that’s responsible for constructing only-valid routing keys? Nope, we’re completely on our own to create valid routing key values that should hopefully conform to the established AMQP standard’s definition of a routing key.

string QueueDeclare(string queue, bool durable, bool exclusive, bool autoDelete, IDictionary arguments);

We can declare queues! Great. Any restrictions on the queue name? Is that validated by this API? Again with the IDictionary crap? At least we have some bools now, and they’re not `string`s whose valid set of values would be (“true”, “TRUE”, “false”, “FALSE”) but not “True” or “False”. That last part was a joke, but I certainly wouldn’t be surprised if that were the convention here too. Why not be consistent and just STRING everything? That was also a joke. I hope you’re still with me here. Still, boolean parameters are generally frowned upon.

string QueueDeclarePassive(string queue);

Okay, now apparently there are ‘passive’ queues. I guess that passive queues cannot be defined as durable, exclusive, or autoDelete. What, no extra IDictionary of arguments this time? Does that mean I declare ‘active’ queues with QueueDeclare? What’s the difference?

All these strings everywhere bring me to another point: case-sensitivity. Is “MYQUEUE” different from “myQueue”? Are routing keys case-sensitive? Also, what if I want to use a non-ASCII character in my queue name, like “ಠ_ಠ”? Will that wreak havoc with this system? It seems this is the only face I’m able to make towards this API’s design. Hmm… what if I have some extra whitespace in my string? Are all string values Trim()ed, or just some, or none?

Summary:
I generally don’t like writing public rants, but it seems no one else has had anything negative to say about this API yet. I feel dumber having used this API. I award the designer no points, and may God have mercy on his/her soul. This makes me not want to use RabbitMQ at all.

RunQuery commandline tool

As a useful companion to LinqFilter (which I have given considerable attention to over the last few weeks), I put together a tool called RunQuery, which, you guessed it, runs SQL queries from the console… BUT (and here’s the kicker) it formats the results in an escaped, TAB-delimited format for direct use by LinqFilter.

For tool interoperability, I created (read: quite possibly reinvented) a useful, near-universal data interchange file format which uses TAB characters as column delimiters but also backslash escapes characters in each column to eliminate the possibility of actual TAB chars in the data interfering with the format. The backslash escaping rules are a subset of C#’s own string escaping rules. Also, null-valued columns are rendered as NUL (\0) chars in the format.

This data interchange format is accessible from LinqFilter via the SplitTabDelimited and JoinTabDelimited static methods. These methods are cloned in the RunQuery tool to ensure parity between the two tools. All of this makes it so we can easily exchange tabular data between tools using simple text pipes and/or files without having to worry about escaping, encoding, etc.

When you use RunQuery with default settings, it outputs two rows of header information. First, the column names, and second, the column types. All subsequent rows are column values. No row count is currently reported. There is no limit to number of rows returned either.

Aside from simply executing static SQL queries, RunQuery has a special “batch” mode which can execute a parameterized query whose parameter values are supplied via the stdin pipe. This mode expects the stdin lines to be fed in a particular order and formatting, namely the TAB-delimited encoding described above.

In this batch mode, the first line of input must be a row of parameter names where each column is a parameter name and must begin with a ‘@’ character. The next line of input is a row of columns that represent parameter types. Acceptable values are the enum members of SqlDbType.

After these first two rows of metadata are consumed and validated, all subsequent rows are treated as columns of parameter values. For each row, the query is executed against the given parameter values until EOF is reached.

Example:

$ cat > params
@n
int
1
2
3
^D

$ RunQuery -E -S myserver -d mydatabase -b -q ‘select @n’ < params
—————————————————————————
select @n
—————————————————————————

int
1
2
3

For example, batch mode may be used for streaming query results from one server to another for something simple and naive like a poor man’s data replication system or cross-database queries whose servers are not linked. One can also stream these parameter inputs from a LinqFilter query which may be processing input from a file or perhaps even consuming data from a web service. The possibilities are endless here.

Running RunQuery with no arguments displays some brief usage text.

LinqFilter Code Injection

Unless you’ve been living under a rock, you may have heard of such things as SQL Injection and XSS (cross-site scripting) attacks, just to name the popular forms of the attack. These types of attacks are language specific forms of the general “Code Injection” attack. These types of attacks happen when developers concatenate source code with unchecked user input and get to the stage where that fully trusted source code is eventually executed.

SQL Injection happens when a developer concatenates user input into a SQL Query but the user input may contain special characters like single-quotes which changes the meaning of the query. You could do this in a safe way by either properly escaping the raw user input or passing it in to the SQL Server out-of-band in a parameterized query. But I’m not here to blather on about SQL Injection attacks.

Today I’d like to talk about C# Code Injection, specifically as it applies to LinqFilter.

In LinqFilter, I naively concatenate what should be untrusted user input LINQ code directly into an implicity trusted C# code template embedded in the program. I invoke the C# compiler provider on that resulting string and execute the code in the compiled assembly. By concatenating unchecked user input into an existing code template, I’ve opened the door to C# Code Injection. The user has numerous ways of defeating the requirement that the query be a valid C# 3.5 expression.

One can easily terminate the expression with a ‘;’ and go on to write regular C# statements. LinqFilter will naively concatenate your code that it assumes is an expression into the code template and compile it. The compiler doesn’t care what happened so long as the code compiles as a valid C# program. This defeats the purpose (somewhat) of the LinqFilter tool so it’s generally not wise to do so and you’d only be hurting yourself, unless of course you’re executing someone else’s malicious LinqFilter scripts.

What I’d really like to do is put in a safety check to validate that the user input is indeed a C# 3.5 expression. This involves getting a parser for the C# 3.5 language, not an easy task. All I need to do is simply validate that the input parses correctly as an expression and only an expression. I don’t need to do any type checking or semantic analysis work; the compiler will handle that. I just need to know that the user can’t “escape out” of the expression boundaries.

Unfortunately, there are no prepackaged C# parsers that I could find. This means using one of the plethora of available parser-generator toolkits or writing my own. Neither options are sounding very attractive to me at the moment so I’ll probably just hold off entirely on the work since it provides no benefit to myself. I know how to use LinqFilter effectively and there’s clearly no community around it (yet) that are sharing scripts so there’s really no security problem. If someone else is concerned enough about the problem, I gladly welcome patches :) .

I think I’d also like to figure out how the CAS (code access security) feature works and incorporate that into the generated code so that queries are executed in a safe context and won’t be going out and destroying your files by invoking File.Delete() or something sinister, but this is a low priority as well.

C# Case Expression via Extension Methods

As a veteran C# developer yourself, I’m sure you’re familiar with the `switch` statement. Since it is a statement this means you cannot effectively use this ever-so-useful construct in an expression, such as in a LINQ query. This is a shame, and it irks me greatly that I have to resort to emulating the switch behavior with a series of chained ternary operators (a ? b : c ? d : e ? f : g …) in LINQ. Lucky for you, I am more susceptible to NIH than any man alive. I felt the need to investigate options in making a functional `case` expression for C#, or an equivalent cheap look-alike :) .

LINQ is great at expressing basic data transformation operations like joins, groupings, projections, etc., but it’s not so great at conditional processing. A foreach loop with a switch statement would be a better engine for this type of task, but frankly sometimes it’s just too tempting to start off by writing a LINQ query to get the job done. Using LINQ brings you the benefits of not having to worry about implementation details while also not increasing your bug surface area when you implement these basic transformations wrong.

Let’s look at an example naive query that needs to do conditional processing:

from line in lines
where line.Length > 0
let cols = line.Split('\t')
where cols.Length == 2
let row = new {
  // 'A' = added, 'D' = deleted, 'M' = modified
  operation = cols[0].Trim(),
  data = cols[1].Trim()
}
let added = "added " + row.data + " " + File.ReadAllLines(row.data).Length.ToString()
let deleted = "deleted " + row.data + " " + File.ReadAllLines(row.data).Length.ToString()
let modified = "modified " + row.data + " " + File.ReadAllLines(row.data).Length.ToString()
select (row.operation == "A") ? added
  : (row.operation == "D") ? deleted
  : (row.operation == "M") ? modified
  : String.Empty

This works but is wasteful in terms of processing since we’re always generating the `added`, `deleted`, and `modified` strings regardless of the input condition. The more dependent variables you introduce, the more waste the query has to generate and select out.

What we really want here is a switch statement, but a statement cannot belong in an expression. Expressions may only be composed of other expressions. Let’s see how this query transforms when I introduce my Case() extension method that I wrote for LinqFilter:

from line in lines
where line.Length > 0
let cols = line.Split('\t')
where cols.Length == 2
let row = new {
  // 'A' = added, 'D' = deleted, 'M' = modified
  operation = cols[0].Trim(),
  data = cols[1].Trim()
}
select row.operation.Case(
  () => String.Empty,                           // default case
  Match.Case("A", () => "added " + row.data + " " + File.ReadAllLines(row.data).Length.ToString()),
  Match.Case("D", () => "deleted " + row.data + " " + File.ReadAllLines(row.data).Length.ToString()),
  Match.Case("M", () => "modified " + row.data + " " + File.ReadAllLines(row.data).Length.ToString())
)

Through the use of generic extension methods and lambda expressions, we’re able to get a lot of expressibility here. This overload of the Case() extension method accepts first a default case lambda expression which will only be invoked when all other cases fail to match the source value, which in this case is `row.operation`’s value. What follows is a `params CaseMatch<T, U>[]` which is C# syntactic sugar for writing something along the lines of `new CaseMatch<T, U>[] { … }` at the call site.

These `CaseMatch<T, U>`s are small containers that hold the case match value and the lambda expression to invoke to yield the result of the case expression if the match is made. We use lambdas so that the expression to return is not evaluated until the match is made. This prevents unnecessary work from being done or causing side effects. Think of it as passing in a function to be evaluated rather than hard-coding an expression in the parameter to be evaluated at the call site of the Case() extension method. There are two generic arguments used: `T` and `U`. `T` represents the type you are matching on and `U` represents the type you wish to define as the result of the Case() method. Just because you are matching on string values doesn’t mean you always want to return a string value from your case expressions. :)

A small static class named `Match` was created which houses a single static method `Case` in order to shorten the syntax of creating `CaseMatch<T,U>` instances. Since static methods can use type inference to automagically determine your `T` and `U` generic types, this significantly shortens the amount of code you have to write in order to define cases. Otherwise, you would have to write `new CaseMatch<string, string>(“A”, () => “added” + row.data)` each time. Which looks shorter/simpler to you?

When you call the Case() extension method, the `CaseMatch<T,U>` params array is processed in order and each test value is compared against the source value which Case() was called on. If there is a match, the method returns the evaluated lambda for that case. There is no checking for non-unique test values, so if you repeat a case only the first case will ever receive the match. It is an O(n) straightforward algorithm and does no precomputation or table lookups.

Another overload of Case() is available for you to provide an `IEqualityComparer<T>` instance. This is a big win over the switch statement IMO, which to the best of my knowledge does not allow custom equality comparers to perform the matching logic and is limited to the behavior set forth by the C# language specification.

With this ability, you could specify `StringComparer.OrdinalIgnoreCase` in order to do case-insensitive string matching, something not easily/safely done with the switch statement. The ability to supply a custom IEqualityComparer<T> also opens up the set of possibilities for doing case matches on non-primitive types, like custom classes and structs that would not normally be able to be used in a switch statement.

In order to play with this extension method for yourselves, either browse my LinqFilter SVN repository and download the code from the LinqFilter.Extensions project (CaseMatch.cs and Extensions/CaseExtensions.cs), or download the LinqFilter release and play with it in your LINQ queries.

LinqFilter: Run LINQ Code From The Command Line Interface!

Having recently acquired a taste for using git on Windows with msysGit, I’ve been getting a lot more productive with my use of bash and other command-line tools in Windows. Shifting data around on the command line gets pretty hairy very quickly. Unfortunately, the basic set of Un*x utilities that process text data is just not powerful/flexible enough and usually each tool has some ridiculous custom syntax to learn, all of them different. I already know a  language powerful enough to process text efficiently, succinctly, and cleanly: LINQ! So I thought to myself, why not take advantage of LINQ to write simple little one-off text-processing scripts? Creating a new console application every time to handle this task becomes arduous, to say the least. Enter: LinqFilter!

For the impatient, you may download the latest release of the tool here (ZIP download). If this tool ever becomes popular enough, I have no problems hosting it elsewhere.

LinqFilter is, in a nutshell, a way to dynamically compile user-supplied C# v3.5 LINQ code and execute it instantly, sending the resulting item strings to Console.Out delimited by newlines or custom delimiters. An input IEnumerable<string> named lines is provided to allow the query to read lines from Console.In. There are many command-line options available to customize how LinqFilter behaves.

Let’s take the following example LINQ query:

LinqFilter -q "from line in lines select line"

This is a simple echo query. It will echo all lines read in from Console.In to Console.Out and it will do so in a streaming fashion. There is no storage of lines read in or written out. As a line comes in, it is run through the query and written out.

The “-q” command-line option appends a line of code to the ((QUERY)) buffer. You could supply multiple lines of code by supplying multiple “-q” options in order.

How does this work? The LinqFilter tool basically concatenates your query code into the following abbreviated class template:

public static class DynamicQuery {
    public static IEnumerable<string> GetQuery(IEnumerable<string> lines, string[] args) {
        ((PRE))
        IEnumerable<string> query = ((QUERY));
        ((POST))
        return query;
    }
}

The ((QUERY)) token is your query expression code. The ((PRE)) token is replaced with lines of C# code you supply in order to do one-time, pre-query setup and validation work. The ((POST)) token is replaced with lines of C# code you supply that takes effect after the query variable is assigned. This section is rarely used but is there for completeness.

As you can see, the query is enclosed in a simple static method that returns an IEnumerable<string>. The host console application supplies the lines from Console.In, but your query is not required to use that and can source data from somewhere else, or make up its own. :)

The args parameter is used to collect command-line arguments from the “-a <argument>” command-line option so that the query may be stored in a static file yet still use dynamic data passed in from the command line.

Let’s look at an example with a ((PRE)) section:

LinqFilter -pre "if (args.Length == 0) throw new Exception(\"Need an argument!\");" -q "from line in lines where line.StartsWith(args[0]) select line.Substring(args[0].Length)" -a "Hello"

Here we put in a full C# statement in the ((PRE)) section via the “-pre” command-line option to handle validation of arguments. The query itself is a simple filter to only return lines that start with args[0], i.e. “Hello”.

The best feature of the tool is the ability to store your queries off into a separate file and use the “-i” parameter to import them. Let’s leave that for another time.

In the mean time, I encourage you to download the tool and explore its immense usefulness. I must have written 30 or so one-off queries by now. I find new uses for it every day, which makes it a fantastic tool in my opinion and I’m very glad I took the time to write it. I hope you enjoy it and find it just as useful as I have!

P.S. – if you ever get lost, just type LinqFilter –help on the command line with no arguments and a detailed usage text will appear. :)

Immutable DataContract generator for Visual Studio 2008 T4

At a first glance, using WCF appears to limit one’s capabilities in working with immutable data structures, namely immutable data contracts (classes decorated with [DataContract] and [DataMember]) attributes. After some thought and a little experimentation, I came to a reasonable solution implemented in T4 where one can code-generate the immutable data structure and its related mutable class builder structure used to construct said immutable data contract instances.

First, let me demonstrate and explain a bit of the basic code pattern behind immutable data contracts before we move onto the T4 solution. Hopefully, the code generation opportunities may strike you as obvious as they did to me.

[DataContract(Name = "DataContract1")]
public class DataContract1
{
    [DataMember(Name = "ID")]
    private int _ID;
    [DataMember(Name = "Code")]
    private string _Code;
    [DataMember(Name = "Name")]
    private string _Name;

    private DataContract1() { }

    public int ID { get { return this._ID; } }
    public string Code { get { return this._Code; } }
    public string Name { get { return this._Name; } }

    public sealed class Mutable
    {
        public int ID { get; set; }
        public string Code { get; set; }
        public string Name { get; set; }
    }

    public static implicit operator DataContract1(Mutable mutable)
    {
        var imm = new DataContract1();
        imm._ID = mutable.ID;
        imm._Code = mutable.Code;
        imm._Name = mutable.Name;
        return imm;
    }
}

This is the very basic, bare-bones code for a fully-functional immutable data contract. WCF, requiring full-trust access, will serialize and deserialize to/from the private fields marked with the [DataMember] attributes at runtime using reflection in its DataContractSerializer.

Note that public instantiation of the immutable DataContract1 class is strictly forbidden as we only have a default private constructor. No one may access our private fields either. The only exposed public properties are read-only.

Now let’s look at the nested Mutable class. This is what you would normally expect to see a regular mutable DataContract implemented as, except it has no [DataContract] attribute on its class declaration nor any [DataMember] attributes decorated on its public properties, so we cannot use it as a WCF data contract (well, strictly speaking, we could, but it wouldn’t be recommended as it has virtually no metadata exposed for it).

The real interesting part is this implicit conversion operator defined in the main immutable DataContract1 class. This is a nary-used, albeit powerful feature of C#. One may implement an implicit or explicit conversion operator to allow two classes or structs to be convertible between one another (or only in one direction) using either an implicit cast-like behavior or an explicit one at the “call site”. A conversion operator operates much like a traditional type cast. You’ve seen these everywhere. It’s when you do `object a; string b = (string)a;` That (string)a expression is a unary (takes one operand) cast expression. It tells the compiler that you, as the developer, knows for absolute sure that the object stored in `a` is of type `string` and to store the string reference value in the `b` variable.

An explicit conversion operator is closest in appearance to the cast expression. The difference between implicit and explicit is that implicit does not require you to write a cast expression at the conversion site where the conversion is required.

In our case with the immutable (DataContract1) and mutable (DataContract1.Mutable) classes, our implicit conversion operator allows us to “convert” the mutable builder class into a new instance of the immutable class. This will appear completely transparent to the developer, which is a nice feature.

A usage example of this pattern would be as follows:

class Program
{
    static DataContract1 CreateDC1()
    {
        return new DataContract1.Mutable()
        {
            ID = 27,
            Code = "JSD",
            Name = "James S. Dunne"
        };
        // Notice that the `DataContract1.Mutable` reference in the return statement is implicitly converted to
        // the required `DataContract1` return type via the static implicit operator we defined.
    }

    static void Main(string[] args)
    {
        var dc1 = CreateDC1();
        WriteDC1(dc1);
    }

    static void WriteDC1(DataContract1 dc1)
    {
        Console.WriteLine(dc1.ID.ToString());
        Console.WriteLine(dc1.Code);
        Console.WriteLine(dc1.Name);

        dc1.Name = "Bob Smith";              // ERROR! Cannot modify because the property is read-only.
    }
}

This small program is demonstrating how one would use the Mutable nested builder class to construct an immutable instance from. Creating immutable instances of objects should be done once and only once. From there it is impossible to modify (mutate) them further, unless the types of your properties exposed are mutable themselves.

This immutability guarantee is therefore shallow, and it is up to you to guarantee deep immutability through all of your exposed types. For instance, if you expose a List<T> property via this immutable structure, other functions are free to modify your List by adding, removing, and modifying elements within it, but they cannot change the property to point to a wholly different List<T> reference. For this reason, it is recommended to make use of the .NET framework’s read-only immutable collection classes for exposing lists within your data contract.

As a matter of self-discipline, I avoid embedding List<T> and other collection types within my data contracts. I also keep my data contracts free of object references, thereby reducing the potential object graph to a degenerate case of a single object.

T4 Solution

Now that we see the basic immutable data contract pattern and how to use it, let’s explore opportunities to create a T4 code generation solution to make our lives easier so that we don’t have to type up all this boilerplate for our multitudes of data contracts that we have to expose over our service layer and use in our business logic.

[Immutable, DataContract(Name = "DataContract1")]
public partial class DataContract1
{
    /// <summary>
    /// The unique identifier of the data contract.
    /// </summary>
    [DataMember(Name = "ID")]
    private int _ID;

    [DataMember(Name = "Code")]
    private string _Code;

    [DataMember(Name = "Name")]
    private string _Name;
}

This is all the metadata and implementation code that the developer should be required to write in order to help the T4 generate the rest of the boilerplate. The T4 template will generate an output C# source file looking something like this:

public partial class DataContract1
{
    #region Constructors

    private DataContract1()
    {
    }

    public DataContract1(
        int ID,
        string Code,
        string Name
    )
    {
        this._ID = ID;
        this._Code = Code;
        this._Name = Name;
    }

    #endregion

    #region Public read-only properties

    /// <summary>
    /// The unique identifier of the data contract.
    /// </summary>
    public int ID
    {
        get { return this._ID; }
    }

    public string Code
    {
        get { return this._Code; }
    }

    public string Name
    {
        get { return this._Name; }
    }

    #endregion

    #region Mutable nested builder class

    /// <summary>
    /// A mutable class used to assign values to be ultimately copied into an immutable instance of the parent class.
    /// </summary>
    public sealed class Mutable : ICloneable
    {
        #region Public read-write properties

        /// <summary>
        /// The unique identifier of the data contract.
        /// </summary>
        public int ID { get; set; }
        public string Code { get; set; }
        public string Name { get; set; }

        #endregion

        #region ICloneable Members

        /// <summary>
        /// Clones this Mutable instance to a new Mutable instance using simple property assignment.
        /// </summary>
        public object Clone()
        {
            return new Mutable()
            {
                ID = this.ID,
                Code = this.Code,
                Name = this.Name,
            };
        }

        #endregion
    }

    #endregion

    #region Conversion operators

    /// <summary>
    /// Copies the values from the Mutable instance into a new immutable instance of this class.
    /// </summary>
    public static implicit operator DataContract1(Mutable mutable)
    {
        var imm = new DataContract1();
        imm._ID = mutable.ID;
        imm._Code = mutable.Code;
        imm._Name = mutable.Name;
        return imm;
    }

    /// <summary>
    /// Copies the values from this immutable instance into a new Mutable instance.
    /// </summary>
    public static explicit operator Mutable(DataContract1 immutable)
    {
        var mutable = new Mutable();
        mutable.ID = immutable._ID;
        mutable.Code = immutable._Code;
        mutable.Name = immutable._Name;
        return mutable;
    }

    #endregion
}

Notice that this is a lot more verbose that the abridged version posted for demonstration purposes at the top of this blog.

How does this all work? The T4 template makes use of EnvDTE, the namespace that Visual Studio exposes for doing in-editor code refactoring work, intended primarily for use by Visual Studio add-ins. The project does not have to be compiled in order for this to work.

I decided against using the Microsoft.Cci or regular reflection APIs since these require a compiled assembly in order to do their work. Since this pattern has to be implemented in a single type in a single assembly, neither of those are acceptable solutions. The alternative is to have the developer design their data contracts in some other format that could be parsed by the T4 template and have it generate all the code on behalf of the developer. Since this solution would raise the learning curve already high enough for immutable data structures, this is thrown out as well.

Having the developer code up part of the immutable data contract him/herself is beneficial in that the developer has some control over the class itself and is able to add methods, fields, and properties (with some restrictions) yet also gain the benefits of code generation provided by the T4 template.

Of course, to maintain true immutability, it doesn’t make much sense to have any publicly writable properties. The T4 template imposes some limitations like this on the existing partial class code. The T4 template also heavily relies on naming conventions so those are checked and enforced before generating a single line of code. T4 will generate descriptive and helpful warnings in your Error List pane if it detects anything is awry with the existing partial class code.

Download the code

My code for the T4 template is housed within the following public SVN repository:

Get it here (svn://bittwiddlers.org/WellDunne/trunk/public/ImmutableT4). Or browse the repository over the web from here.

The file is named ImmutablePartialsGenerator.tt and lives in the project folder named ImmutableT4 under the main solution folder. Browse the latest version here.

Note that the T4 template requires Visual Studio’s EnvDTE namespace to have a complete view of the code, so if it does not work within the first moments of opening the project, do not be surprised. Give it time to parse over your code and build a complete model and then try the T4 template again. This is a limitation of Visual Studio and not of this template.