Skip to main content

Immutable Hash Array Mapped Trie in C#

I just completed an implementation of an immutable hash array mapped trie (HAMT) in C#. The HAMT is an ingenious hash tree first described by Phil Bagwell. It's used in many different domains because of its time and space efficiency, although only some languages use the immutable variant. For instance, Clojure uses immutable HAMTs to implement arrays/vectors which are essential to its concurrency.

The linked implementation is pretty much the bare minimum supporting add, remove and lookup operations, so if you're interested in learning more about it, it's a good starting point. Many thanks also to krukow's fine article which helped me quickly grasp the bit-twiddling needed for the HAMT. The tree interface is basically this:

/// <summary>
/// An immutable hash-array mapped trie.
/// </summary>
/// <typeparam name="K">The type of keys.</typeparam>
/// <typeparam name="T">The type of values.</typeparam>
public class Tree<K, T> : IEnumerable<KeyValuePair<K, T>>
{
    /// <summary>
    /// The number of elements in the tree.
    /// </summary>
    public virtual int Count { get; }

    /// <summary>
    /// Find the value for the given key.
    /// </summary>
    /// <param name="key">The key to lookup.</param>
    /// <returns>
    /// The value corresponding to <paramref name="key"/>.
    /// </returns>
    /// <exception cref="KeyNotFoundException">
    /// Thrown if the key is not found in this tree.
    /// </exception>
    public T this[K key] { get; }

    /// <summary>
    /// Add the given key-value pair to the tree.
    /// </summary>
    /// <param name="key">The key.</param>
    /// <param name="value">The value for the given key.</param>
    /// <returns>A tree containing the key-value pair.</returns>
    public Tree<K, T> Add(K key, T value);

    /// <summary>
    /// Remove the element with the given key.
    /// </summary>
    /// <param name="key">The key to remove.</param>
    /// <returns>
    /// A tree without the value corresponding to
    /// <paramref name="key"/>.
    /// </returns>
    public Tree<K, T> Remove(K key);
}

No benchmarks yet, it's still early stages. The implementation is based on a few functions from my Sasa class library, primarily some bit-twiddling functions from Sasa.Binary.

The whole implementation is literally 200 lines of code, excluding comments. The only deficiency of the current implementation is that it doesn't properly handle hash collisions. A simple linear chain on collision would be a simple extension. I also need an efficient tree merge operation. I was initially implementing Okasaki's Patricia trees because of their efficient merge, but HAMTs are just so much better in every other way. If anyone has any pointers to efficient merging for HAMTs, I'd be much obliged!

State of Sasa v0.9.4

Sasa itself is currently undergoing an aggressive reorganization in preparation for the 0.9.4 release. A lot of the optional abstractions are moving from Sasa core into their own assemblies. A lot of the useful abstractions are relatively stand-alone. It currently stands as follows, with dependencies listed between []:

Production Ready

  • Sasa [standalone]: tuples, option types that work with both types and structs, string extensions, IEnumerable extensions, thread and null-safe event extensions, type-safe enum extensions, lightweight type-safe wrappers for some system classes, eg. WeakReference and Delegate, extensions for code generation and debugging, and generic number extensions. The goal is to provide only essential extensions to address deficiencies in the system class libraries.
  • Sasa.Binary [standalone]: low-level bit twiddling functions, endian conversions, and portable BinaryReader and BinaryWriter.
  • Sasa.Collections [Sasa, Sasa.Binary]: efficient immutable collections library, including purely functional stacks, queues, lists, and trees. Tree needs some more testing obviously, since it's a rather new addition.
  • Sasa.Mime [standalone]: a simple library encapsulating standard media types and file extensions. It also provides an interface for extending these associations at runtime.
  • Sasa.Statistics [standalone]: a few basic numerical calculations, like standard deviation, and Pierce's criterion used to remove outliers from a data set.
  • Sasa.Net [Sasa, Sasa.Collections]: MIME mail message parsing to System.Net.Mail.MailMessage (most libraries provide unfamiliar, custom mail and attachment classes), a POP3 client, and RFC822 header parsing.
  • Sasa.Contracts [Sasa]: I've used the runtime preconditions subset of the standard .NET contracts for years. I haven't gotten around to adding postconditions and invariants support to ilrewriter.
  • ilrewriter.exe [Sasa]: the IL rewriter currently only erases Sasa.TypeConstraint<T> from your code, which allows you to specify type constraints that C# normally disallows, ie. T : Delegate, or T : Enum.

Beta

These abstractions work, but haven't seen the production use or stress testing the above classes have.

  • Sasa.TM [Sasa]: software transactional memory, super-fast thread-local data (much faster than ThreadLocal<T>!).
  • Sasa.Reactive [Sasa]: building on Rx.NET, this provides Property<T> which is a mutable, reactive cell with a getter/setter. Any changes automatically propagate to observers. NamedProperty<T> inherits from Property<T> and further implements INotifyPropertyChanged and INotifyPropertyChanging.
  • Sasa.Parsing [Sasa]: implements a simple, extensible Pratt parser. Grammars are generic and can be extended via standard inheritance. The test suite is extensive, although I've only use this in private projects, not production code.
  • Sasa.Linq [standalone]: base classes for LINQ expression visitors and query providers. Not too uncommon these days, but I've had them in Sasa for many years.

Currently Broken

These assemblies are undergoing some major refactoring, and are currently rather broken.

  • Sasa.Dynamics [Sasa, Sasa.Collections]: blazingly fast, type-safe runtime reflection. This code underwent significant refactoring, and I recently realized that the patterns being used here could be abstracted even further by providing multiple dispatch for .NET. See the multiple-dispatch branch of the repository for the current status of that work. This should be complete for Sasa v0.9.4.
  • Sasa.Serialization [Sasa, Sasa.Dynamics]: a compact, fast serializer based on Sasa.Dynamics. Waiting on the completion of Sasa.Dynamics.

Deprecated

These assemblies are now deprecated, either because they saw little use, were overly complex, or better alternatives now exist.

  • Sasa.Concurrency [Sasa]: primarily an overly complex implementation of futures based on Alice ML semantics. In a future release, Sasa.Concurrency will strip out futures, absorb Sasa.TM, and also provide a deterministic concurrency library based on concurrent revision control, which I believe to be inherently superior to STM.

Toy

These assemblies are not really meant for serious use, primarily because they don't fit with standard .NET idioms.

  • Sasa.FP [Sasa, Sasa.Collections]: some less useful functional idioms, like delegate currying and tupling, binomial trees, trivial immutable sets, and either types.
  • Sasa.Arrow [Sasa]: a convenient interface for arrows. Definitely not a idiomatic .NET!
  • Sasa.Linq.Expressions [Sasa]: extensions to compose LINQ expressions. Also provides some typed expression trees, as opposed to the standard untyped ones. In theory it should work, and the code all type checks, but pretty much 0 testing at the moment.

Comments

Popular posts from this blog

async.h - asynchronous, stackless subroutines in C

The async/await idiom is becoming increasingly popular. The first widely used language to include it was C#, and it has now spread into JavaScript and Rust. Now C/C++ programmers don't have to feel left out, because async.h is a header-only library that brings async/await to C! Features: It's 100% portable C. It requires very little state (2 bytes). It's not dependent on an OS. It's a bit simpler to understand than protothreads because the async state is caller-saved rather than callee-saved. #include "async.h" struct async pt; struct timer timer; async example(struct async *pt) { async_begin(pt); while(1) { if(initiate_io()) { timer_start(&timer); await(io_completed() || timer_expired(&timer)); read_data(); } } async_end; } This library is basically a modified version of the idioms found in the Protothreads library by Adam Dunkels, so it's not truly ground bre

Building a Query DSL in C#

I recently built a REST API prototype where one of the endpoints accepted a string representing a filter to apply to a set of results. For instance, for entities with named properties "Foo" and "Bar", a string like "(Foo = 'some string') or (Bar > 99)" would filter out the results where either Bar is less than or equal to 99, or Foo is not "some string". This would translate pretty straightforwardly into a SQL query, but as a masochist I was set on using Google Datastore as the backend, which unfortunately has a limited filtering API : It does not support disjunctions, ie. "OR" clauses. It does not support filtering using inequalities on more than one property. It does not support a not-equal operation. So in this post, I will describe the design which achieves the following goals: A backend-agnostic querying API supporting arbitrary clauses, conjunctions ("AND"), and disjunctions ("OR"). Implemen

Easy Automatic Differentiation in C#

I've recently been researching optimization and automatic differentiation (AD) , and decided to take a crack at distilling its essence in C#. Note that automatic differentiation (AD) is different than numerical differentiation . Math.NET already provides excellent support for numerical differentiation . C# doesn't seem to have many options for automatic differentiation, consisting mainly of an F# library with an interop layer, or paid libraries . Neither of these are suitable for learning how AD works. So here's a simple C# implementation of AD that relies on only two things: C#'s operator overloading, and arrays to represent the derivatives, which I think makes it pretty easy to understand. It's not particularly efficient, but it's simple! See the "Optimizations" section at the end if you want a very efficient specialization of this technique. What is Automatic Differentiation? Simply put, automatic differentiation is a technique for calcu