Hash Array Mapped Tree

Hash Array Mapped Trie, or HAMT for short, is an interesting combination of a hash table and a trie. The definitive reference for HAMT is a publication of Phil Bagwell, Ideal hash trees. Search the web for this title if the link won't work anymore. I suggest you read it to find out how HAMT works, and give only a brief outline here.

HAMT is a tree of arrays (tries), each element of a trie contains a single key-value pair. Each trie, except a root trie, is limited to x elements in length. The value of x equals platform bitness, 32 for 32-bit platforms and 64 for 64-bit ones. Thus, log₂(x) bits of the hash value is required to address any element of a trie. I'll limit further explanation to the 32-bit case, which may easily be extended to 64-bit one. For 32-bit platform x is 5 bits. At the moment of key insertion, 32-bit hash value is calculated, which gives a chance to address [(32 - initial_bitness) / 5] <= 6 levels of the tree without a need for a new hash value.

The elements of each trie are ordered based on the value of 5-bit hash pieces, where each piece was used as the index into the trie at the corresponding level (index may not be bigger than 31). Since the size of the root trie may differ from 32 (it's still limited to the powers of 2), insertion at the first tree level may require a part of hash value, different from 5 bits. If a trie element at the obtained index is empty, insertion is completed. Otherwise, the next 5 bits of the hash value are taken, a new trie is added as a child of the current one if not created yet, and insertion continues. If the hash value is exhausted (after 6 or less levels of tree walking), a new one must be obtained.

The tries are stored in compressed form, so that empty elements do not consume memory. For compression, a bitmap value is stored in each element to which a sub-trie is linked, each bit of this value indicates whether the sub-trie element at the corresponding index is taken or not. For example, the bitmap

...10100

indicates that there are non-empty elements at positions 2 and 4 in the subtrie of this node. This permits to store elements contiguously, and use a bit-counting function to obtain index into this contiguous array.

HAMT requirements

HAMT meets most of the map requirements, as defined by the C++ standard, and its public interface matches closely the one of std::hash_map. There are also a few differences, mentioned below:

There are no key_comp and value_comp methods defined, since there's no key ordering in HAMT. Instead, hash_funct and key_eq methods are defined to return the hash function object and key equality comparison object correspondingly. This behaviour matches STLPort implementation of hash_map. However, hash_map from STL bundled with Microsoft VS.NET implements both missing methods, using std::less as a default class for key comparison.
Iterating HAMT obtains elements in no particular order. This behavior is similar to std::hash_map's one. Furthermore, it's not possible to restore the key order without making a full copy of the iterated container.
Inserting an element into HAMT may invalidate the previously obtained iterators. This behaviour is controlled by the HAMT traits though, thus you can overcome this limitation.
HAMT iterators are bi-directional as required by the C++ standard. This matches Microsoft VS.NET implementation of std::hash_map. However, STLPort only implements forward iterators for this container.

HAMT iterators

HAMT implements bidirectional iterators. HAMT iterator only exists inside the [begin(), end()] range. That is, an attempt to do ++i, where i == end(), or --i, where i == begin(), will result in undefined behavior. Constructing a new iterator is a time-constant operation. Because of the caching scheme used by the HAMT storage implementation, it's possible that the iterator that would become invalid in std::map will stay valid in HAMT. For example, after erase() operation, the HAMT iterator will possibly point to the element of the same subtrie following immediately the one being erased.

Hash function

HAMT comes with three different hash functions, or hashers. The first one is a variant of a widely used, very fast hasher, which gives reasonable results for the number of keys below 1000.

size_t hash_fast(_Char key)
{
	size_t h = 0; 
	for ( ; *key; ++key) h = 5 * h + *key;

	return h;
}

The second hasher, used by default, is a high-quality rotating hash function, described by Robert J. Jenkins Jr. After some experiments with the other hashers, I found that this one yields very good hash distribution being only marginally slower than hash_fast.

Finally, the third hasher is an implementation of Pearson lookup algorithm with a randomly chosen lookup table.