Introduction
std::unordered_map is C++’s hash table container. It provides average-case constant-time \(O(1)\) search, insertion, and deletion of elements with unique keys. In this blog post, I explore the basic API of unordered_map.
Construction
std::unordered_map is a class template in the header <unordered_map>.
template<typename Key, typename Value,
typename HashFunc = std::hash<Key>,
typename KeyEqual = std::equal_to<Key>,
typename Alloc = std::allocator<std::pair<const Key, Value>>>
class unordered_map;
It allows you to specify a custom hash function, a custom equality predicate for comparing keys, and a custom allocator.
#include <unordered_map>
#include <print>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>
int main(){
// Default construction
std::unordered_map<std::string, uint32_t> word_count;
// Initializer list construction
std::unordered_map<std::string, std::vector<double>> symbol_to_price_levels_map
{
{"AAPL", {100.0, 101.0, 102.0}},
{"GOOG", {1500.0, 1505.0, 1510.0}}
};
// with bucket count hint
// at least 100 buckets
std::unordered_map<std::string, uint32_t> map(100);
// Range construction (accepts a pair of iterators)
std::unordered_map<std::string, uint32_t> word_count_copy(
word_count.begin(), word_count.end()
);
// copy construction
std::unordered_map<std::string, std::string> cache{
{"Fedor Pikus", "Performance GOAT"},
{"Herb Sutter", "C++ Visionary"},
{"John Carmack", "Legendary programmer"}
};
std::unordered_map<std::string, std::string> cache_copy(cache);
// move construction
std::unordered_map<std::string, uint32_t> word_count2{std::move(word_count)};
// with a custom hash function
// Knuth's multiplicative hash
struct CustomHash{
size_t operator()(int key) const{
// use unsigned arithmetic to avoid signed overflow
return static_cast<std::size_t>(static_cast<uint32_t>(key) * 2654435761u);
}
};
std::unordered_map<int, uint32_t, CustomHash> map_with_custom_hash;
// with a custom hash and custom equality
struct Point{
double x;
double y;
bool operator==(const Point& other) const{
return x == other.x && y == other.y;
}
};
auto point_equal = [](const Point& a, const Point&b) -> bool{
return a.x == b.x && a.y == b.y;
};
auto point_hash = [](const Point& p) -> std::size_t{
std::size_t h1 = std::hash<double>{}(p.x);
std::size_t h2 = std::hash<double>{}(p.y);
return h1 ^ (h2 << 1);
};
std::unordered_map<Point, std::size_t,
decltype(point_hash), decltype(point_equal)> points_map;
// from a vector of pairs
std::vector<std::pair<std::string, uint32_t>> pairs{{"alpha", 1}, {"beta", 2}};
std::unordered_map<std::string, uint32_t> map_from_vector(pairs.begin(), pairs.end());
return 0;
}
Insertion
#include <unordered_map>
#include <print>
#include <cstddef>
#include <string>
#include <vector>
#include <utility>
int main(){
std::unordered_map<int, std::string> map;
// array index operator
map[1] = "Carl Gauss";
map[2] = "Augustin Louis Cauchy";
// insert(pair(k,v))
// returns pair<iterator,bool>. bool is true
// if the insert succeeds, else false
auto [it1, success1] = map.insert(std::make_pair( 3, "Joseph Fourier" ));
auto [it2, success2] = map.insert(std::make_pair( 4, "Joseph Lagrange"));
// void insert(initializer_list)
// insert from an initializer list
map.insert(
{
{5, "Riemann"},
{6, "Ada Lovelace"}
});
// emplace(Args&& ...)
// inserts a new element into the container constructed
// in-place with the given args, if there is no element
// with the key in the container
auto [it3, success3] = map.emplace(7, "Leonhard Euler");
// The new element may be constructed even if there already
// is an element with the key in the container, in which case
// the newly constructed element will be destroyed immediately
int k {7};
std::string v { "David Hilbert" };
std::println("emplace({}, {}) called", k, v);
auto [it4, success4] = map.emplace(k, v);
std::println("success flag : {}", success4);
// try_emplace(Key&&, Args&&)
// if the key k already exists in the container, do nothing
// Otherwise, insert a new element into the container
// with key k and value constructed with args
auto [it5, success5] = map.try_emplace(8, "David Hilbert");
// range insert - takes a pair of iterators
std::vector<std::pair<int, std::string>> vec{
{9, "Andrew Kolmogorov"},
{10, "Emile Borel"}
};
map.insert(vec.begin(), vec.end());
// insert_range(R&& rg) (C++23)
// Inserts a copy of each element in the range rg
// whose key is not already present in the map
const auto rg = {std::pair{11, "Guido Cantelli"}, std::pair{12, "Terence Tao"}};
map.insert_range(rg);
// insert_or_assign(key, obj)
// If the key k already exists in the container, assigns
// obj to the mapped value of k. If the key k does not exist,
// it inserts the new value as if by insert.
// The bool in the returned pair is true if an insertion
// took place and false if an assignment took place.
auto [it6, success6] = map.insert_or_assign(12, "William Feller");
auto [it7, success7] = map.insert_or_assign(13, "Andrew Wiles");
return 0;
}
Search
#include <unordered_map>
#include <print>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>
#include <utility>
struct PhoneRecord{
std::string full_name;
std::string address;
std::string phone;
std::string pin;
};
int main(){
std::unordered_map<std::string, PhoneRecord> phone_book;
phone_book.insert({
{"John Smith", {"John Smith", "123 Main Street, New York, NY 10001", "(212) 555-1234", "1234"}},
{"Mary Johnson", {"Mary Johnson", "456 Oak Avenue, Los Angeles, CA 90001", "(213) 555-2345", "2345"}},
{"Robert Williams", {"Robert Williams", "789 Pine Road, Chicago, IL 60601", "(312) 555-3456", "3456"}},
{"Patricia Brown", {"Patricia Brown", "321 Maple Drive, Houston, TX 77001", "(713) 555-4567", "4567"}},
{"Michael Jones", {"Michael Jones", "654 Cedar Lane, Phoenix, AZ 85001", "(602) 555-5678", "5678"}},
{"Linda Garcia", {"Linda Garcia", "987 Elm Court, Philadelphia, PA 19101", "(215) 555-6789", "6789"}},
{"David Miller", {"David Miller", "147 Birch Place, San Antonio, TX 78201", "(210) 555-7890", "7890"}},
{"Barbara Davis", {"Barbara Davis", "258 Walnut Boulevard, San Diego, CA 92101", "(619) 555-8901", "8901"}},
{"Richard Rodriguez", {"Richard Rodriguez", "369 Cherry Way, Dallas, TX 75201", "(214) 555-9012", "9012"}},
{"Susan Martinez", {"Susan Martinez", "741 Hickory Circle, San Jose, CA 95101", "(408) 555-0123", "0123"}},
{"Joseph Hernandez", {"Joseph Hernandez", "852 Willow Terrace, Austin, TX 78701", "(512) 555-1235", "1235"}},
{"Jessica Lopez", {"Jessica Lopez", "963 Poplar Parkway, Jacksonville, FL 32099", "(904) 555-2346", "2346"}},
{"Thomas Gonzalez", {"Thomas Gonzalez", "159 Sycamore Trail, Fort Worth, TX 76101", "(817) 555-3457", "3457"}},
{"Sarah Wilson", {"Sarah Wilson", "357 Ash Street, Columbus, OH 43085", "(614) 555-4568", "4568"}},
{"Charles Anderson", {"Charles Anderson", "486 Magnolia Avenue, Charlotte, NC 28201", "(704) 555-5679", "5679"}},
{"Karen Thomas", {"Karen Thomas", "753 Dogwood Road, San Francisco, CA 94101", "(415) 555-6780", "6780"}},
{"Christopher Taylor", {"Christopher Taylor", "951 Redwood Drive, Indianapolis, IN 46201", "(317) 555-7891", "7891"}},
{"Nancy Moore", {"Nancy Moore", "246 Sequoia Lane, Seattle, WA 98101", "(206) 555-8902", "8902"}},
{"Daniel Jackson", {"Daniel Jackson", "135 Fir Court, Denver, CO 80201", "(303) 555-9013", "9013"}},
});
// iter = find(const Key& key);
auto it = phone_book.find("Robert Williams");
if(it != phone_book.end()){
const auto& [name, record] = *it;
std::println("Key = {}", name);
std::println("Full name = {}", record.full_name);
std::println("Address = {}", record.address);
std::println("Phone = {}", record.phone);
}
// contains(), at()
std::string key {"Robert Williams"};
if(phone_book.contains(key))
{
std::println("Key exists: {}", key);
auto record = phone_book.at(key);
}
// Array index operator
// operator[]() returns a reference to the value that
// is mapped to the key k, performing an insertion if
// such a key does not exist.
std::unordered_map<char, int> letter_counts{{'a', 27}, {'b', 3}, {'c', 1}};
println("letter_counts initially contains: {}", letter_counts);
int result1 = letter_counts['a']; // ok
int result2 = letter_counts['d']; // just inserts the pair ('d', 0)
println("letter_counts now contains: {}", letter_counts);
// at(const Key& key)
// Returns a reference to the mapped value of the element
// with specified key. If no such element exists, an
// exception of the type std::out_of_range is thrown.
int result3 = letter_counts.at('a');
try{
int result4 = letter_counts.at('e');
}catch(std::out_of_range& ex){
std::println("Out of range exception occured!");
}
return 0;
}
Deletion
#include <unordered_map>
#include <print>
#include <cstddef>
#include <string>
#include <vector>
#include <utility>
int main(){
std::unordered_map<int, std::string> map{
{1, "one"}, {2, "two"}, {3, "three"},
{4, "four"}, {5, "five"}, {6, "six"}
};
// erase(pos) by iterator
// It removes the specified element from the container
// and returns an iterator following the last removed
// element.
// The order of the remaining elements is preserved.
// This makes it possible to erase individual elements
// while iterating the container.
// pos is an iterator to the element to be removed.
// Any references and iterators to erased elements
// are invalidated.
for(auto it{map.begin()}; it!=map.end();){
if(it->first % 2 != 0)
it = map.erase(it);
else
++it;
}
// erase(key) by key
// It removes the element (if one exists) with the key.
// It returns the number of elements removed (0 or 1).
std::size_t num_elements_removed = map.erase(4);
// clear
// clear() removes all elements
map.clear();
// node_type extract(pos)
// unlinks the node that contains the element pointed to
// by pos and returns a node handle that owns it.
// node_type extract(key)
// unlinks the node corresponding to key and returns
// a node_handle to it, else returns an empty node_handle
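// A minimal sketch of using extract (the map was cleared
// above, so repopulate it first).
map.insert({{7, "seven"}, {8, "eight"}});
auto node_by_iter = map.extract(map.begin()); // extract by iterator
auto node_by_key = map.extract(8); // extract by key
if(!node_by_key.empty()){
std::println("extracted key {} with value {}", node_by_key.key(), node_by_key.mapped());
}
auto missing = map.extract(42); // no such key -> empty node handle
std::println("extract(42) returned an empty handle: {}", missing.empty());
return 0;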
}
How to choose a good hash function?
- Uniform distribution. Input data often has patterns and is not random, so a good hash function should spread its hash values uniformly over the output range.
- Avalanche effect. A small change in the input (e.g. flipping a single bit) should change the output hash value significantly.
- Hash functions should be fast. Cryptographic hash functions like SHA-1, SHA-256, SHA-512 and MD5 are designed so that it is maximally difficult to produce collisions, and we use them every day for signing, checksumming etc., but they are slow to compute, which makes them a poor fit for hash tables.
At its core, a hash function can be described as follows:
def hash_function(data: bytes) -> int:
    hash = initial_state
    for chunk in data:
        hash = cleverness(hash, chunk)
    return hash
Basically, you have a rolling hash value. We can think of it as a context or a state, but it is ultimately just a number. You take a block of input data chunk (perhaps \(1\) byte or \(4\) bytes), mix the two together, and out comes another number. There may be some pre- and post-processing depending on the function, but that is pretty much how it works. The cleverness part is what is called a mixing function. If you ever see the words mixing function on a Wikipedia page, this is what it refers to.
How do you implement this cleverness function?
In general, you do it with unsigned, fixed-length integers. You use:
- Bitwise operations (XOR, rotation, unsigned bit shifts)
- Multiplication
- Usually, there is at least one magic number, e.g. a prime.
In hash table implementations, the number of buckets is often a power of \(2\), e.g. \(128\) or \(256\) buckets, because the bucket index can then be computed with a cheap bit mask instead of a full modulo. The downside is that if the constants used to mix the hash state share factors with the bucket count, you can end up bucketing everything into the same few slots.
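To make that concrete, here is a small sketch (the helper names and constants are mine, purely for illustration) of how the bucket index is usually derived from a hash, and why a mixing constant that wipes out the low bits is dangerous with a power-of-two bucket count:
#include <cstdint>
#include <print>
// hypothetical hash functions, just for illustration
std::uint32_t bad_hash(std::uint32_t key)  { return key * 256u; }         // multiplier shares factors with the bucket count
std::uint32_t good_hash(std::uint32_t key) { return key * 2654435761u; }  // odd (Knuth) multiplier keeps the low bits informative
int main(){
    constexpr std::uint32_t num_buckets = 256;  // power of 2
    for(std::uint32_t key : {1u, 2u, 3u, 4u}){
        // with a power-of-2 bucket count, the index is just a bit mask
        std::uint32_t bad  = bad_hash(key)  & (num_buckets - 1);
        std::uint32_t good = good_hash(key) & (num_buckets - 1);
        std::println("key {} -> bad bucket {}, good bucket {}", key, bad, good);
    }
}
Every key hashed with bad_hash lands in bucket 0, because multiplying by 256 zeroes out exactly the bits that the mask keeps.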
FNV-1A
FNV-1a is a venerable hash function created by Fowler, Noll and Vo in 1991, and it has been improved a couple of times since. It is still in wide use today.
#include <cstddef>
#include <cstdint>
uint32_t fnv1a32(const std::byte* bytes, std::size_t num_bytes){
// This is actually the hash of the email
// signatures of one of the creators of FNV.
uint32_t hash = 2166136261;
for(auto i{0uz}; i<num_bytes; ++i){
hash ^= std::to_integer<uint32_t>(bytes[i]);
hash *= 16777619;
}
return hash;
}
Compiler Explorer
Suppose that the input string is "ab". We get the following trace from the FNV-1a algorithm:
hash(0) = 10000001000111001001110111000101
byte[0] =                         01100001
XOR     = 10000001000111001001110110100100
hash(1) = 11100100000011000010100100101100
byte[1] =                         01100010
XOR     = 11100100000011000010100101001110
hash(2) = 01001101001001010000010111001010
The XOR changes only a few bits, and that is why the multiplier is there. \(16777619\) is a prime close to \(2^{24}\), and multiplying by it gives you bit-shifting behavior as well. FNV-1a kind of hides this bit shifting, but every other algorithm uses bit shifts heavily to get this avalanche behavior. So what the multiplication is doing is shifting everything up and then adding some noise to the lower end of the number.
16'777'619 in binary is 0b0000 0001 0000 0000 0000 0001 1001 0011. When you multiply by this number, you are essentially doing:
\[ hash \times 16777619 = hash \times (2^{24} + 2^8 + 2^7 + 2^4 + 2^1 + 2^0) \]
To keep things simple, consider the value 5 = 0b0101. Assume we multiply \(5 \times 8 = 5 \times 2^3\). Using the rules of binary multiplication, we get \(40\) = 0b101000, which is just 0b0101 with three zero bits appended.
Hence, x * 2^3 is the same as x << 3, so multiplying by \(16777619\) amounts to adding together copies of the hash shifted left by 24, 8, 7, 4, 1 and 0 bits.
So, after the first byte a = 0b0110 0001 the hash value is 0b1110 0100 0000 1100 0010 1001 0010 1100. The next byte b = 0b0110 0010 differs from a in just 2 bits, yet after mixing it in the hash value becomes 0b0100 1101 0010 0101 0000 0101 1100 1010.
0b1110 0100 0000 1100 0010 1001 0010 1100
0b0100 1101 0010 0101 0000 0101 1100 1010
  ^ ^  ^  ^   ^  ^  ^   ^  ^^   ^^^   ^^
You can see how well distributed the two hash values are even though the inputs are so similar. Every nibble has between one and three bits changed. So we are already seeing real avalanche behavior here: things are getting moved, things are getting mutated, it all looks good.
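To see this avalanche numerically, here is a small check using the fnv1a32 function defined above (the hash_string helper is mine, just for convenience): it hashes "aa" and "ab", inputs whose last byte differs in only two bits, and counts how many of the 32 output bits flip.
#include <bit>
#include <cstddef>
#include <cstdint>
#include <print>
#include <string_view>
// fnv1a32 as defined above
uint32_t fnv1a32(const std::byte* bytes, std::size_t num_bytes){
    uint32_t hash = 2166136261u;
    for(std::size_t i = 0; i < num_bytes; ++i){
        hash ^= std::to_integer<uint32_t>(bytes[i]);
        hash *= 16777619u;
    }
    return hash;
}
uint32_t hash_string(std::string_view s){
    return fnv1a32(reinterpret_cast<const std::byte*>(s.data()), s.size());
}
int main(){
    uint32_t h1 = hash_string("aa");
    uint32_t h2 = hash_string("ab"); // the input differs from "aa" in only 2 bits
    std::println("hash(\"aa\") = {:032b}", h1);
    std::println("hash(\"ab\") = {:032b}", h2);
    // std::popcount counts how many of the 32 output bits changed
    std::println("output bits flipped: {} out of 32", std::popcount(h1 ^ h2));
}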
Let us look at a few hash functions that are considered to be reasonably state of the art.
Murmur Hash
algorithm Murmur3_32 is
// Note: In this version, all arithmetic is performed with unsigned 32-bit integers.
// In the case of overflow, the result is reduced modulo 2^32.
input: data, len, seed
c1 ← 0xcc9e2d51
c2 ← 0x1b873593
r1 ← 15
r2 ← 13
m ← 5
n ← 0xe6546b64
hash ← seed
for each fourByteChunk of data do
k ← fourByteChunk
k ← k × c1
k ← (k << r1) | (k >> (32 - r1))
k ← k × c2
hash ← hash XOR k
hash ← (hash << r2) | (hash >> (32 - r2))
hash ← (hash × m) + n
with any remainingBytesInKey do
remainingBytes ← SwapToLittleEndian(remainingBytesInKey)
// Note: Endian swapping is only necessary on big-endian machines.
// The purpose is to place the meaningful digits towards the low end of the value,
// so that these digits have the greatest potential to affect the low range digits
// in the subsequent multiplication. Consider that locating the meaningful digits
// in the high range would produce a greater effect upon the high digits of the
// multiplication, and notably, that such high digits are likely to be discarded
// by the modulo arithmetic under overflow. We don't want that.
remainingBytes ← remainingBytes × c1
remainingBytes ← (remainingBytes << r1) | (remainingBytes >> (32 - r1))
remainingBytes ← remainingBytes × c2
hash ← hash XOR remainingBytes
hash ← hash XOR len
hash ← hash XOR (hash >> 16)
hash ← hash × 0x85ebca6b
hash ← hash XOR (hash >> 13)
hash ← hash × 0xc2b2ae35
hash ← hash XOR (hash >> 16)
Murmur3 has a few constants, and quite a few of those are primes. There is also a seed value that is sometimes provided by the user and sometimes hardcoded. It means that if you want to generate different hash values for different users, you have a way to do it.
Then we start the mixing function. Murmur hash operates on \(4\) bytes at a time, rather than \(1\) byte. We multiply the chunk by a constant, rotate its bits to distribute the values more evenly, multiply by a second constant, exclusive-OR it into the running hash, and then rotate and scramble the hash further.
If there are fewer than \(4\) bytes remaining, there is an alternative mixing function for the tail.
Finally, there is a series of post-processing steps (the finalizer) to introduce more randomness into the whole thing. Again, we have multiplications by constants and bit shifts to spread the hash as widely as possible.
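Those finalizer steps translate almost mechanically into C++. Here is a sketch of the 32-bit finalizer, using the constants from the pseudocode above (fmix32 is the name commonly used for it in MurmurHash3 sources):
#include <cstdint>
// Murmur3 32-bit finalizer: the xor-shift and multiply steps applied
// after all input blocks have been mixed into the hash.
uint32_t fmix32(uint32_t hash){
    hash ^= hash >> 16;
    hash *= 0x85ebca6b;
    hash ^= hash >> 13;
    hash *= 0xc2b2ae35;
    hash ^= hash >> 16;
    return hash;
}
Because every step is invertible (an xor-shift and a multiplication by an odd constant are both bijections on 32-bit integers), the finalizer never loses information; it only redistributes the bits.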
The theory of hashing
The goal of hashing is to provide a dictionary that is faster than balanced binary search trees, which only give \(O(\log n)\) operations.
The formal setup for hashing is as follows:
The keys come from some large universe \(U\).
We will perform inserts and lookups by having an array \(A\) of size \(m\), and a hash function \(h:U \to \{0,\ldots,m-1\}\). Given an element \(x\), the idea of hashing is that we want to store it in \(A[h(x)]\).
We need a method for resolving collisions. A collision is when \(h(x) = h(y)\) for two different \(x\) and \(y\). There are two methods to handle collisions:
- Separate chaining
- Open addressing
Separate chaining
In the event that multiple keys hash to the same slot, we store them in a linked list attached to that slot.
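Here is a minimal sketch of that idea (my own toy types, not how any standard library actually implements unordered_map):
#include <cstddef>
#include <functional>
#include <list>
#include <optional>
#include <print>
#include <string>
#include <utility>
#include <vector>
// Toy separate-chaining table: each slot owns a linked list of key/value pairs.
template<typename K, typename V>
class ChainedHashMap{
    std::vector<std::list<std::pair<K, V>>> buckets_;
    std::size_t slot(const K& key) const { return std::hash<K>{}(key) % buckets_.size(); }
public:
    explicit ChainedHashMap(std::size_t num_buckets) : buckets_(num_buckets) {}
    void insert(const K& key, const V& value){
        auto& chain = buckets_[slot(key)];
        for(auto& [k, v] : chain){
            if(k == key){ v = value; return; } // overwrite an existing key
        }
        chain.emplace_back(key, value); // otherwise append to the chain
    }
    std::optional<V> find(const K& key) const{
        for(const auto& [k, v] : buckets_[slot(key)]){
            if(k == key) return v;
        }
        return std::nullopt;
    }
};
int main(){
    ChainedHashMap<std::string, int> table(8);
    table.insert("apples", 3);
    table.insert("oranges", 5);
    std::println("apples -> {}", table.find("apples").value_or(-1));
    std::println("pears -> {}", table.find("pears").value_or(-1));
}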
Let’s compute the expected chain length. Suppose we insert \(n\) keys \(x_1, \ldots, x_n\) into a table with \(m\) slots, and assume the hash function distributes keys uniformly. For slot \(t\), let \(C_t\) be the random variable equal to the length of the linked list chain at slot \(t\).
The expected chain length \(\mathbf{E}[C_t]\) turns out to be a constant. Let \(I_j\) be the indicator random variable which is \(1\) if the hash of \(x_j\) falls in bucket \(t\), and \(0\) otherwise. Then, we have:
\[ \begin{align*} C_t &= I_1 + \ldots + I_n \\ \mathbf{E}[C_t] &= \sum_{j=1}^{n}\mathbf{E}[I_j] \\ &= \sum_{j=1}^n P(h(x_j) = t) \\ &= \sum_{j=1}^n \frac{1}{m} \\ &= \frac{n}{m} \end{align*} \]
where \(\alpha = n/m\) is frequently called the load factor. The load factor is constant if we assume that \(c_1 m \leq n \leq c_2 m\), or equivalently \(n = \Theta(m)\). This assumption can be kept satisfied by doubling the hash table size as needed.
Hence, our average-case lookup time is \(O(1+\alpha) = O(1+\frac{n}{m})\), which is \(O(1)\) as long as the load factor is kept constant.
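std::unordered_map exposes exactly these knobs. Here is a short sketch of inspecting and controlling the load factor (the printed bucket counts are implementation-defined, so the exact numbers will vary):
#include <print>
#include <string>
#include <unordered_map>
int main(){
    std::unordered_map<int, std::string> map;
    // The table rehashes (grows its bucket array) whenever inserting would push
    // size() / bucket_count() above max_load_factor(), which defaults to 1.0.
    std::println("max_load_factor = {}", map.max_load_factor());
    for(int i = 0; i < 1000; ++i){
        map.emplace(i, std::to_string(i));
    }
    std::println("size = {}, buckets = {}, load_factor = {}",
                 map.size(), map.bucket_count(), map.load_factor());
    // Reserve space for 10'000 elements up front to avoid rehashing while inserting.
    map.reserve(10'000);
    std::println("after reserve: buckets = {}, load_factor = {}",
                 map.bucket_count(), map.load_factor());
}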
Open addressing
Open addressing is another collision resolution technique in which all {key, value} pairs are stored in the hash table itself. Suppose our hash table has \(m\) slots. We can think of the hash function as a function of \(2\) parameters, \(h(k,i)\), where \(k\) is the key and \(i\) is the trial count.
\[ h: \underbrace{\mathcal{U}}_{\text{Key Universe}} \times \underbrace{\{0,1,\ldots,m-1\}}_{\text{trial count}} \to \underbrace{\{0,1,\ldots,m-1\}}_{\text{slot in table}} \]
\(\langle h(k,0),\ldots,h(k,m-1)\rangle\) is some permutation of \(\langle 0,1,\ldots,m-1\rangle\). If we keep trying \(h(k,i)\) for increasing \(i\), we will eventually hit all slots of the table.
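Common textbook choices for the probe sequence (e.g. from CLRS), where \(h'\), \(h_1\), \(h_2\) are ordinary hash functions, are:
\[ \begin{align*} \text{linear probing:} \quad h(k,i) &= (h'(k) + i) \bmod m \\ \text{quadratic probing:} \quad h(k,i) &= (h'(k) + c_1 i + c_2 i^2) \bmod m \\ \text{double hashing:} \quad h(k,i) &= (h_1(k) + i \, h_2(k)) \bmod m \end{align*} \]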
Inserting items.
We keep probing until an empty slot is found. When an empty slot is found, we insert the key-value pair into that slot.
Searching items.
As long as the slots we encounter by probing are occupied by keys \(\neq k\), keep probing until you either encounter \(k\) or find an empty slot, and return success or failure respectively.
Deleting items
We cannot just find an item and remove it from its slot (that is, set T[h(k,i)] = None). For example, following the probe sequence in the table below, if we perform delete(586) and mark its slot as None, then search(496) will fail, because the probe stops at the now-empty slot before ever reaching 496. Instead, we replace a deleted item with a special flag, DeleteMe, which acts as a tombstone. insert(k,v) treats it as None, but search(k) doesn't.
+-------+
0 | |
|-------|
1 | 586 | probe h(496,1)
|-------|
2 | 133 | probe h(496,0)
|-------|
3 | |
|-------|
4 | 204 | probe h(496,2)
|-------|
5 | 496 | probe h(496,3)
|-------|
6 | 481 |
|-------|
7 | |
|-------|
| |
|-------|
| |
|-------|
m-1 | |
+-------+
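To tie these ideas together, here is a toy open-addressing set with linear probing and tombstone deletion (my own simplified types; note that std::unordered_map itself is in practice implemented with separate chaining because of its iterator and reference stability guarantees):
#include <cstddef>
#include <functional>
#include <print>
#include <vector>
// Toy open-addressing table with linear probing and tombstones.
class ProbingSet{
    enum class SlotState { Empty, Occupied, Deleted }; // Deleted is the tombstone
    struct Slot { int key{}; SlotState state{SlotState::Empty}; };
    std::vector<Slot> slots_;
    std::size_t probe(int key, std::size_t i) const{
        // linear probing: h(k, i) = (h'(k) + i) mod m
        return (std::hash<int>{}(key) + i) % slots_.size();
    }
public:
    explicit ProbingSet(std::size_t capacity) : slots_(capacity) {}
    void insert(int key){
        Slot* first_free = nullptr;
        for(std::size_t i = 0; i < slots_.size(); ++i){
            Slot& s = slots_[probe(key, i)];
            if(s.state == SlotState::Occupied && s.key == key) return;  // already present
            if(s.state != SlotState::Occupied && first_free == nullptr) first_free = &s;
            if(s.state == SlotState::Empty) break;  // the key cannot be further along the probe sequence
        }
        if(first_free != nullptr) *first_free = {key, SlotState::Occupied};  // reuse a tombstone or empty slot
    }
    bool contains(int key) const{
        for(std::size_t i = 0; i < slots_.size(); ++i){
            const Slot& s = slots_[probe(key, i)];
            if(s.state == SlotState::Empty) return false;  // a truly empty slot ends the probe
            if(s.state == SlotState::Occupied && s.key == key) return true;
            // tombstones are skipped: keep probing
        }
        return false;
    }
    void erase(int key){
        for(std::size_t i = 0; i < slots_.size(); ++i){
            Slot& s = slots_[probe(key, i)];
            if(s.state == SlotState::Empty) return;
            if(s.state == SlotState::Occupied && s.key == key){
                s.state = SlotState::Deleted;  // leave a tombstone rather than marking the slot Empty
                return;
            }
        }
    }
};
int main(){
    ProbingSet set(8);
    set.insert(586);
    set.insert(496);
    set.erase(586);  // leaves a tombstone in 586's slot
    std::println("contains(496) after erase(586): {}", set.contains(496));  // still found
}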