Understanding Hash Tables in C++

1. Core Essence of Hash TablesHash Table is an efficient data structure based on “key-value mapping”, where the core idea is to “convert the key into an array index using a hash function”, allowing direct access to data via the index, achieving an average lookup/insertion/deletion efficiency of O(1).

Core Logic: key → Hash Function → Array Index → Access Corresponding Value (value).
Why is it fast? It skips steps like array traversal and tree structure comparison, directly “locating” the data position, providing significant advantages over linear structures (like arrays, linked lists) and tree structures (like red-black trees) under large data volumes.

2. Hash CollisionWhen two different keys producethe same array index through the hash function, a collision occurs.

1. Two Classic Solutions

Separate Chaining

Logic: Each index position in the array stores a linked list (or red-black tree), where the values corresponding to the colliding keys are sequentially inserted into the list/tree.
Characteristics: Simple to implement; if collisions increase, it only affects the query efficiency of the corresponding list (degrades to O(n), but is optimized to O(log n) with red-black trees).
Application: C++ STL’s unordered_map/unordered_set is based on “array + linked list/red-black tree” (automatically converts to red-black tree when the list length exceeds a threshold to improve efficiency).

Open Addressing

Logic: In case of a collision, it searches for the “next available position” in the array according to certain rules (like linear probing, quadratic probing, double hashing) to store the data.
Characteristics: No extra storage for lists, high memory utilization, but prone to “clustering” (multiple colliding keys pile up), and when deleting data, it must be marked as “deleted” (to avoid affecting subsequent searches).
Application: Java’s HashMap initially used linear probing, later switched to separate chaining.

2. Key to Reducing Collisions: Hash Function Design

Core Requirement: The hash function should distribute keys “uniformly” across the array indices, avoiding concentration on a few positions.
Common Design Ideas: Modulus operation (e.g., hash(key) % array size), bit manipulation (shift + XOR), multiplication hashing, etc.
Note: The array size is usually designed as a prime number or a power of 2 to further enhance uniform distribution.

3. Advantages and Disadvantages of Hash Tables

Advantages:

Average time complexity O(1): Extremely high efficiency for lookup, insertion, and deletion, making it the first choice for large data scenarios.
Supports fast key-value mapping: No need for traversal, directly locates via key, suitable for caching, dictionaries, etc.

Disadvantages:

Collisions are inevitable: Even with a well-designed hash function, collisions can occur when the number of keys exceeds the array size.
Lower space utilization: To reduce collisions, the array typically reserves some free space (load factor = number of elements / array size, STL default load factor is 0.75, exceeding this triggers resizing).
Unordered storage: The order of elements in a hash table is determined by the hash function and collision resolution method, and cannot guarantee insertion order (this is also the core difference with map: map is ordered, while unordered_map is unordered).

4. Advanced Application: Bloom Filter

The Bloom filter is a “derived optimization” of the hash table, specifically designed to solve the “rapid existence judgment of massive data” (such as cache penetration, blacklist verification).

Core Logic: Uses a super large binary array + multiple independent hash functions, where keys are mapped to multiple positions in the array by all hash functions, setting these positions to 1; during judgment, if all mapped positions are 1, it “may exist”; if any is 0, it “definitely does not exist”.
Characteristics:

Advantages: Extremely high space efficiency (binary storage), very fast query speed (multiple hash functions computed in parallel).
Disadvantages: False positives (may mistakenly judge “exists”), cannot delete elements (deletion affects other keys).

Difference from hash tables: Hash tables can store key-value pairs and support precise lookups; Bloom filters only store “existence markers”, not specific data, sacrificing precision for space and speed.

5. Hash Table Applications in STL

Container	Underlying Implementation	Core Characteristics	Time Complexity (Average)
unordered_map	Hash Table (Separate Chaining)	Unordered, supports key-value mapping	Lookup / Insertion O(1)
unordered_set	Hash Table (Separate Chaining)	Unordered, deduplication (only stores keys)	Lookup / Insertion O(1)
map	Red-Black Tree	Ordered (sorted by key), key-value mapping	Lookup / Insertion O(log n)
set	Red-Black Tree	Ordered, deduplication	Lookup / Insertion O(log n)

Use `unordered_*` when you need “unordered + fast lookup”; use `map`/`set (red-black tree) when you need "ordered + dynamic maintenance".`

6. Detailed Explanation of std::unordered_map Hash Algorithm Implementation

1. Basic Concepts

The core idea of hash tables

Key → Hash Function → Hash Value (size_t) → Bucket Index → Storage Location

Hash tables map keys to array indices via hash functions, achieving an average time complexity of O(1) for lookups.

2. Structure Composition

Based on the previously seen xhash source code, std::unordered_map includes:

a) Bucket Array

// Conceptual structure_Vec._Mypair._Myval2._Myfirst  // Bucket array pointer// Each bucket stores the head and tail pointers of a linked list

Bucket Array: A fixed-size array, where each element is a bucket
Bucket: Stores elements with the same hash value

b) Linked List (for handling collisions)

// Conceptual structure_List._Mypair._Myval2._Myhead  // Head node of the linked list

Uses separate chaining to handle collisions
Each bucket maintains a linked list to store elements with the same hash value

3. Role of Hash Function

struct string_hash{#ifdef __cpp_lib_string_view    using hash_type = std::hash<std::string_view>;#else    using hash_type = std::hash<std::string>;#endif    using is_transparent = void;    std::size_t operator()(const char* str) const { return hash_type{}(std::string(str)); }#ifdef __cpp_lib_string_view    std::size_t operator()(std::string_view str) const { return hash_type{}(str); }#endif    std::size_t operator()(const std::string& str) const { return hash_type{}(str); }};

Responsibilities of the hash function:

Convert any type of key to size_t
Distribute uniformly to minimize collisions
Same keys must produce the same hash value

Example:

std::unordered_map<std::string, int, string_hash> map;map["hello"] = 42; // Hash process: // 1. "hello" → string_hash::operator()("hello") // 2. Returns size_t value, e.g., 123456789 // 3. Calculates bucket index: 123456789 % bucket_count

4. Bucket Index Calculation

From the source code (xhash:1499):

const size_type _Bucket = _Hashval & _Mask;  // Bitwise operation for modulus

Principle:

// Traditional method (conceptually) size_t bucket_index = hash_value % bucket_count; // Actual implementation (optimization) size_t bucket_index = hash_value & mask;  // mask = bucket_count - 1 // Requires bucket_count to be a power of 2

Why use bitwise operations?

Faster: & is faster than %

Requirement: The number of buckets must be a power of 2 (e.g., 8, 16, 32, 64…)

5. Collision Handling: Separate Chaining

Data Structure:

Bucket Array:[0] → [key1, val1] → [key2, val2] → nullptr[1] → nullptr[2] → [key3, val3] → nullptr[3] → [key4, val4] → [key5, val5] → [key6, val6] → nullptr...

Collision Example:

std::unordered_map<std::string, int> map;map["apple"] = 1;   // hash("apple") % 8 = 3map["banana"] = 2;  // hash("banana") % 8 = 3  (collision!)map["cherry"] = 3;  // hash("cherry") % 8 = 3  (collision!) // Bucket[3]'s linked list: // [apple, 1] → [banana, 2] → [cherry, 3] → nullptr

6. Implementation of Find Operation

Based on the source code (xhash:1497-1525):

template <class _Keyty>_Find_last(const _Keyty& _Keyval, const size_t _Hashval) const {    // 1. Calculate bucket index    const size_type _Bucket = _Hashval & _Mask;    // 2. Get the starting position of the bucket    _Nodeptr _Where = _Vec._Mypair._Myval2._Myfirst[(_Bucket << 1) + 1]._Ptr;    // 3. Traverse the linked list to find    for (;;) {        // 4. Compare keys (using comparator)        if (!_Traitsobj(_Keyval, _Traits::_Kfn(_Where->_Myval))) {            // Found!            return {_Where->_Next, _Where};        }        // 5. Continue to the next node        _Where = _Where->_Prev;    }}

Lookup Process:

1. Calculate hash value: hash(key)2. Calculate bucket index: hash_value & mask3. Traverse the linked list in the bucket:   - Compare each node's key   - If found, return iterator   - If at the end of the list, return end()

Time Complexity:

Average case: O(1) – if the hash function distributes uniformly
Worst case: O(n) – if all elements are in the same bucket

7. Implementation of Insert Operation

Insertion Process:

// Pseudocode1. Calculate hash value and bucket index2. Check if it already exists in the bucket's linked list3. If exists:   - unordered_map: do not insert (unique key)   - unordered_multimap: insert (allows multiple same keys)4. If not exists:   - Create new node   - Insert at the head or tail of the list   - Update bucket pointer5. Check if rehashing is needed

From the source code (xhash:1545-1569):

_Insert_new_node_before(const size_t _Hashval, const _Nodeptr _Insert_before, const _Nodeptr _Newnode) {    // 1. Update linked list pointers    _Insert_after->_Next = _Newnode;    _Insert_before->_Prev = _Newnode;    // 2. Update bucket pointer    if (_Bucket_lo._Ptr == _Head) {        // Bucket is empty, set head and tail        _Bucket_lo._Ptr = _Newnode;        _Bucket_hi._Ptr = _Newnode;    } else if (_Bucket_lo._Ptr == _Insert_before) {        // New node is the first in the bucket        _Bucket_lo._Ptr = _Newnode;    }    // ...}

8. Load Factor and Rehashing

Load Factor:

load_factor = size() / bucket_count()

Default maximum load factor: usually 1.0
Triggers rehashing when the load factor exceeds a threshold

Rehashing Process:

// Pseudocode1. Create a larger bucket array (usually double the original)2. Recalculate hash values for all elements3. Redistribute all elements into the new buckets4. Free the old bucket array

From the source code:

// Check if rehashing is requiredif (this->_Check_rehash_required_1()) {    this->_Rehash_for_1();  // Perform rehashing    // Re-find insertion position    _Target = this->_Find_last(_Newnode._Ptr->_Myval.first, _Hashval);}

9. Complete Example: Insertion and Lookup Process

std::unordered_map<std::string, int> map; // Insert "apple" = 1 // 1. hash("apple") = 123456789 // 2. bucket = 123456789 % 8 = 5 // 3. Bucket[5] is empty, insert directly //    Bucket Array[5] → [apple, 1] → nullptr // Insert "banana" = 2 // 1. hash("banana") = 987654321 // 2. bucket = 987654321 % 8 = 1 // 3. Bucket[1] is empty, insert directly //    Bucket Array[1] → [banana, 2] → nullptr // Insert "cherry" = 3 (assumed to collide with "apple") // 1. hash("cherry") = 123456789 (assumed same) // 2. bucket = 123456789 % 8 = 5 // 3. Bucket[5] already has elements, add to linked list //    Bucket Array[5] → [cherry, 3] → [apple, 1] → nullptr // Lookup "apple" // 1. hash("apple") = 123456789 // 2. bucket = 5 // 3. Traverse linked list in Bucket[5]: //    - Compare "cherry" != "apple", continue //    - Compare "apple" == "apple", found!

struct string_hash{#ifdef __cpp_lib_string_view    using hash_type = std::hash<std::string_view>;#else    using hash_type = std::hash<std::string>;#endif    using is_transparent = void;    std::size_t operator()(const char* str) const { return hash_type{}(std::string(str)); }#ifdef __cpp_lib_string_view    std::size_t operator()(std::string_view str) const { return hash_type{}(str); }#endif    std::size_t operator()(const std::string& str) const { return hash_type{}(str); }};

String Lookup:

std::unordered_map<std::string, size_t, string_hash, std::equal_to<>> map{ {"one", 1} };

When executing map.find(“one”):

1. Calls string_hash::operator()(const char*)   → Converts "one" to std::string   → Calls std::hash<std::string>("one")   → Returns hash value, e.g., 12345 2. Calculates bucket index: 12345 & mask 3. Looks up in the bucket's linked list:   - Uses std::equal_to<> to compare "one" with keys in the list   - Finds matching node 4. Returns iterator

10. Summary

The core of the hash algorithm in std::unordered_map:

Hash Function: Maps keys to size_t
Bucket Array: A fixed-size array, where each bucket is a linked list
Separate Chaining: Stores collisions in linked lists
Bitwise Optimization: Uses & instead of % for bucket index calculation
Rehashing: Expands and redistributes when load factor is too high

This design provides an average time complexity of O(1) in most cases, making it an efficient key-value pair container.

Understanding Hash Tables in C++

1. Two Classic Solutions

Separate Chaining

3. Advantages and Disadvantages of Hash Tables

4. Advanced Application: Bloom Filter

5. Hash Table Applications in STL

Use `<span>unordered_*</span>` when you need “unordered + fast lookup”; use `<span>map</span>`/`<span>set</span><span><span> (red-black tree) when you need "ordered + dynamic maintenance".</span></span>`

Leave a Comment Cancel reply

1. Two Classic Solutions

Separate Chaining

3. Advantages and Disadvantages of Hash Tables

4. Advanced Application: Bloom Filter

5. Hash Table Applications in STL

Use <span>unordered_*</span> when you need “unordered + fast lookup”; use <span>map</span>/<span>set</span><span><span> (red-black tree) when you need "ordered + dynamic maintenance".</span></span>

Related posts

Leave a Comment Cancel reply

Use `<span>unordered_*</span>` when you need “unordered + fast lookup”; use `<span>map</span>`/`<span>set</span><span><span> (red-black tree) when you need "ordered + dynamic maintenance".</span></span>`