1. Core Essence of Hash TablesHash Table is an efficient data structure based on “key-value mapping”, where the core idea is to “convert the key into an array index using a hash function”, allowing direct access to data via the index, achieving an average lookup/insertion/deletion efficiency of O(1).
- Core Logic: key → Hash Function → Array Index → Access Corresponding Value (value).
- Why is it fast? It skips steps like array traversal and tree structure comparison, directly “locating” the data position, providing significant advantages over linear structures (like arrays, linked lists) and tree structures (like red-black trees) under large data volumes.
2. Hash CollisionWhen two different keys producethe same array index through the hash function, a collision occurs.
1. Two Classic Solutions
Separate Chaining
- Logic: Each index position in the array stores a linked list (or red-black tree), where the values corresponding to the colliding keys are sequentially inserted into the list/tree.
- Characteristics: Simple to implement; if collisions increase, it only affects the query efficiency of the corresponding list (degrades to O(n), but is optimized to O(log n) with red-black trees).
- Application: C++ STL’s
<span>unordered_map</span>/<span>unordered_set</span>is based on “array + linked list/red-black tree” (automatically converts to red-black tree when the list length exceeds a threshold to improve efficiency).
Open Addressing
- Logic: In case of a collision, it searches for the “next available position” in the array according to certain rules (like linear probing, quadratic probing, double hashing) to store the data.
- Characteristics: No extra storage for lists, high memory utilization, but prone to “clustering” (multiple colliding keys pile up), and when deleting data, it must be marked as “deleted” (to avoid affecting subsequent searches).
- Application: Java’s
<span>HashMap</span>initially used linear probing, later switched to separate chaining.
2. Key to Reducing Collisions: Hash Function Design
- Core Requirement: The hash function should distribute keys “uniformly” across the array indices, avoiding concentration on a few positions.
- Common Design Ideas: Modulus operation (e.g.,
<span>hash(key) % array size</span>), bit manipulation (shift + XOR), multiplication hashing, etc. - Note: The array size is usually designed as a prime number or a power of 2 to further enhance uniform distribution.
3. Advantages and Disadvantages of Hash Tables
- Advantages:
- Average time complexity O(1): Extremely high efficiency for lookup, insertion, and deletion, making it the first choice for large data scenarios.
- Supports fast key-value mapping: No need for traversal, directly locates via key, suitable for caching, dictionaries, etc.
- Disadvantages:
- Collisions are inevitable: Even with a well-designed hash function, collisions can occur when the number of keys exceeds the array size.
- Lower space utilization: To reduce collisions, the array typically reserves some free space (load factor = number of elements / array size, STL default load factor is 0.75, exceeding this triggers resizing).
- Unordered storage: The order of elements in a hash table is determined by the hash function and collision resolution method, and cannot guarantee insertion order (this is also the core difference with
<span><span>map</span></span>:<span><span>map</span></span>is ordered, while<span><span>unordered_map</span></span>is unordered).
4. Advanced Application: Bloom Filter
The Bloom filter is a “derived optimization” of the hash table, specifically designed to solve the “rapid existence judgment of massive data” (such as cache penetration, blacklist verification).
- Core Logic: Uses a super large binary array + multiple independent hash functions, where keys are mapped to multiple positions in the array by all hash functions, setting these positions to 1; during judgment, if all mapped positions are 1, it “may exist”; if any is 0, it “definitely does not exist”.
- Characteristics:
- Advantages: Extremely high space efficiency (binary storage), very fast query speed (multiple hash functions computed in parallel).
- Disadvantages: False positives (may mistakenly judge “exists”), cannot delete elements (deletion affects other keys).
- Difference from hash tables: Hash tables can store key-value pairs and support precise lookups; Bloom filters only store “existence markers”, not specific data, sacrificing precision for space and speed.
5. Hash Table Applications in STL
| Container | Underlying Implementation | Core Characteristics | Time Complexity (Average) |
| unordered_map | Hash Table (Separate Chaining) | Unordered, supports key-value mapping | Lookup / Insertion O(1) |
| unordered_set | Hash Table (Separate Chaining) | Unordered, deduplication (only stores keys) | Lookup / Insertion O(1) |
| map | Red-Black Tree | Ordered (sorted by key), key-value mapping | Lookup / Insertion O(log n) |
| set | Red-Black Tree | Ordered, deduplication | Lookup / Insertion O(log n) |
Use <span>unordered_*</span> when you need “unordered + fast lookup”; use <span>map</span>/<span>set</span><span><span> (red-black tree) when you need "ordered + dynamic maintenance".</span></span>
6. Detailed Explanation of std::unordered_map Hash Algorithm Implementation
1. Basic Concepts
The core idea of hash tables
Key → Hash Function → Hash Value (size_t) → Bucket Index → Storage Location
Hash tables map keys to array indices via hash functions, achieving an average time complexity of O(1) for lookups.
2. Structure Composition
Based on the previously seen xhash source code, std::unordered_map includes:
a) Bucket Array
// Conceptual structure_Vec._Mypair._Myval2._Myfirst // Bucket array pointer// Each bucket stores the head and tail pointers of a linked list
-
Bucket Array: A fixed-size array, where each element is a bucket
-
Bucket: Stores elements with the same hash value
b) Linked List (for handling collisions)
// Conceptual structure_List._Mypair._Myval2._Myhead // Head node of the linked list
-
Uses separate chaining to handle collisions
-
Each bucket maintains a linked list to store elements with the same hash value
3. Role of Hash Function
struct string_hash{#ifdef __cpp_lib_string_view using hash_type = std::hash<std::string_view>;#else using hash_type = std::hash<std::string>;#endif using is_transparent = void; std::size_t operator()(const char* str) const { return hash_type{}(std::string(str)); }#ifdef __cpp_lib_string_view std::size_t operator()(std::string_view str) const { return hash_type{}(str); }#endif std::size_t operator()(const std::string& str) const { return hash_type{}(str); }};
Responsibilities of the hash function:
-
Convert any type of key to size_t
-
Distribute uniformly to minimize collisions
-
Same keys must produce the same hash value
Example:
std::unordered_map<std::string, int, string_hash> map;map["hello"] = 42; // Hash process: // 1. "hello" → string_hash::operator()("hello") // 2. Returns size_t value, e.g., 123456789 // 3. Calculates bucket index: 123456789 % bucket_count
4. Bucket Index Calculation
From the source code (xhash:1499):
const size_type _Bucket = _Hashval & _Mask; // Bitwise operation for modulus
Principle:
// Traditional method (conceptually) size_t bucket_index = hash_value % bucket_count; // Actual implementation (optimization) size_t bucket_index = hash_value & mask; // mask = bucket_count - 1 // Requires bucket_count to be a power of 2
Why use bitwise operations?
Faster: & is faster than %
Requirement: The number of buckets must be a power of 2 (e.g., 8, 16, 32, 64…)
5. Collision Handling: Separate Chaining
Data Structure:
Bucket Array:[0] → [key1, val1] → [key2, val2] → nullptr[1] → nullptr[2] → [key3, val3] → nullptr[3] → [key4, val4] → [key5, val5] → [key6, val6] → nullptr...
Collision Example:
std::unordered_map<std::string, int> map;map["apple"] = 1; // hash("apple") % 8 = 3map["banana"] = 2; // hash("banana") % 8 = 3 (collision!)map["cherry"] = 3; // hash("cherry") % 8 = 3 (collision!) // Bucket[3]'s linked list: // [apple, 1] → [banana, 2] → [cherry, 3] → nullptr
6. Implementation of Find Operation
Based on the source code (xhash:1497-1525):
template <class _Keyty>_Find_last(const _Keyty& _Keyval, const size_t _Hashval) const { // 1. Calculate bucket index const size_type _Bucket = _Hashval & _Mask; // 2. Get the starting position of the bucket _Nodeptr _Where = _Vec._Mypair._Myval2._Myfirst[(_Bucket << 1) + 1]._Ptr; // 3. Traverse the linked list to find for (;;) { // 4. Compare keys (using comparator) if (!_Traitsobj(_Keyval, _Traits::_Kfn(_Where->_Myval))) { // Found! return {_Where->_Next, _Where}; } // 5. Continue to the next node _Where = _Where->_Prev; }}
Lookup Process:
1. Calculate hash value: hash(key)2. Calculate bucket index: hash_value & mask3. Traverse the linked list in the bucket: - Compare each node's key - If found, return iterator - If at the end of the list, return end()
Time Complexity:
-
Average case: O(1) – if the hash function distributes uniformly
-
Worst case: O(n) – if all elements are in the same bucket
7. Implementation of Insert Operation
Insertion Process:
// Pseudocode1. Calculate hash value and bucket index2. Check if it already exists in the bucket's linked list3. If exists: - unordered_map: do not insert (unique key) - unordered_multimap: insert (allows multiple same keys)4. If not exists: - Create new node - Insert at the head or tail of the list - Update bucket pointer5. Check if rehashing is needed
From the source code (xhash:1545-1569):
_Insert_new_node_before(const size_t _Hashval, const _Nodeptr _Insert_before, const _Nodeptr _Newnode) { // 1. Update linked list pointers _Insert_after->_Next = _Newnode; _Insert_before->_Prev = _Newnode; // 2. Update bucket pointer if (_Bucket_lo._Ptr == _Head) { // Bucket is empty, set head and tail _Bucket_lo._Ptr = _Newnode; _Bucket_hi._Ptr = _Newnode; } else if (_Bucket_lo._Ptr == _Insert_before) { // New node is the first in the bucket _Bucket_lo._Ptr = _Newnode; } // ...}
8. Load Factor and Rehashing
Load Factor:
load_factor = size() / bucket_count()
-
Default maximum load factor: usually 1.0
-
Triggers rehashing when the load factor exceeds a threshold
Rehashing Process:
// Pseudocode1. Create a larger bucket array (usually double the original)2. Recalculate hash values for all elements3. Redistribute all elements into the new buckets4. Free the old bucket array
From the source code:
// Check if rehashing is requiredif (this->_Check_rehash_required_1()) { this->_Rehash_for_1(); // Perform rehashing // Re-find insertion position _Target = this->_Find_last(_Newnode._Ptr->_Myval.first, _Hashval);}
9. Complete Example: Insertion and Lookup Process
std::unordered_map<std::string, int> map; // Insert "apple" = 1 // 1. hash("apple") = 123456789 // 2. bucket = 123456789 % 8 = 5 // 3. Bucket[5] is empty, insert directly // Bucket Array[5] → [apple, 1] → nullptr // Insert "banana" = 2 // 1. hash("banana") = 987654321 // 2. bucket = 987654321 % 8 = 1 // 3. Bucket[1] is empty, insert directly // Bucket Array[1] → [banana, 2] → nullptr // Insert "cherry" = 3 (assumed to collide with "apple") // 1. hash("cherry") = 123456789 (assumed same) // 2. bucket = 123456789 % 8 = 5 // 3. Bucket[5] already has elements, add to linked list // Bucket Array[5] → [cherry, 3] → [apple, 1] → nullptr // Lookup "apple" // 1. hash("apple") = 123456789 // 2. bucket = 5 // 3. Traverse linked list in Bucket[5]: // - Compare "cherry" != "apple", continue // - Compare "apple" == "apple", found!
struct string_hash{#ifdef __cpp_lib_string_view using hash_type = std::hash<std::string_view>;#else using hash_type = std::hash<std::string>;#endif using is_transparent = void; std::size_t operator()(const char* str) const { return hash_type{}(std::string(str)); }#ifdef __cpp_lib_string_view std::size_t operator()(std::string_view str) const { return hash_type{}(str); }#endif std::size_t operator()(const std::string& str) const { return hash_type{}(str); }};
String Lookup:
std::unordered_map<std::string, size_t, string_hash, std::equal_to<>> map{ {"one", 1} };
When executing map.find(“one”):
1. Calls string_hash::operator()(const char*) → Converts "one" to std::string → Calls std::hash<std::string>("one") → Returns hash value, e.g., 12345 2. Calculates bucket index: 12345 & mask 3. Looks up in the bucket's linked list: - Uses std::equal_to<> to compare "one" with keys in the list - Finds matching node 4. Returns iterator
10. Summary
The core of the hash algorithm in std::unordered_map:
-
Hash Function: Maps keys to size_t
-
Bucket Array: A fixed-size array, where each bucket is a linked list
-
Separate Chaining: Stores collisions in linked lists
-
Bitwise Optimization: Uses & instead of % for bucket index calculation
-
Rehashing: Expands and redistributes when load factor is too high
This design provides an average time complexity of O(1) in most cases, making it an efficient key-value pair container.