The hash table is one of the most fundamental and widely used data structues. It supports fast insertions and retrievals by key-value pair. In this article, we'll introduce the data structure, discuss the core concepts, then show a full implementation in multiple languages.
Hash Table
A hash table, also referred to as hash map, dictionary, or associative array, is a data structure which holds a mapping between keys and values. Hash tables provide efficient key indexing and are useful in situations where values need to be accessed via key index. A few common use cases include:
- Categorizing data: if we want to categorize certain types of data, we can store the data in a hash table, where the key represents the category and the value represents an object or collection of objects which belong to that category.
- Caching by identifier: to speed up lookups, we can create a redundant copy of data we already have in an array or other data structure using a hash table. The key will be some property of the value we want to index, and the value will be the object itself.
- Representing a hierarchical structure: hash tables can themselves store other hash tables as values. Therefore, we can use hash tables in a similar manner to trees and store hierarchical structures. However, this may or may not be a suitable way to store such a structure depending upon the programming language. A JavaScript object is well suited for this purpose, but our hash table implementation in this article is not as well suited.
Implementing a Hash Table
We will implement our hash table using a class which holds the items within an array. We will implement our class so that we can use a key to index a position within the array. Since arrays can only be accessed by numeric indices, we will use a process called hashing to map any given key to an index in our array. When we insert, get, or remove items, the key will be hashed to the same array index and we will be able to directly access the corresponding array index in O(1) time. If at any point, two keys hash to the same index, we'll use a resolution strategy called seperate chaining to ensure we don't accidentally overwrite values. Lastly, we will resize our internal array as the number of items grows to help avoid collisions as our structure grows. We'll cover each point in more detail in the sections below.
Hashing Algorithm
A hashing algorithm is an algorithm which takes some object, usually a string, and generates an integer value. It will always generate the same output for the same input, which means that such an algorithm cannot include randomness or factors which will change over time. Many programming languages have built-in hashing algorithms available for use. In our implementation here, we will create a custom hashing algorithm which will take the character code of each character in the key, hold a value for the sum of the codes and a value for the product, add the sum and product, and use that total as the hash. If an overflow error happens at any point, we can reset to 0 and proceed again. This is a decent hashing algorithm for demonstration purposes, as the sum + product, instead of just one, will help prevent collisions, but in a production system, a better, or built-in, algorithm should be used instead.
Once the hashing algorithm is in place, we can use it to determine the index in our internal array where we will be storing the value associated with the key. To find the index: hash the key into a number, then take the modulus of the key with the size of our internal array: int index = hashCode % internalArray.length
.
Hash Collisions
A collision is a situation which arises when two seperate keys are hashed to the same value. A perfect hashing algorithm would prevent this situation entirely, but it is not possible to create a perfect algorithm, so any hash table implementation must be able to handle collisions. There are two types of collisions which can occur:
- The hashes of the two keys, modulus the internal array size, map to the same index: this is dependent upon the internal state of the data structure and can be resolved by resizing the internal array.
- The hashing algorithm itself hashes two keys to the same number: this will cause an issue regardless of the state of the structure and can only be resolved if the hashing algorithm is changed.
A decent hashing algorithm will mitigate the first issue, and occasionally resizing the inner array will mitigate the second. However, since there is no way to prevent a collision, we must account for it. The section below outlines how to handle collisions with seperate chaining.
Seperate Chaining
To handle the issue of collisions, seperate chaining uses a linked list to store each key value pair of keys which hash to the same value. In the internal array, instead of storing values directly, we'll store a linked list node reference to a node which holds the key and value of the first item inserted with its hash value. At that point, when a collision occurs, subsequent entries can be appended to the end of the list. When inserting, retrieving, and deleting, we'll need to do the normal process of hashing and accessing the appropriate array index, then iterate over the linked list until we find the value we're looking for. Note: if you are not familiar with linked lists, please visit our introduction to linked lists.
Time Complexity
In the table below, we'll take a look at the average case and worst case time complexity of hashed table operations.
Operation | Average Case Complexity | Worst Case Complexity | Notes |
---|---|---|---|
Get Value by Key | O(1) | O(n) | Worst case accounts for hash collisions. |
Search for Value | O(n) | O(n) | Hash tables, by default, have no way of searching over values themselves. |
Set Key/Value | O(1) | O(n) | Worst case accounts for hash collisions. |
Delete Key | O(1) | O(n) | Worst case accounts for hash collisions. |
Full Implementation
Our implementation below brings everything together: internal array for O(1) access, simple hashing algorithm for getting the array index for a key, seperate chaining for handling collisions, and internal array resizing when the structure grows. Every time the structure reaches 2/3 capacity, the internal array will double in size and all items will be re-assigned: