The hash table is one of the most fundamental and widely used data structues. It supports fast insertions and retrievals by key-value pair. In this article, we'll introduce the data structure, discuss the core concepts, then show a full implementation in multiple languages.
- Categorizing data: if we want to categorize certain types of data, we can store the data in a hash table, where the key represents the category and the value represents an object or collection of objects which belong to that category.
- Caching by identifier: to speed up lookups, we can create a redundant copy of data we already have in an array or other data structure using a hash table. The key will be some property of the value we want to index, and the value will be the object itself.
Implementing a Hash Table
We will implement our hash table using a class which holds the items within an array. We will implement our class so that we can use a key to index a position within the array. Since arrays can only be accessed by numeric indices, we will use a process called hashing to map any given key to an index in our array. When we insert, get, or remove items, the key will be hashed to the same array index and we will be able to directly access the corresponding array index in O(1) time. If at any point, two keys hash to the same index, we'll use a resolution strategy called seperate chaining to ensure we don't accidentally overwrite values. Lastly, we will resize our internal array as the number of items grows to help avoid collisions as our structure grows. We'll cover each point in more detail in the sections below.
A hashing algorithm is an algorithm which takes some object, usually a string, and generates an integer value. It will always generate the same output for the same input, which means that such an algorithm cannot include randomness or factors which will change over time. Many programming languages have built-in hashing algorithms available for use. In our implementation here, we will create a custom hashing algorithm which will take the character code of each character in the key, hold a value for the sum of the codes and a value for the product, add the sum and product, and use that total as the hash. If an overflow error happens at any point, we can reset to 0 and proceed again. This is a decent hashing algorithm for demonstration purposes, as the sum + product, instead of just one, will help prevent collisions, but in a production system, a better, or built-in, algorithm should be used instead.
Once the hashing algorithm is in place, we can use it to determine the index in our internal array where we will be storing the value associated with the key. To find the index: hash the key into a number, then take the modulus of the key with the size of our internal array:
int index = hashCode % internalArray.length.
A decent hashing algorithm will mitigate the first issue, and occasionally resizing the inner array will mitigate the second. However, since there is no way to prevent a collision, we must account for it. The section below outlines how to handle collisions with seperate chaining.
To handle the issue of collisions, seperate chaining uses a linked list to store each key value pair of keys which hash to the same value. In the internal array, instead of storing values directly, we'll store a linked list node reference to a node which holds the key and value of the first item inserted with its hash value. At that point, when a collision occurs, subsequent entries can be appended to the end of the list. When inserting, retrieving, and deleting, we'll need to do the normal process of hashing and accessing the appropriate array index, then iterate over the linked list until we find the value we're looking for. Note: if you are not familiar with linked lists, please visit our introduction to linked lists.
In the table below, we'll take a look at the average case and worst case time complexity of hashed table operations.
|Operation||Average Case Complexity||Worst Case Complexity||Notes|
|Get Value by Key||Worst case accounts for hash collisions.|
|Search for Value||Hash tables, by default, have no way of searching over values themselves.|
|Set Key/Value||Worst case accounts for hash collisions.|
|Delete Key||Worst case accounts for hash collisions.|
Our implementation below brings everything together: internal array for O(1) access, simple hashing algorithm for getting the array index for a key, seperate chaining for handling collisions, and internal array resizing when the structure grows. Every time the structure reaches 2/3 capacity, the internal array will double in size and all items will be re-assigned: