A trie is an interesting type of data structure which represents words using a tree of characters. It supports fast insertions and deletions and is well suited for autocomplete functionality. In this article, we'll introduce the data structure, outline the core concepts, discuss how to implement insertions, deletions, etc, then show a full implementation in multiple languages. If you'd like to skip to the code, feel free to jump to the bottom of the article.
A trie, also known as a digital tree or prefix tree, is a type of search tree where each node stores individual characters in such a way that words are represented via parent-child node relationships. This structure is generally used for strings but can also be used for other types of data. Unlike many other commonly used tree structures, tries are not binary trees by design. We can add words by recursively inserting a node for each character in the word, check if a word exists via depth-first search, and delete a word by traversing the structure, then removing certain nodes and markers. A trie is a natural fit for autocomplete: given a prefix, find how many words in the trie begin with that prefix.
The diagram below shows a trie tree which has the words ant, ants, ate, bat, bats, born, bot, and bots. The asterisk "*" marks that a particular node represents the end of a word.
All of these operations should be efficient; we should not need to iterate over every word in the trie to perform these operations. To satisfy these requirements, we will internally use a tree, where each node represents a character in one or more words. Each node will have one child for every word which has all nodes upto the current node as a prefix. In order to quickly access child nodes by character, we will use a map structure to provide O(1) access. If you are not familiar with maps, please visit our introduction to hash tables.
Insert a Word
- If it does not, create a new node with that characfter and add it as a child of visitNode.
We will mark a node as an end of a word by setting a child reference's key to an asterisk "*" while setting its value to null, true, or some other non-node value. If we wanted to accept "*" as a character itself, we would need to find some other way to mark the end of a word, possibly by creating a seperate boolean field on the node object itself.
Check if a Word Exists
- If it does not, return false and stop searching, as the word does not exist.
- If it is marked as the end of a word, return true.
- Otherwise, return false.
This algorithm is similar to the algorithm for inserting a word.
Delete a Word
- If it does, set visitNode to that node.
- Otherwise, set isSearching to false.
- Unmark visitNode as the end of a word.
- Create a variable prevNode.
- While nodeStack is not empty:
- Set prevNode to visitNode.
- Pop from nodeStack and set visitNode to that popped value.
- If prevNode is empty, remove visitNode's reference to it.
Note that unlike the other operation, we are using a stack. If we were using a recursive function implementation, the stack would be implicitly represented by the call stack, but since our approach uses looping, we need to explicitly use our own stack object. We need to move through the entire path before we can determine which nodes can and cannot be entirely removed.
Get a List of Words Which Start with a Prefix
- If it does not, set isSearching to false.
- Get all of the next nodes visitNode points to.
- For each next node, add it to the nodeQueue and add prefix + the next node's character to queuePrefixes.
- If visitNode is marked as the end of a word, add prefix to words.
- Shift nodeQueue and set visitNode to the shifted value.
- Shift queuePrefixes and set prefix to the shifted value.
We can use the same approach to get a list of all words which exist in the trie. In our implementation shown at the bottom of this article, we use an empty prefix to query all words.
Statistics and Metadata for Words
In the beginning of this article, we stated that this data structure can be used to implement an efficient autocomplete functionality. As we've described the operations so far, we are only storing the words themselves, not any data about which ones are more common in natural language, which ones are searched for most often, etc. We can store this information directly within the trie structure by allowing nodes which are marked as ends of words to carry this additional information. Then, when we retrieve a list of words which start with some prefix, or all words in the structure, we can return an object which encapulates a word and its metadata instead of just a list of words.
Space Time Complexity
In our complexity analysis, we will use n to refer to the number of nodes in the trie structure and k to refer to the number of characters in an input string. The table below outlines the complexity for the operations described above.
|Space||One node per character per word, though some words will share nodes.|
|Insert a Word||Complexity is based on the length of the string to insert.|
|Check if a Word Exists||Complexity is based on the length of the string to insert.|
|Remove a Word||Complexity is based on the length of the string to insert.|
|Get Words with Prefix||Worst case scenario, we fetch every word in the trie and touch each node once.|
In this implementation, we will use iterative algorithms instead of recursive algorithms to perform the operations discussed in the previous sections. Our TrieNode class will encapsulate the node's requirement of having a map of children, being marked as the end of a word, and holding a character value. The CustomTrie class will hold a root node with an empty/throwaway character. Its chilren will correspond to the first letter of each word it holds.