When initially creating our website, we knew that we could not launch a viable product without store search functionality. We decided that our implementation should be reusable across different types of items, so that if we ever needed to implement searching for anything other than store products, we could do so using the same codebase. After analyzing the requirements and how to fulfill them, we decided to build the functionality ourselves using a lightweight implementation based on basic concepts of information retrieval. We have since made our implementation open source and available for others to consume directly.
The first half of this dev page will outline how to use our open source library to build a search engine. The second half will introduce the concepts behind the implementation and weigh the advantages and disadvantages of the approach we've taken.
Text Process Library
Our text-process library drives the search functionality implemented on this website. It is available as a consumable package, and the code is available on GitHub:
- Npm Package to directly install and use.
- Github Project to view the source code.
To get started, you can install it directly via npm:
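For example (the package name and export names below are assumed for illustration; check the npm listing above for the published name and the actual API):

```ts
// Install first (package name assumed from the npm listing above):
//   npm install text-process

// The export names below are placeholders standing in for the library's
// actual API; StringDistanceCalculator is discussed later on this page.
import { Tokenizer, StringDistanceCalculator } from "text-process";
```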
Implementing a Basic Search Engine
A basic search engine can be implemented by splitting a search query into tokens, splitting each search subject into tokens, and checking which subject has the most tokens in common with the search query. If we want to let a user search for text in a few text files, the sequence would look like this:
- A user submits a search query.
- The system tokenizes the search query.
- For each text file:
- The system tokenizes the contents of the text file.
- The tokens of the query are compared to those of the text file.
- The system returns the text files in order of how many intersections there are between the query tokens and tokens in each file.
We will go into more detail on what tokens are in the sections below; for now, a token can be considered a special form of a word that makes it easier to perform comparisons against other words. If we were to implement this type of search engine, we could use code such as the following:
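Here is a minimal sketch of that flow. The tokenize helper below is a simplified stand-in; our library's tokenizer, described later on this page, also removes stop words and stems each word:

```ts
// Simplified stand-in tokenizer: lowercase, strip punctuation, split on whitespace.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0);
}

// Rank a single text file: one point for each query token found in the file.
function rankTextFile(queryTokens: string[], fileContents: string): number {
  const fileTokens = tokenize(fileContents);
  let rank = 0;
  for (const queryToken of queryTokens) {
    if (fileTokens.includes(queryToken)) {
      rank += 1;
    }
  }
  return rank;
}

// Rank every file against the query and return them best-first.
function search(query: string, files: { name: string; contents: string }[]) {
  const queryTokens = tokenize(query);
  return files
    .map((file) => ({ file, rank: rankTextFile(queryTokens, file.contents) }))
    .sort((a, b) => b.rank - a.rank);
}
```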
Each text file is assigned some rank according to rankTextFile where a comparison is made between the query tokens and the file tokens. At the end, the results are sorted based on the rank value, then returned in that order. This basic approach can be modified to have different behavior depending upon what is required.
Enhancing the Basic Search Implementation
There are multiple ways the approach above can be enhanced to fine-tune the search algorithm to match the required behavior without changing the approach altogether (a code sketch combining several of these appears after the list):
- Consider the number of times a subject contains any particular query token. Instead of adding 1 or 0 depending on whether the subject contains the token, consider the number of times the token appears. This approach can be tricky, as the number of occurrences by itself doesn't necessarily indicate relevance or lack of relevance, but it can yield more insight into how relevant a particular subject is.
- Give a negative weight if a subject does not contain a certain token. The approach above will only give scores >= 0. An alternative approach might be to give negative weights if certain critical keywords are missing.
- Give a higher weight if a subject does contain a certain token. Instead of treating all tokens equally, we can assign higher values to certain predetermined tokens. We can have explicit configuration for this, or have the system record queries over time and try to automatically determine which tokens should carry a higher weight.
- Filter low ranking results. The example above returns a sorted array of all subjects, but we can choose to make some threshold in which certain results will be filtered out. Examples of filters: rank is less than some absolute value, rank is in some lower percentile of all ranks, etc.
- Check the edit distance between query words for more inclusiveness. Instead of using a "yes or no" strategy to check query word matches, we can use the StringDistanceCalculator provided by the library to check how "close" strings are to each other. We can assign a coefficient based on how similar strings are, letting us catch misspelled words, similar words, etc.
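A sketch combining several of these enhancements follows. The weight table, penalty, and threshold values are illustrative only; an edit-distance coefficient computed with the library's StringDistanceCalculator could replace the strict equality check used here:

```ts
// Illustrative configuration: boost certain tokens and filter weak results.
const TOKEN_WEIGHTS: Record<string, number> = { refund: 3, shipping: 2 };
const MIN_RANK = 1;

function rankSubject(queryTokens: string[], subjectTokens: string[]): number {
  let rank = 0;
  for (const queryToken of queryTokens) {
    // Count occurrences rather than making a simple yes/no check.
    const occurrences = subjectTokens.filter((t) => t === queryToken).length;
    const weight = TOKEN_WEIGHTS[queryToken] ?? 1;
    if (occurrences > 0) {
      rank += occurrences * weight;
    } else {
      rank -= 1; // penalize subjects missing a query token entirely
    }
  }
  return rank;
}

function searchSubjects(queryTokens: string[], subjects: string[][]) {
  return subjects
    .map((tokens, index) => ({ index, rank: rankSubject(queryTokens, tokens) }))
    .filter((result) => result.rank >= MIN_RANK) // filter low-ranking results
    .sort((a, b) => b.rank - a.rank);
}
```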
Searching Over Objects with Multiple Properties
So far we have only searched over string subjects, but a common search requirement is to be able to search over objects with multiple properties. For example, our dev pages have multiple properties such as title, tags, description, and categories. Our search algorithm considers all of these properties when ranking the pages and returning the results by assigning a weight to each property, then running a token comparison for each property:
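The sketch below illustrates the idea; the property names match our dev pages, but the weights are illustrative, and tokenize is the stand-in helper from the earlier example:

```ts
interface DevPage {
  title: string;
  tags: string;
  description: string;
  categories: string;
}

// Illustrative weights: a title match counts more than a description match.
const PROPERTY_WEIGHTS: { property: keyof DevPage; weight: number }[] = [
  { property: "title", weight: 4 },
  { property: "tags", weight: 3 },
  { property: "categories", weight: 2 },
  { property: "description", weight: 1 },
];

function countMatches(queryTokens: string[], subjectTokens: string[]): number {
  return queryTokens.filter((token) => subjectTokens.includes(token)).length;
}

// Run the token comparison once per property, scaled by that property's weight.
function rankPage(queryTokens: string[], page: DevPage): number {
  return PROPERTY_WEIGHTS.reduce(
    (rank, { property, weight }) =>
      rank + weight * countMatches(queryTokens, tokenize(page[property])),
    0,
  );
}

function searchPages(query: string, pages: DevPage[]) {
  const queryTokens = tokenize(query);
  return pages
    .map((page) => ({ page, rank: rankPage(queryTokens, page) }))
    .sort((a, b) => b.rank - a.rank);
}
```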
The code example above is a simplified version of the way our search engine is implemented for our store catalog and dev pages. The sections below will dive into the details and concepts behind our tokenizer implementation.
Stemming
Within our text-process library, we have leveraged an open source implementation of a stemmer which is used by the tokenizer. Stemming is the process of converting a word into some common base form of that word. This will allow us to map any form of a particular word, such as "running", to a base form, such as "run". Stemming on its own will sometimes give an invalid word and does not consider the context in which the word is present. The list below shows a few example mappings:
- running -> run
- error -> error
- worrying -> worri (invalid word)
- exemplifying -> exemplifi (invalid word)
- contained -> contain
For our purposes, it does not matter whether the stemmed word, the stem, is itself a valid word, so long as the search engine consistently maps the same words to the same stems. After the search engine stems each word in the query and the subject, it can run a comparison against the stemmed words, thus ensuring we can detect different versions of the same word within a reasonable degree of accuracy. There is a similar process called lemmatization which takes this a step further by considering the context in which a particular word appears, such as noun vs. verb, and is able to convert the word to its correct base word, called the lemma. While more accurate, lemmatization is also more expensive to perform, so we've chosen not to use it for our search engine.
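As an illustration only, a toy rule-based stemmer that reproduces the mappings above might look like the sketch below; the open source stemmer our library actually uses applies a much richer rule set:

```ts
// Toy stemmer for illustration only: it reproduces the example mappings above
// but is nowhere near as thorough as a real stemming algorithm.
function toyStem(word: string): string {
  if (word.endsWith("ying")) {
    return word.slice(0, -4) + "i"; // worrying -> worri
  }
  if (word.endsWith("ing")) {
    const stem = word.slice(0, -3); // running -> runn
    // Collapse a doubled final consonant: runn -> run.
    return stem[stem.length - 1] === stem[stem.length - 2]
      ? stem.slice(0, -1)
      : stem;
  }
  if (word.endsWith("ed")) {
    return word.slice(0, -2); // contained -> contain
  }
  return word; // error -> error
}
```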
Tokenizing
Tokenizing is the process of converting a string phrase into a list of individual tokens. In this context, a token is an alphanumeric stemmed word. Our tokenizing process looks something like this (a code sketch follows the list):
- Replace all instances of characters such as dashes and underscores with spaces.
- Remove all other non-alphanumeric characters.
- Make the entire string lowercase.
- Replace all instances of one or more whitespace characters with a single space character.
- Split the string into an array of strings with the space character as the delimiter.
- Filter the list to remove stop words. Stop words are highly common words, such as "the", "and", "I". These words can be filtered out of strings when performing processing, analysis, or other operations in which they are superfluous.
- Map the remaining words to their stemmed form using the Stemmer.
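Here is a sketch of that pipeline, reusing toyStem from the stemming section as a stand-in for the library's stemmer; the stop-word list is truncated for brevity:

```ts
// Truncated stop-word list for illustration; a real list is much longer.
const STOP_WORDS = new Set(["the", "and", "i", "a", "an", "of", "to"]);

function tokenizePhrase(phrase: string): string[] {
  return phrase
    .replace(/[-_]/g, " ")          // dashes and underscores become spaces
    .replace(/[^a-zA-Z0-9\s]/g, "") // remove all other non-alphanumeric characters
    .toLowerCase()                  // make the entire string lowercase
    .replace(/\s+/g, " ")           // collapse runs of whitespace to a single space
    .trim()
    .split(" ")                     // split on the space delimiter
    .filter((word) => word.length > 0 && !STOP_WORDS.has(word)) // drop stop words
    .map(toyStem);                  // map each remaining word to its stem
}
```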
Once the tokenizing process is complete, comparisons between queries and search subjects become much simpler.
Advantages and Disadvantages
The advantages of our approach:
- No indexing is required. Some search implementations require that subjects be indexed in some capacity ahead of time so that incoming queries are run against the index rather than the actual search subjects. We could take this approach ourselves by indexing the tokens for each subject instead of tokenizing them on each search, but tokenizing is not an expensive operation compared to processes such as lemmatization, so indexing is generally not required. Thus, we do not need to concern ourselves with maintaining an index in a database or file.
- The implementation is lightweight. We can implement a new search engine for a particular set of subjects or documents relatively quickly using this approach. It is also easier to maintain the codebase as a result.
- The computation requirements are low compared to other search implementations. The system can perform a search without using too many system resources.
The disadvantages of our approach:
- Users are limited to keyword-style searches. For example: the user cannot submit a search such as 'what dev pages talk about web design?' and get a result set that makes sense as an answer to that question. They will still get results which include some of those keywords, but the results may not fully answer their question.
- There is no implementation for synonyms. If the user is asking for something using a synonym of some word which is not present in a subject, that subject will not be returned as a part of the result set. We could solve this using indexing and/or lookup table strategies, but that falls outside of the scope of this article.
Overall, we have traded query expressiveness for simplicity and speed.
Improving Performance with Binary Search and Sets
The code examples above got the intersection of two arrays by iterating over the second array for each item in the first array. In the context of running a search: the algorithm would iterate over the tokens of a subject once per every query word. If we have n query tokens and m subject tokens, we will perform n * m operations. This is not a problem if the query and subject both contain a small number of tokens, but if either of them have the possibility of containing more than a small number of elements, it would be wise to improve that performance.
This can be made more efficient by putting the tokens in sets, then checking for each query token using fast, constant-time set lookups. However, this requires exact matches, as checking for inclusion in a set requires the exact value to compare. This could be made more sophisticated by using a map, for example, to get a set of related terms for a particular token, then comparing against those sets as well.
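A sketch of the set-based ranking, which does roughly n + m work instead of n * m:

```ts
// Build the subject's token set once (O(m)); each query-token lookup is then
// a constant-time membership check (O(n) total).
function rankWithSet(queryTokens: string[], subjectTokens: string[]): number {
  const subjectSet = new Set(subjectTokens);
  let rank = 0;
  for (const queryToken of queryTokens) {
    if (subjectSet.has(queryToken)) {
      rank += 1;
    }
  }
  return rank;
}
```

Note that this keeps the exact-match limitation described above: a misspelled query token will never be found in the set.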
Certain types of trees may also help accomplish this goal. These optimizations are worth briefly mentioning, but further exploration of these concepts falls outside of the scope of this article.
Further Reading
The book Introduction to Information Retrieval provides an excellent introduction to concepts related to searching, querying, and the topics discussed above. It also covers additional topics in great detail, including:
- Stateless Search Engines
- Document Indexing
- Document Ranking and Scoring
- Text Classification
- Web Crawling
In addition to introducing these topics, the book discusses implementation details and practical efficiency improvements, as well as the overall practicality and use cases for each type of approach. Coverage ranges from simple implementations to enterprise-level patterns: the reader can read the first few chapters to gain a basic understanding, enough to implement a basic search engine themselves, or read through to the end for a more complete understanding of all of the topics related to querying, indexing, searching, and crawling.