As a software company, we are not always versed in the exact components a solution requires up front. This is where our innovative Zymr culture shines. For example, a recent project required a search function that spanned the entire application. Similar to the Facebook omnisearch bar, which lets visitors find other users, groups, posts and so on, we needed our users to be able to search easily for most of the objects in the system. The application was in the sports domain, so the search covered things like player names, team names, and league names. One critical requirement was speed: users should wait a few seconds at most for results, not minutes (see The Importance of Real-Time Big Data).
As in any custom development project, we had our work cut out for us. Because we were already using MongoDB, our first thought was to rely on MongoDB indexes. We quickly realized that the body of available sports data was too vast for this approach, so we began to research dedicated search frameworks and techniques. We narrowed the field to two options: Lucene/Solr and ElasticSearch. Both are strong open-source projects with different advantages, which made the choice difficult, but we finally decided to go with ElasticSearch.
ElasticSearch appealed to us because it is built on Lucene, so a lot of the heavy lifting has already been done and distilled into a simple search API. We created a river and indices for our collections and worked out how to query ElasticSearch for the data we needed. We were happy with the speed at which it returned results for a search term. We also created tags/tokens for the different types of objects, making it easier for users to get relevant search results.
ElasticSearch also provides various advanced options. For example, a query can specify whether to match whole words, prefixes, or both. Certain results can also be boosted, so that whole-word matches rank above prefix-only matches and appear at the top of the results. These are just a few of the features ElasticSearch provides to enhance the search experience.
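As a rough illustration of combining whole-word and prefix matching with a boost, the sketch below builds a search request body as a plain Python dict. The `bool`, `match`, and `prefix` clauses are standard ElasticSearch query DSL, but the field name `name`, the helper `build_name_query`, and the boost value are assumptions made for this example, not the queries used in the project.

```python
def build_name_query(text, field="name", whole_word_boost=2.0):
    """Build a search body that ranks whole-word matches above prefix matches.

    The field name and boost value are hypothetical; tune them for real data.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # Whole-word match, boosted so these hits score higher.
                    {"match": {field: {"query": text, "boost": whole_word_boost}}},
                    # Prefix match with the default boost. Prefix queries are not
                    # analyzed, so the text is lowercased here to line up with
                    # lowercased index terms.
                    {"prefix": {field: {"value": text.lower()}}},
                ],
                # At least one of the two clauses must match.
                "minimum_should_match": 1,
            }
        }
    }


body = build_name_query("Real")
```

The resulting dict could be sent as the body of a search request by whichever client is in use.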
These built-in options work well for simple scenarios but may not suit more demanding requirements. What follows explains the solution we built to deliver the search experience we needed.
Search for any object (i.e., team, player, or league) from a single search box. The search should support the following:
We were using the default ElasticSearch settings to analyze, index, and search the data. The relevant component here is the "tokenizer": the mechanism that creates the tokens for a field, against which a query runs to find matching results. The default analyzer uses the standard tokenizer, which splits a field value on word boundaries and so produces only whole-word tokens. This was a problem for us because a partial word matches none of those tokens, and in our setup results came back only when the name was typed in exactly the same order as it was stored in the database. This did not meet the requirements established above.
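To make the role of tokens concrete, here is a small pure-Python sketch. It is not ElasticSearch code, and the function names are invented for illustration. It compares a value stored as one single token (what a keyword-style tokenizer emits) with the same value split into word tokens, and shows which exact term lookups succeed in each case.

```python
def single_token(value):
    # The whole field value becomes one lowercase token.
    return [value.lower()]


def word_tokens(value):
    # One lowercase token per whitespace-separated word.
    return value.lower().split()


def term_matches(search, tokens):
    # An exact term lookup succeeds only if the search text equals a stored token.
    return search.lower() in tokens


name = "Lionel Messi"

exact_order = term_matches("Lionel Messi", single_token(name))   # True
swapped_order = term_matches("Messi Lionel", single_token(name))  # False
whole_word = term_matches("Messi", word_tokens(name))             # True
partial_word = term_matches("Mes", word_tokens(name))             # False
```

Neither layout handles a partial word such as "Mes", which is the gap the tokenizers discussed next are meant to close.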
We researched the analyzers and tokenizers available for ElasticSearch and found a few that looked suitable: ngram, whitespace, pattern, and keyword. Each seemed close enough to our requirements to be worth trying.
After trying each tokenizer to see which best suited our purposes, we settled on ngram. It was very useful because it can be configured to generate many tokens (substrings of the value) for each field, making it easy to match a search term against any of those tokens and find the corresponding item.
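The sketch below illustrates the idea. The `ngrams` function is a pure-Python approximation of the substrings an ngram tokenizer produces for a single word (the token order may differ from ElasticSearch's own). The settings dict shows how a custom analyzer with an `ngram` tokenizer is typically declared. The names `name_ngram` and `name_analyzer` and the gram sizes are assumptions for this example, not the project's actual configuration, and older ElasticSearch releases spell the tokenizer type `nGram`.

```python
def ngrams(word, min_gram=2, max_gram=3):
    """Return every substring of `word` with length min_gram..max_gram."""
    word = word.lower()
    return [
        word[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(word) - n + 1)
    ]


tokens = ngrams("Messi")
# tokens == ["me", "es", "ss", "si", "mes", "ess", "ssi"]

# Hypothetical index settings declaring an ngram-based analyzer.
NGRAM_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "name_ngram": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 3,
                    # Only letters and digits form grams, so grams never
                    # span the space between two words.
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "name_analyzer": {
                    "type": "custom",
                    "tokenizer": "name_ngram",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}
```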
We recreated all of our indices in ElasticSearch with the ngram tokenizer for both analysis and search. One challenge we ran into was that this setup returned more items than we wanted. Because every item now had many tokens, a search string would match tokens belonging to many different items, and all of those items came back in the results.
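A small pure-Python simulation shows how this over-matching can arise. It assumes two-character grams and a match whenever the search text and a name share at least one gram. The sample names and helper functions are made up for illustration.

```python
def bigrams(text):
    """Collect the two-character grams of every word in `text`."""
    grams = set()
    for word in text.lower().split():
        grams.update(word[i:i + 2] for i in range(len(word) - 1))
    return grams


def any_gram_matches(search, names):
    # A name is returned if it shares at least one gram with the search text.
    search_grams = bigrams(search)
    return [name for name in names if search_grams & bigrams(name)]


names = ["Manchester United", "Manchester City", "Roman Santos", "Real Madrid"]

hits = any_gram_matches("man", names)
# "man" yields the grams "ma" and "an". Every sample name contains at least
# one of them, so all four names are returned, although only two of them
# actually start with "man".
```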
After further research, we decided to reduce the number of tokens generated and to align our tokens more closely with the way users actually search. We achieved this with the wildcard query in the Java API, which made it possible to match any prefix or suffix of a token and return the correct result. We also built queries that try several combinations of the search string, such as searching for the exact string and then reversing the word order and searching again. This made the order in which a user typed a name irrelevant to the quality of the search results.
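The approach of trying several word orders with a wildcard query can be sketched as follows. This is Python rather than the Java API mentioned above, and it is only an approximation of the idea. It assumes a hypothetical field `name_raw` that stores the whole lowercased name as a single term, since a wildcard pattern is matched against individual terms. The helper names and the three-word cap are also assumptions.

```python
from fnmatch import fnmatchcase
from itertools import permutations


def wildcard_patterns(text, max_words=3):
    """Return one '*...*' pattern per ordering of the typed words."""
    # Cap the word count: the number of orderings grows factorially.
    words = text.lower().split()[:max_words]
    return ["*" + " ".join(order) + "*" for order in permutations(words)]


def build_wildcard_query(text, field="name_raw"):
    # One wildcard clause per word order; any single clause may match.
    clauses = [
        {"wildcard": {field: {"value": pattern}}}
        for pattern in wildcard_patterns(text)
    ]
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


patterns = wildcard_patterns("Messi Lionel")
# patterns == ["*messi lionel*", "*lionel messi*"]

# Local check of the matching logic with shell-style wildcards,
# standing in for how the stored term would be compared.
stored = "lionel messi"
found = any(fnmatchcase(stored, pattern) for pattern in patterns)  # True
```

Leading wildcards can be slow on large indices, so a setup like this is worth measuring against real data.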
These kinds of solutions came together and enabled us to meet our initial requirements. This project was a great learning experience for us, and we were pleasantly surprised by the robust functionality afforded by ElasticSearch and its various supporting tokenizers.