Recently I've been working on a side project that has a search component at its core. I pull a set of data from an external source, load it into my database, and push it to Elasticsearch. There are a handful of fields I need to search over, and I wanted a Google-style search (a single search bar) rather than a faceted search. This means I have to search for a set of terms over multiple fields. Luckily, this is something Elasticsearch supports.
Since my dataset is fairly large and fairly complex, I want to keep things simple. Ideally I would build a set of keywords from the incoming dataset, tag each item with its appropriate keywords, and then push the data into Elasticsearch, indexing the keyword field. Since I'm working on this project alone, I decided that was something I could push back to a later date. I still had to build all of the infrastructure to bring the data in, transform it, push it to Elasticsearch, and then build an algorithm to find it. That's a lot of work for one person, so I wanted to build the basics, deploy, and iterate.
After spending some time thinking and reading about the problem, I looked at both field-centric and term-centric queries. I first looked at field-centric queries, but these try to find the fields that best match any of the words. What I really wanted was to find the most words matched across ALL the fields. Field-centric search caused a problem because a term found in multiple fields could outscore a single field that contained most of the terms. Let's look at an example.
I'm going to use an example dataset based on the project I'm working on: a search engine to help me discover new comic books and manage my current collection. So let's take an example query:
Q: The Walking Dead
This is a pretty basic query because it contains the name of a comic book series, and a match should be found in the title. However, a problem can arise with field-centric search when multiple fields each contain some of the terms, as opposed to one field containing all of them. Let's look a little closer. Assume we have the documents below.
(full_title: "The Walking Dead 150", writers: "Robert Kirkman", pencils: "Charlie Adlard", ...)
(full_title: "The Living Dead", writers: "John Walking", pencils: "John Walking", ...)
As you can see, the second document has the term Walking in the writers and pencils fields and Dead in the title. So we have 2 terms matching across 3 fields. Since we are doing a field-centric search, this document could end up with a higher score. For my use case, that doesn't make sense.
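To make the contrast concrete, here is a minimal sketch (not from the original post; the helper name and the three-field list are mine) of the two multi_match request bodies side by side. `best_fields` is the field-centric mode: it scores each field separately and takes the best single field. `cross_fields` is the term-centric mode: it blends the fields' term statistics and only requires each term to appear in at least one of the fields.

```python
# Sketch of field-centric vs. term-centric multi_match bodies.
# Field names mirror the example documents above.

def multi_match_body(term, match_type):
    """Build a multi_match query body for the given type."""
    return {
        "multi_match": {
            "query": term,
            "type": match_type,      # "best_fields" or "cross_fields"
            "operator": "and",       # every term is required
            "fields": ["full_title", "writers", "pencils"],
        }
    }

field_centric = multi_match_body("The Walking Dead", "best_fields")
term_centric = multi_match_body("The Walking Dead", "cross_fields")
```

Note that with `best_fields` the `and` operator requires all terms in a single field, while with `cross_fields` the terms only need to appear somewhere across the listed fields, which is exactly the behavior the example documents call for.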
I do realize there are a lot of non-specific queries (an example might be "newest horror comics") that won't be exact matches, but after a bunch of testing, term-centric search works really well for these too. Now, let's talk a little more specifically about term-centric searches.
Term-centric search allows us to search for all terms across all fields. Elasticsearch treats all of your indexed fields as one big field and looks for each term in any of them. The big difference is that each term must appear in any of the fields, as opposed to all the terms being found in a single field (field-centric). Elasticsearch implements this with the multi_match query and the cross_fields type.

Setting this up in Elasticsearch is pretty straightforward. All you really need to do is set the query type to multi_match and the type to cross_fields. Below is the implementation I wrote for my current project, which is built in Django using the elasticsearch_dsl library (a lightweight library for interacting with Elasticsearch that abstracts away some of the boilerplate).
```python
def _search(self, term, page, per_page):
    """
    Term-centric search across an issue's full_title, writers, pencils,
    publisher_name and genre fields. The search results are scored by
    TF/IDF and sorted by their on_sale_date.
    """
    # elasticsearch_dsl maps a Python slice onto the Elasticsearch
    # from/size parameters (size = stop - start), so we compute the
    # absolute start and stop offsets for this page.
    # https://github.com/elastic/elasticsearch-dsl-py/blob/master/elasticsearch_dsl/search.py
    page_from = (page - 1) * per_page
    page_size = page * per_page or per_page
    query = Q(
        'multi_match',
        query=term,
        type='cross_fields',
        operator='and',
        fields=['full_title', 'writers', 'pencils', 'publisher_name', 'genre']
    )
    # Call Elasticsearch, paginated to per_page results at a time.
    search = Search(Elasticsearch()).query(query).sort('-on_sale_date')[page_from:page_size]
    logger.info("Search query for %s has been made: %s" % (term, query))
    return search.execute()
```
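The pagination arithmetic is easy to get wrong, so here is a standalone sketch (the function name is mine, not from the post) of how the 1-indexed page number maps to the absolute slice offsets that elasticsearch_dsl turns into from/size: page 1 is [0:per_page], page 2 is [per_page:2*per_page], and so on.

```python
# Sketch of the page -> slice-offset mapping used in _search above.

def page_bounds(page, per_page):
    """Return the (start, stop) slice offsets for a 1-indexed page."""
    start = (page - 1) * per_page
    stop = page * per_page
    return start, stop

print(page_bounds(1, 10))  # (0, 10)
print(page_bounds(3, 10))  # (20, 30)
```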
The main part to pay attention to is the query:
```python
query = Q(
    'multi_match',
    query=term,
    type='cross_fields',
    operator='and',
    fields=['full_title', 'writers', 'pencils', 'publisher_name', 'genre']
)
```
We set the query type to multi_match, set the query to the term parameter passed into the search function, set the type to cross_fields, use the and operator (which tells Elasticsearch that all terms are required), and set the fields parameter to the list of fields we'd like to search over.
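For reference, elasticsearch_dsl's Q objects serialize to plain dictionaries (via to_dict()), and the Q(...) call above corresponds to roughly the following request body. The dict is written out by hand here (with an example value for term) so the sketch runs without the library installed.

```python
# Approximate request body produced by the Q('multi_match', ...) call.

term = "The Walking Dead"  # example query string

query_body = {
    "multi_match": {
        "query": term,
        "type": "cross_fields",
        "operator": "and",
        "fields": ["full_title", "writers", "pencils",
                   "publisher_name", "genre"],
    }
}
```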
You should now have a fully functional term-centric search. Be aware that there are situations where you may not get a "full" result set back. For example, when an exact match is found, this logic will return only that one result. That works perfectly for my use case, but it may be a problem if you want as many results as possible.
Overall, I find term-centric search perfect for what I'm doing at this stage. I'm able to search for all terms across all fields with high-quality results. As I get further along in my project, I plan on testing a keyword-based search where I can properly tag each document and then search over the set of keywords. I'm not making that a priority, since term-centric search is working great so far.