Basic understanding of text search in Elasticsearch

December 19th, 2016 by Dieter Vanden Eynde

Elasticsearch gets a lot of its power from how it works with analyzers and inverted indices. These inverted indices store your text data in a format optimized for search and allow for very fast lookups. Not understanding how these inverted indices are used in text search will most likely lead to confusing search results down the road.

Analyzers

Elasticsearch allows you to define how text data is processed. With analyzers you define how your data is manipulated, either before it is stored or at search time (more on that later). This gives you great control over how your data is used in search. Analyzers are most commonly used to standardize text, for example by lowercasing it or converting special characters (é => e).

Analyzing happens in 2 stages (simplified):

  1. split the string into chunks (tokenizing)
  2. apply some formatting on each of those tokens (lowercasing, convert special chars, …)
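These two stages can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch's actual implementation; the accent folding mimics what a character filter such as asciifolding does:

```python
import unicodedata

def fold_accents(text):
    # Decompose characters and strip combining marks, so "é" becomes "e"
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def analyze(text):
    # Stage 1: split the string into chunks (tokenizing)
    tokens = text.split()
    # Stage 2: apply formatting to each token (lowercasing, special chars)
    return [fold_accents(token.lower()) for token in tokens]

print(analyze("We are madewithlove"))  # ['we', 'are', 'madewithlove']
```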

The default behavior is described by the standard analyzer. It splits your text by word and lowercases each of those words. See it in action:

GET /_analyze?analyzer=standard&text=We are madewithlove

{
   "tokens": [
      {
         "token": "we"
      },
      {
         "token": "are"
      },
      {
         "token": "madewithlove"
      }
   ]
}

It is important to understand that these tokens are what ends up in the inverted index, and these tokens are what search operates on. Your original text does not exist in the inverted index (and thus not for search); each token is equally important. Figuring out why a query didn't match your document usually starts with knowing which tokens ended up in the inverted indices.

Analyzing during index time

Index time is the simple part: you define which analyzer to use for which field in your mapping. When data is indexed for that field, it is pushed through that analyzer before ending up in the inverted index. In the example above, we would end up with three tokens in the inverted index.

Analyzing during search time

When it comes to search, we first need to distinguish between exact matching and full text search. An exact match query (like term) takes your search text and checks whether it exists in the inverted index. It does not apply any analyzers: the search text needs to exist in the inverted index as-is. Example:
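The difference can be modelled in a few lines of Python. This is a sketch of term-query semantics over a hypothetical token set, not the real lookup machinery:

```python
def term_query(search_text, inverted_index):
    # Exact match: the raw search text must exist as a token.
    # No analyzer is applied to search_text.
    return search_text in inverted_index

# Tokens our standard analyzer produced at index time
index_tokens = {"we", "are", "madewithlove"}

print(term_query("we", index_tokens))                   # True
print(term_query("we are madewithlove", index_tokens))  # False
```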

This will find a result because our analyzer stored a we token in an inverted index during indexing:

GET /companies/company/_search
{
    "query": {
        "term": {
            "name": "we"
        }
    }
}

{
   "hits": {
      "total": 1,
      "hits": [
         {
            "_source": {
               "name": "We are madewithlove"
            }
         }
      ]
   }
}

This will not find a result because we do not have a we are madewithlove token:

GET /companies/company/_search
{
    "query": {
        "term": {
            "name": "we are madewithlove"
        }
    }
}

{
   "hits": {
      "total": 0,
      "hits": []
   }
}

A full text query (like match) applies analyzers to your search term as well, which means that the search term becomes a set of tokens again that can be compared with the tokens in the inverted index. By default, Elasticsearch uses the analyzer defined in the mapping for that field. So in our example:

GET /companies/company/_search
{
    "query": {
        "match": {
            "name": "we are madewithlove"
        }
    }
}

{
   "hits": {
      "total": 2,
      "hits": [
          {
             "_score": 0.26574233,
             "_source": {
                "name": "We are madewithlove"
             }
          },
          {
             "_score": 0.05758412,
             "_source": {
                "name": "We are madewithmorelove"
             }
          }
      ]
   }
}

After analyzing, Elasticsearch will search for (name:we OR name:are OR name:madewithlove). The default operator is OR, which means that a document with we are madewithmorelove will also match, but with a lower score.
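A simplified model of this OR behavior in Python, where "score" is just the number of overlapping tokens (real Elasticsearch scoring is far more involved; this only illustrates why both documents match and why one ranks higher):

```python
def analyze(text):
    # Rough standard-analyzer approximation: split on whitespace, lowercase
    return text.lower().split()

def match_query(query, documents):
    # OR semantics: a document matches if it shares at least one token
    query_tokens = set(analyze(query))
    hits = []
    for doc in documents:
        overlap = query_tokens & set(analyze(doc))
        if overlap:
            hits.append((doc, len(overlap)))
    # More overlapping tokens -> ranked higher
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

docs = ["We are madewithlove", "We are madewithmorelove"]
print(match_query("we are madewithlove", docs))
# [('We are madewithlove', 3), ('We are madewithmorelove', 2)]
```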

Caveat: troubles with partial matching

At some point you'll have to allow partial matching; for example, made should also match the example documents above. This can be done with an n-gram analyzer, which splits your text into partial tokens, so the inverted index is filled with those partial matches. Example:

PUT /companies
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_ngram_analyzer" : {
                    "tokenizer" : "my_ngram_tokenizer"
                }
            },
            "tokenizer" : {
                "my_ngram_tokenizer" : {
                    "type" : "edgeNGram",
                    "min_gram" : "2",
                    "max_gram" : "10"
                }
            }
        }
    },
    "mappings": {
        "company": {
            "properties": {
                "name": {
                    "analyzer": "my_ngram_analyzer",
                    "type": "string"
                }
            }
        }
    }        
}

If we were to test this analyzer with madewithlove:

GET /companies/_analyze?analyzer=my_ngram_analyzer&text=madewithlove
{
   "tokens": [
      {
         "token": "ma"
      },
      {
         "token": "mad"
      },
      {
         "token": "made"
      },
      {
         "token": "madew"
      },
      {
         "token": "madewi"
      },
      {
         "token": "madewit"
      },
      {
         "token": "madewith"
      },
      {
         "token": "madewithl"
      },
      {
         "token": "madewithlo"
      }
   ]
}

A term query (which is exact matching) would also match the document on madewith, because it exists exactly in the list of tokens.
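The edge n-gram tokenization above can be reproduced with a short Python sketch (prefixes between min_gram and max_gram characters, assuming a single-word input; not the actual edgeNGram implementation):

```python
def edge_ngrams(text, min_gram=2, max_gram=10):
    # Emit every prefix of the text between min_gram and max_gram characters
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]

print(edge_ngrams("madewithlove"))
# ['ma', 'mad', 'made', 'madew', 'madewi', 'madewit',
#  'madewith', 'madewithl', 'madewithlo']
```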

The problem here is that, as we said before, the same analyzer is also applied to the search term. This will return results like:

GET /companies/company/_search
{
    "query": {
        "match": {
            "name": "mamamia"
        }
    }
}

{
   "hits": {
      "total": 2,
      "hits": [
         {
            "_score": 0.0021728154,
            "_source": {
               "name": "madewithmorelove"
            }
         },
         {
            "_score": 0.0021728154,
            "_source": {
               "name": "madewithlove"
            }
         }
      ]
   }
}

Because the same n-gram analyzer is applied, Elasticsearch searches for name:ma OR name:mam OR name:mama OR name:mamam OR name:mamami OR name:mamamia. Both documents have a token ma, and therefore both match and are equally relevant!
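Running the same edge n-gram sketch from before on both the query and a document shows exactly where they collide (again a simplified prefix model, not the real tokenizer):

```python
def edge_ngrams(word, min_gram=2, max_gram=10):
    # All prefixes of the word between min_gram and max_gram characters
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

query_tokens = set(edge_ngrams("mamamia"))
doc_tokens = set(edge_ngrams("madewithlove"))

# The only shared token -- but one shared token is enough to match
print(query_tokens & doc_tokens)  # {'ma'}
```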

This is not what we want in most cases: we should not apply the same analyzer during search. For full text queries you can define the search time analyzer:

GET /companies/company/_search
{
    "query": {
        "match": {
            "name": {
                "query": "mamamia",
                "analyzer": "standard"
            }
        }
    }
}

{
   "hits": {
      "total": 0,
      "hits": []
   }
}

Note that searching for made will work here, as we expect:

GET /companies/company/_search
{
    "query": {
        "match": {
            "name": {
                "query": "made",
                "analyzer": "standard"
            }
        }
    }
}

{
   "hits": {
      "total": 2,
      "hits": [
         {
            "_source": {
               "name": "madewithmorelove"
            }
         },
         {
            "_source": {
               "name": "madewithlove"
            }
         }
      ]
   }
}

Conclusion

This is the basis and is crucial to understanding and building a usable search query. If you are stuck, use the analyze and explain APIs to figure out what data is actually being searched for.
