Basic queries
Elasticsearch has extensive search and data analysis capabilities that are exposed in the form of different queries, filters, aggregations, and so on. In this section, we will concentrate on the basic queries provided by Elasticsearch. By basic queries we mean the ones that don't combine other queries together but run on their own.
The term query
The term query is one of the simplest queries in Elasticsearch. It matches the documents that have the exact, not analyzed term in a given field. The simplest term query is as follows:
{ "query" : { "term" : { "title" : "crime" } } }
It will match the documents that have the term crime in the title field. Remember that the term query is not analyzed, so you need to provide the exact term that will match the term in the indexed document. Note that in our input data, we have the title field with the value of Crime and Punishment (with uppercase first letters), but we are searching for crime, because the Crime term becomes crime after analysis during indexing.
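The effect of analysis on the term query can be illustrated with a minimal Python sketch. The toy_analyze function is a hypothetical stand-in for a real analyzer (it only splits on non-word characters and lowercases), and term_query_matches is an illustrative helper, not an Elasticsearch API:

```python
import re

def toy_analyze(text):
    """Hypothetical, simplified analyzer: split on non-word characters and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def term_query_matches(indexed_text, term):
    """The term is compared as is (no analysis) with the terms produced at index time."""
    return term in toy_analyze(indexed_text)

# The indexed terms are lowercased, so only the lowercase term matches.
print(term_query_matches("Crime and Punishment", "crime"))
print(term_query_matches("Crime and Punishment", "Crime"))
```

Under these assumptions, searching for Crime fails because the index only contains the lowercased term.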
In addition to the term we want to find, we can also include the boost attribute to our term query, which will affect the importance of the given term. We will talk more about boosts in the Introduction to Apache Lucene scoring section of Chapter 6, Make Your Search Better. For now, we just need to remember that it changes the importance of the given part of the query.
For example, to change our previous query and give our term query a boost of 10.0, send the following query:
{ "query" : { "term" : { "title" : { "value" : "crime", "boost" : 10.0 } } } }
As you can see, the query changed a bit. Instead of a simple term value, we nested a new JSON object which contains the value property and the boost property. The value property should contain the term we are interested in and the boost property is the boost value we want to use.
The terms query
The terms query is an extension to the term query. It allows us to match documents that have any of several terms in their contents instead of a single term. The term query allowed us to match a single, not analyzed term, and the terms query allows us to match multiple of those. For example, let's say that we want to get all the documents that have the terms novel or book in the tags field. To achieve this, we will run the following query:
{ "query" : { "terms" : { "tags" : [ "novel", "book" ] } } }
The preceding query returns all the documents that have one or both of the searched terms in the tags field. This is a key point to remember: the terms query will find documents having any of the provided terms.
The match all query
The match all query is one of the simplest queries available in Elasticsearch. It allows us to match all of the documents in the index. If we want to get all the documents from our index, we just run the following query:
{ "query" : { "match_all" : {} } }
We can also include boost in the query, which will be given to all the documents matched by it. For example, if we want to add a boost of 2.0 to all the documents in our match all query, we will send the following query to Elasticsearch:
{ "query" : { "match_all" : { "boost" : 2.0 } } }
The type query
The type query is a very simple query that allows us to find all the documents of a certain type. For example, if we would like to search for all the documents with the book type in our library index, we will run the following query:
{ "query" : { "type" : { "value" : "book" } } }
The exists query
The exists query allows us to find all the documents that have a value in the defined field. For example, to find the documents that have a value in the tags field, we will run the following query:
{ "query" : { "exists" : { "field" : "tags" } } }
The missing query
Opposite to the exists query, the missing query returns the documents that have a null value or no value at all in a given field. For example, to find all the documents that don't have a value in the tags field, we will run the following query:
{ "query" : { "missing" : { "field" : "tags" } } }
The common terms query
The common terms query is a modern Elasticsearch solution for improving query relevance and precision with common words when we are not using stop words (http://en.wikipedia.org/wiki/Stop_words). For example, a crime and punishment query results in three term queries and each of them has a cost in terms of performance. However, the and term is a very common one and its impact on the document score will be very low. The solution is the common terms query, which divides the query into two groups. The first group is the one with important terms, which are the ones that have lower frequency. The second group is the one with less important terms, which are the ones with high frequency. The query for the first group is executed first and Elasticsearch calculates the score for all of the terms from that group. This way the low frequency terms, which are usually the ones that have more importance, are always taken into consideration. Then Elasticsearch executes the second query for the second group of terms, but calculates the score only for the documents matched by the first query. This way the score is only calculated for the relevant documents and thus higher performance can be achieved.
An example of the common terms query is as follows:
{ "query" : { "common" : { "title" : { "query" : "crime and punishment", "cutoff_frequency" : 0.001 } } } }
The query can take the following parameters:
query: The actual query contents.
cutoff_frequency: The percentage (0.001 means 0.1%) or an absolute value (when the property is set to a value equal to or larger than 1). High and low frequency groups are constructed using this value. Setting this parameter to 0.001 means that the low frequency terms group will be constructed for terms having a frequency of 0.1% and lower.
low_freq_operator: This can be set to or or and, but defaults to or. It specifies the Boolean operator used for constructing queries in the low frequency term group. If we want all the terms to be present in a document for it to be considered a match, we should set this parameter to and.
high_freq_operator: This can be set to or or and, but defaults to or. It specifies the Boolean operator used for constructing queries in the high frequency term group. If we want all the terms to be present in a document for it to be considered a match, we should set this parameter to and.
minimum_should_match: Instead of using low_freq_operator and high_freq_operator, we can use minimum_should_match. Just like with the other queries, it allows us to specify the minimum number of terms that should be found in a document for it to be considered a match. We can also specify high_freq and low_freq inside the minimum_should_match object, which allows us to define a different number of terms that need to be matched for the high and low frequency terms.
boost: The boost given to the score of the documents.
analyzer: The name of the analyzer that will be used to analyze the query text, which defaults to the default analyzer.
disable_coord: Defaults to false and allows us to enable or disable the score factor computation that is based on the fraction of all the query terms that a document contains. Set it to true for less precise scoring, but slightly faster queries.
Note
Unlike the term and terms queries, the common terms query is analyzed by Elasticsearch.
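To make the role of cutoff_frequency more concrete, the grouping step can be sketched in Python. This is only an illustration of the idea described above, not the actual Elasticsearch implementation; the split_by_frequency function and the document frequencies used below are made up for the example:

```python
def split_by_frequency(terms, doc_freq, total_docs, cutoff_frequency):
    """Split query terms into (low_frequency, high_frequency) groups.

    doc_freq maps a term to the number of documents containing it
    (hypothetical numbers, normally taken from index statistics).
    """
    # Values below 1 are treated as a fraction of all documents,
    # values of 1 or more as an absolute document count.
    if cutoff_frequency < 1:
        threshold = cutoff_frequency * total_docs
    else:
        threshold = cutoff_frequency
    low, high = [], []
    for term in terms:
        if doc_freq.get(term, 0) > threshold:
            high.append(term)
        else:
            low.append(term)
    return low, high

# Hypothetical statistics for an index with one million documents.
stats = {"crime": 120, "and": 900000, "punishment": 80}
low, high = split_by_frequency(["crime", "and", "punishment"], stats, 1000000, 0.001)
print(low, high)
```

With these assumed statistics, crime and punishment end up in the low frequency group, while and lands in the high frequency group and only contributes to the score of documents already matched by the first group.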
The match query
The match query takes the value given in the query parameter, analyzes it, and constructs the appropriate query out of it. When using a match query, Elasticsearch will choose the proper analyzer for the field we choose, so you can be sure that the terms passed to the match query will be processed by the same analyzer that was used during indexing. Remember that the match query (and the multi_match query) doesn't support Lucene query syntax; however, it fits perfectly as a query handler for your search box. The simplest match (and the default) query will look like the following:
{ "query" : { "match" : { "title" : "crime and punishment" } } }
The preceding query will match all the documents that have the terms crime, and, or punishment in the title field. However, the previous query is only the simplest one; there are multiple types of match query, which we will discuss now.
The Boolean match query
The Boolean match query is a query which analyzes the provided text and makes a Boolean query out of it. This is also the default type for the match query. There are a few parameters which allow us to control the behavior of the Boolean match queries:
operator: This parameter can take the value of or or and, and controls which Boolean operator is used to connect the created Boolean clauses. The default value is or. If we want all the terms in our query to be matched, we should use the and Boolean operator.
analyzer: This specifies the name of the analyzer that will be used to analyze the query text and defaults to the default analyzer.
fuzziness: Providing the value of this parameter allows us to construct fuzzy queries. The accepted values depend on the field type. For numeric fields, it should be set to a numeric value; for date-based fields, it can be set to a millisecond or time value, such as 2h; and for text fields, it can be set to 0, 1, or 2 (the edit distance in the Levenshtein algorithm, https://en.wikipedia.org/wiki/Levenshtein_distance) or to AUTO (which allows Elasticsearch to control how fuzzy queries are constructed and which is the preferred value). Finally, for text fields, it can also be set to values from 0.0 to 1.0, which results in edit distance being calculated as term length minus 1.0 multiplied by the provided fuzziness value. In general, the higher the fuzziness, the more difference between terms will be allowed.
prefix_length: This allows control over the behavior of the fuzzy query. For more information on the value of this parameter, refer to The fuzzy query section in this chapter.
max_expansions: This allows control over the behavior of the fuzzy query. For more information on the value of this parameter, refer to The fuzzy query section in this chapter.
zero_terms_query: This allows us to specify the behavior of the query when all the terms are removed by the analyzer (for example, because of stop words). It can be set to none or all, with none as the default. When set to none, no documents will be returned when the analyzer removes all the query terms. If set to all, all the documents will be returned.
cutoff_frequency: This allows dividing the query into two groups: one with high frequency terms and one with low frequency terms. Refer to the description of the common terms query to see how this parameter can be used.
lenient: When set to true (by default it is false), it allows us to ignore the exceptions caused by data incompatibility, such as trying to query numeric fields using a string value.
The parameters should be wrapped in the name of the field we are running the query against. So if we want to run a sample Boolean match query against the title field, we send a query as follows:
{ "query" : { "match" : { "title" : { "query" : "crime and punishment", "operator" : "and" } } } }
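The difference between the or and and operators can be sketched in a few lines of Python. This is a simplified model only: toy_analyze is a hypothetical analyzer (lowercasing and splitting on non-word characters) and boolean_match is an illustrative helper that ignores scoring entirely:

```python
import re

def toy_analyze(text):
    """Hypothetical, simplified analyzer: split on non-word characters and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def boolean_match(field_text, query_text, operator="or"):
    """Analyze the query text and connect the resulting terms with the given operator."""
    doc_terms = set(toy_analyze(field_text))
    query_terms = toy_analyze(query_text)
    if not query_terms:
        # Mirrors zero_terms_query set to none: nothing matches.
        return False
    hits = [term in doc_terms for term in query_terms]
    return all(hits) if operator == "and" else any(hits)

print(boolean_match("Crime and Punishment", "crime and punishment", "and"))
print(boolean_match("Crime Scene", "crime and punishment", "and"))
print(boolean_match("Crime Scene", "crime and punishment", "or"))
```

In this model, a title such as Crime Scene matches the query only with the default or operator, because just one of the three analyzed terms is present.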
The phrase match query
A phrase match query is similar to the Boolean match query, but instead of constructing the Boolean clauses from the analyzed text, it constructs the phrase query. You may wonder what a phrase is when it comes to Lucene and Elasticsearch. It is two or more terms positioned one after another in a given order. The following parameters are available:
slop: An integer value that defines how many unknown words can be put between the terms in the text query for a match to be considered a phrase. The default value of this parameter is 0, which means that no additional words are allowed.
analyzer: This specifies the name of the analyzer that will be used to analyze the query text and defaults to the default analyzer.
A sample phrase match query against the title field looks like the following code:
{ "query" : { "match_phrase" : { "title" : { "query" : "crime punishment", "slop" : 1 } } } }
Note that we removed the and term from our query, but because the slop is set to 1, it will still match our document because we allowed one term to be present between our terms.
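The way slop works can be approximated with the following Python sketch. It is a deliberately simplified model: it only considers terms that appear in the same order as in the query and counts the extra words between them, whereas the real phrase query in Lucene is more permissive. The phrase_matches function is hypothetical:

```python
from itertools import product

def phrase_matches(doc_terms, phrase_terms, slop=0):
    """Return True if phrase_terms occur in order with at most slop extra words."""
    if not phrase_terms:
        return False
    # Collect every position at which each phrase term occurs in the document.
    positions = [[i for i, t in enumerate(doc_terms) if t == term]
                 for term in phrase_terms]
    if any(not p for p in positions):
        return False
    for combo in product(*positions):
        in_order = all(a < b for a, b in zip(combo, combo[1:]))
        # Words between the first and last term that are not part of the phrase.
        extra_words = combo[-1] - combo[0] - (len(combo) - 1)
        if in_order and extra_words <= slop:
            return True
    return False

title = ["crime", "and", "punishment"]
print(phrase_matches(title, ["crime", "punishment"], slop=1))
print(phrase_matches(title, ["crime", "punishment"], slop=0))
```

With a slop of 1 the single word between crime and punishment is tolerated; with the default slop of 0 the same query does not match in this model.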
The match phrase prefix query
The last type of the match query is the match phrase prefix query. This query is almost the same as the phrase match query, but in addition, it allows prefix matches on the last term in the query text. Also, in addition to the parameters exposed by the match phrase query, it exposes an additional one: the max_expansions parameter, which controls how many prefixes the last term will be rewritten to. Our example query changed to the match_phrase_prefix query will look as follows:
{ "query" : { "match_phrase_prefix" : { "title" : { "query" : "crime punishm", "slop" : 1, "max_expansions" : 20 } } } }
Note that we didn't provide the full crime and punishment phrase, but only crime punishm, and still the query would match our document. This is because we used the match_phrase_prefix query combined with slop set to 1.
The multi match query
The multi match query is the same as the match query, but instead of running against a single field, it can be run against multiple fields with the use of the fields parameter. Of course, all the parameters you use with the match query can be used with the multi match query. So if we would like to modify our match query to be run against the title and otitle fields, we will run the following query:
{ "query" : { "multi_match" : { "query" : "crime punishment", "fields" : [ "title^10", "otitle" ] } } }
As shown in the preceding example, the nice thing about the multi match query is that the fields defined in it support boosting, so we can increase or decrease the importance of matches on certain fields.
However, this is not the only difference when it comes to comparison with the match query. We can also control how the query is run internally by using the type property and setting it to one of the following values:
best_fields: This is the default behavior, which finds documents having matches in any field from the defined ones, but sets the document score to the score of the best matching field. It is the most useful type when searching for multiple words and wanting to boost documents that have those words in the same field.
most_fields: This value finds documents that match any field and sets the score of the document to the combined score from all the matched fields.
cross_fields: This value treats the query as if all the terms were in one big field, thus returning documents matching any field.
phrase: This value uses the match_phrase query on each field and sets the score of the document to the score combined from all the fields.
phrase_prefix: This value uses the match_phrase_prefix query on each field and sets the score of the document to the score combined from all the fields.
In addition to the parameters mentioned in the match query and type, the multi match query exposes some additional ones allowing more control over its behavior:
tie_breaker: This allows us to specify the balance between the minimum and the maximum scoring query items and the value can be from 0.0 to 1.0. When used, the score of the document is equal to the score of the best scoring element plus the tie_breaker multiplied by the scores of all the other matching fields in the document. So, when set to 0.0, Elasticsearch will only use the score of the highest scoring matching element. You can read more about it in The dis_max query section in this chapter.
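The tie_breaker arithmetic described above can be written down as a short Python sketch. The combine_field_scores function and the per-field scores are hypothetical and only illustrate the formula; real scores come from Lucene:

```python
def combine_field_scores(field_scores, tie_breaker=0.0):
    """Best field score plus tie_breaker times the sum of the other field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie_breaker * others

# Hypothetical per-field scores for a single document.
scores = [2.0, 1.0, 0.5]
print(combine_field_scores(scores, tie_breaker=0.0))
print(combine_field_scores(scores, tie_breaker=0.5))
```

A tie_breaker of 0.0 keeps only the best field, while 1.0 simply adds all the field scores together; values in between let the other fields contribute partially.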
The query string query
In comparison to the other queries available, the query string query supports full Apache Lucene query syntax, which we discussed earlier in the Lucene query syntax section of Chapter 1, Getting Started with Elasticsearch Cluster. It uses a query parser to construct an actual query using the provided text. An example query string query will look like the following code:
{ "query" : { "query_string" : { "query" : "title:crime^10 +title:punishment -otitle:cat +author:(+Fyodor +dostoevsky)", "default_field" : "title" } } }
Because we are familiar with the basics of the Lucene query syntax, we can discuss how the preceding query works. As you can see, we wanted to get the documents that may have the term crime in the title field, and such documents should be boosted with the value of 10. Next, we wanted only the documents that have the term punishment in the title field and we didn't want documents with the term cat in the otitle field. Finally, we told Lucene that we only wanted the documents that had the fyodor and dostoevsky terms in the author field.
Similar to most of the queries in Elasticsearch, the query string query provides parameters that allow us to control the query behavior, and the list of parameters for this query is rather extensive:
query: This specifies the query text.
default_field: This specifies the default field the query will be executed against. It defaults to the index.query.default_field property, which is by default set to _all.
default_operator: This specifies the default logical operator (or or and) used when no operator is specified. The default value of this parameter is or.
analyzer: This specifies the name of the analyzer used to analyze the query provided in the query parameter.
allow_leading_wildcard: This specifies if a wildcard character is allowed as the first character of a term. It defaults to true.
lowercase_expanded_terms: This specifies if the terms that are a result of query rewrite should be lowercased. It defaults to true, which means that the rewritten terms will be lowercased.
enable_position_increments: This specifies if position increments should be turned on in the result query. It defaults to true.
fuzzy_max_expansions: This specifies the maximum number of terms into which the fuzzy query will be expanded, if a fuzzy query is used. It defaults to 50.
fuzzy_prefix_length: This specifies the prefix length for the generated fuzzy queries and defaults to 0. To learn more about it, look at the fuzzy query description.
phrase_slop: This specifies the phrase slop and defaults to 0. To learn more about it, look at the phrase match query description.
boost: This specifies the boost value which will be used and defaults to 1.0.
analyze_wildcard: This specifies if the terms generated by the wildcard query should be analyzed. It defaults to false, which means that those terms won't be analyzed.
auto_generate_phrase_queries: This specifies if the phrase queries will be automatically generated from the query. It defaults to false, which means that the phrase queries won't be automatically generated.
minimum_should_match: This controls how many of the generated Boolean should clauses should be matched against a document for the document to be considered a hit. The value can be provided as a percentage; for example, 50%, which would mean that at least 50 percent of the given terms should match. It can also be provided as an integer value, such as 2, which means that at least 2 terms must match.
fuzziness: This controls the behavior of the generated fuzzy query. Refer to the match query description for more information.
max_determinized_states: This defaults to 10000 and sets the number of states that the automaton can have for handling regular expression queries. It is used to disallow very expensive queries using regular expressions.
locale: This sets the locale that should be used for the conversion of string values. By default, it is set to ROOT.
time_zone: This sets the time zone that should be used by range queries that are run on date-based fields.
lenient: This can take the value of true or false. If set to true, format-based failures will be ignored. By default, it is set to false.
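The two forms of minimum_should_match mentioned in the list above (a percentage and an integer) can be illustrated with a small Python sketch. The required_matches function is hypothetical, handles only positive values, and assumes that a percentage is rounded down to a whole number of clauses:

```python
def required_matches(total_clauses, minimum_should_match):
    """Translate minimum_should_match into a number of clauses that must match."""
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        # Assumption: the result of the percentage calculation is rounded down.
        return total_clauses * percent // 100
    return int(minimum_should_match)

print(required_matches(4, "50%"))
print(required_matches(3, "50%"))
print(required_matches(5, 2))
```

Under this assumption, 50% of four terms means two terms must match, while 50% of three terms means only one.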
Note that Elasticsearch can rewrite the query string query and, because of that, Elasticsearch allows us to pass additional parameters that control the rewrite method. For more details about this process, go to the Understanding the querying process section in this chapter.
Running the query string query against multiple fields
It is possible to run the query string query against multiple fields. In order to do that, one needs to provide the fields parameter in the query body, which should hold the array of the field names. There are two methods of running the query string query against multiple fields: the default method uses the Boolean query to make queries and the other method can use the dis_max query.
In order to use the dis_max query, one should add the use_dis_max property in the query body and set it to true. An example query will look like the following code:
{ "query" : { "query_string" : { "query" : "crime punishment", "fields" : [ "title", "otitle" ], "use_dis_max" : true } } }
The simple query string query
The simple query string query uses one of the newest query parsers in Lucene - the SimpleQueryParser (https://lucene.apache.org/core/5_4_0/queryparser/org/apache/lucene/queryparser/simple/SimpleQueryParser.html). Similar to the query string query, it accepts Lucene query syntax as the query; however, unlike it, it never throws an exception when a parsing error happens. Instead of throwing an exception, it discards the invalid parts of the query and runs the rest.
An example simple query string query will look like the following code:
{ "query" : { "simple_query_string" : { "query" : "crime punishment", "default_operator" : "or" } } }
The query supports parameters such as query, fields, default_operator, analyzer, lowercase_expanded_terms, locale, lenient, and minimum_should_match, and can also be run against multiple fields using the fields property.
The identifiers query
This is a simple query that filters the returned documents to only those with the provided identifiers. It works on the internal _uid field, so it doesn't require the _id field to be enabled. The simplest version of such a query will look like the following:
{ "query" : { "ids" : { "values" : [ "1", "2", "3" ] } } }
This query will only return those documents that have one of the identifiers present in the values array. We can complicate the identifiers query a bit and also limit the documents on the basis of their type. For example, if we want to only include documents of the book type, we will send the following query:
{ "query" : { "ids" : { "type" : "book", "values" : [ "1", "2", "3" ] } } }
As you can see, we've added the type property to our query and we've set its value to the type we are interested in.
The prefix query
This query is similar to the term query in its configuration and to the multi term query when looking into its logic. The prefix query allows us to match documents that have a value in a certain field that starts with a given prefix. For example, if we want to find all the documents that have values starting with cri in the title field, we will run the following query:
{ "query" : { "prefix" : { "title" : "cri" } } }
Similar to the term query, you can also include the boost attribute in your prefix query, which will affect the importance of the given prefix. For example, if we would like to change our previous query and give our query a boost of 3.0, we will send the following query:
{ "query" : { "prefix" : { "title" : { "value" : "cri", "boost" : 3.0 } } } }
Note
The prefix query is rewritten by Elasticsearch and, because of that, Elasticsearch allows us to pass an additional parameter that controls the rewrite method. For more details about that process, refer to the Understanding the querying process section in this chapter.
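To get an intuition for what rewriting a prefix query means, we can think of it as expanding the prefix into the indexed terms that start with it. The following Python sketch models that idea with a plain set of terms; expand_prefix, max_terms, and the term dictionary are hypothetical and do not reflect the real rewrite methods:

```python
def expand_prefix(term_dictionary, prefix, max_terms=None):
    """Return the indexed terms starting with prefix, optionally limited in number."""
    matches = sorted(term for term in term_dictionary if term.startswith(prefix))
    return matches if max_terms is None else matches[:max_terms]

# Hypothetical terms present in the title field of an index.
title_terms = {"crime", "crisis", "punishment", "cat"}
print(expand_prefix(title_terms, "cri"))
```

The more terms share the prefix, the more work the expanded query has to do, which is why the way the query is rewritten matters for performance.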
The fuzzy query
The fuzzy query allows us to find documents that have values similar to the ones we've provided in the query. The similarity of terms is calculated on the basis of the edit distance between the terms we provide in the query and the terms in the searched documents. This query can be expensive when it comes to CPU resources, but can help us when we need fuzzy matching; for example, when users make spelling mistakes. In our example, let's assume that instead of crime, our user enters the crme word into the search box and we would like to run the simplest form of fuzzy query. Such a query will look like this:
{ "query" : { "fuzzy" : { "title" : "crme" } } }
The response for such a query will be as follows:
{
  "took" : 81,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.5,
      "_source" : {
        "title" : "Crime and Punishment",
        "otitle" : "Преступлéние и наказáние",
        "author" : "Fyodor Dostoevsky",
        "year" : 1886,
        "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ],
        "tags" : [ ],
        "copies" : 0,
        "available" : true
      }
    } ]
  }
}
Even though we made a typo, Elasticsearch managed to find the document we were interested in.
We can control the fuzzy query behavior by using the following parameters:
value: This specifies the actual query.
boost: This specifies the boost value for the query. It defaults to 1.0.
fuzziness: This controls the behavior of the generated fuzzy query. Refer to the match query description for more information.
prefix_length: This is the length of the common prefix that the matched terms must share with the query term. It defaults to 0.
max_expansions: This specifies the maximum number of terms the query will be expanded to. The default value is unbounded.
The parameters should be wrapped in the name of the field we are running the query against. So if we would like to modify the previous query and add additional parameters, the query will look like the following code:
{ "query" : { "fuzzy" : { "title" : { "value" : "crme", "fuzziness" : 2 } } } }
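The edit distance mentioned in this section can be computed with the classic dynamic programming algorithm. The following Python function is a minimal sketch of the plain Levenshtein distance (insertions, deletions, and substitutions only); it is an illustration and not the code Elasticsearch or Lucene actually run:

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("crme", "crime"))
```

The typo crme is one insertion away from crime, so an edit distance of 1 is enough for it to match and the fuzziness of 2 used in the preceding query allows even more difference.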
The wildcard query
The wildcard query allows us to use the * and ? wildcards in the values we search. Apart from that, the wildcard query is very similar to the term query when it comes to its body. To send a query that would match all the documents with terms matching cr?me (? matching any single character), we would send the following query:
{ "query" : { "wildcard" : { "title" : "cr?me" } } }
It will match the documents that have any term matching cr?me in the title field. You can also include the boost attribute in your wildcard query, which will affect the importance of each term that matches the given value. For example, if we would like to change our previous query and give our wildcard query a boost of 20.0, we will send the following query:
{ "query" : { "wildcard" : { "title" : { "value" : "cr?me", "boost" : 20.0 } } } }
Note
Wildcard queries are expensive in terms of performance and should be avoided if possible; especially avoid leading wildcards (terms starting with wildcards). The wildcard query is rewritten by Elasticsearch and, because of that, Elasticsearch allows us to pass an additional parameter that controls the rewrite method. For more details about this process, refer to the Understanding the querying process section in this chapter. Also remember that the wildcard query is not analyzed.
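The matching rules of the two wildcards can be tried out with Python's standard fnmatch module, which uses the same meaning for * and ?. This is only a sketch: wildcard_terms and the term set are hypothetical, and fnmatch additionally understands [seq] character sets, which are outside of what is described here:

```python
from fnmatch import fnmatchcase

def wildcard_terms(term_dictionary, pattern):
    """Return the indexed terms matching the pattern; the pattern is used as is."""
    return sorted(term for term in term_dictionary if fnmatchcase(term, pattern))

# Hypothetical terms present in the title field of an index.
title_terms = {"crime", "creme", "cream", "crimes"}
print(wildcard_terms(title_terms, "cr?me"))
print(wildcard_terms(title_terms, "Cr?me"))
```

Because the pattern is not analyzed, the uppercase Cr?me finds nothing among the lowercased terms in this model.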
The range query
The range query allows us to find documents that have a field value within a certain range. It works for numerical fields as well as for string-based fields and date-based fields (it just maps to a different Apache Lucene query). The range query should be run against a single field and the query parameters should be wrapped in the field name. The following parameters are supported:
gte: The query will match documents with the value greater than or equal to the one provided with this parameter
gt: The query will match documents with the value greater than the one provided with this parameter
lte: The query will match documents with the value lower than or equal to the one provided with this parameter
lt: The query will match documents with the value lower than the one provided with this parameter
So for example, if we want to find all the books that have the value from 1700 to 1900 in the year field, we will run the following query:
{ "query" : { "range" : { "year" : { "gte" : 1700, "lte" : 1900 } } } }
The regular expression query
The regular expression query allows us to use regular expressions as the query text. Remember that the performance of such queries depends on the chosen regular expression. The general rule is that the more terms matched by the regular expression, the slower the query will be.
An example regular expression query looks like this:
{ "query" : { "regexp" : { "title" : { "value" : "cr.m[ae]", "boost" : 10.0 } } } }
The preceding query will result in Elasticsearch rewriting the query. The number of term queries in the rewritten query depends on how many terms in our index match the given regular expression. The boost parameter seen in the query specifies the boost value for the generated queries.
The full regular expression syntax accepted by Elasticsearch can be found at https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html#regexp-syntax.
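The dependency between the regular expression and the number of matched terms can be explored with a short Python sketch. It uses Python's re module, whose syntax is not identical to the one accepted by Elasticsearch, and it assumes that the expression has to match a whole term; regexp_terms and the term set are hypothetical:

```python
import re

def regexp_terms(term_dictionary, pattern):
    """Return the indexed terms that the pattern matches in full."""
    compiled = re.compile(pattern)
    return sorted(term for term in term_dictionary if compiled.fullmatch(term))

# Hypothetical terms present in the title field of an index.
title_terms = {"crime", "crema", "crimes", "cream"}
print(regexp_terms(title_terms, "cr.m[ae]"))
```

Each term returned here would correspond to one generated term query, so a very broad expression such as cr.* would expand to many more of them.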
The more like this query
One of the queries that got a major rework in Elasticsearch 2.0, the more like this query allows us to retrieve documents that are similar (or not similar) to the provided text or to the documents that were provided. Elasticsearch supports a few parameters to define how the more like this query should work:
fields: An array of fields that the query should be run against. It defaults to the _all field.
like: This parameter comes in two flavors: it allows us to provide a text which the returned documents should be similar to or an array of documents that the returned documents should be similar to.
unlike: This is similar to the like parameter, but it allows us to define text or documents that the returned documents should not be similar to.
min_term_freq: The minimum term frequency (for the terms in the documents) below which terms will be ignored. It defaults to 2.
max_query_terms: The maximum number of terms that will be included in any generated query. It defaults to 25. A higher value may mean higher precision, but lower performance.
stop_words: An array of words that will be ignored when comparing documents and the query. It is empty by default.
min_doc_freq: The minimum number of documents in which the term has to be present in order not to be ignored. It defaults to 5, which means that a term needs to be present in at least five documents.
max_doc_freq: The maximum number of documents in which the term may be present in order not to be ignored. By default, it is unbounded (set to 0).
min_word_len: The minimum length of a single word below which a word will be ignored. It defaults to 0.
max_word_len: The maximum length of a single word above which it will be ignored. It defaults to unbounded (which means setting the value to 0).
boost_terms: The boost value that will be used when boosting each term. It defaults to 0.
boost: The boost value that will be used when boosting the query. It defaults to 1.
include: This specifies if the input documents should be included in the results returned by the query. It defaults to false, which means that the input documents won't be included.
minimum_should_match: This controls the number of terms that need to be matched in the resulting documents. By default, it is set to 30%.
analyzer: The name of the analyzer that will be used to analyze the text we provided.
An example for a more like this query looks like this:
{ "query" : { "more_like_this" : { "fields" : [ "title", "otitle" ], "like" : "crime and punishment", "min_term_freq" : 1, "min_doc_freq" : 1 } } }
As we said earlier, the like property can also be used to show which documents the results should be similar to. For example, the following is the query that will use the like property to point to a given document (note that the following query won't return documents on our example data):
{ "query" : { "more_like_this" : { "fields" : [ "title", "otitle" ], "min_term_freq" : 1, "min_doc_freq" : 1, "like" : [ { "_index" : "library", "_type" : "book", "_id" : "4" } ] } } }
We can also mix the documents and text together:
{ "query" : { "more_like_this" : { "fields" : [ "title", "otitle" ], "min_term_freq" : 1, "min_doc_freq" : 1, "like" : [ { "_index" : "library", "_type" : "book", "_id" : "4" }, "crime and punishment" ] } } }