Mastering Elasticsearch: A Beginner’s Guide to Powerful Searches and Precision — Part 1
Unlock the power of Elasticsearch: dive into Elasticsearch, grasp basic search queries, and explore lexical search
Contents
· Introduction
· Starting where we left off, Elasticsearch
∘ Sample Dataset
∘ Understanding Elasticsearch Queries
∘ Understanding the response
∘ A basic search query
· Lexical Search
· Problems in our current search query
∘ Similar words return different results
∘ Lack of understanding of what the user wants
∘ Similar words are not returned
∘ Typos are ignored
∘ Different combinations of words have different meanings
· Improving our search
∘ Boosting more relevant fields
∘ Boosting based on functions
∘ Fuzzy Queries
· Conclusion
Introduction
Ever wondered how you effortlessly find the perfect pair of shoes online or stumble upon a friend’s post in the vast realm of social media? It’s all thanks to the unsung hero of digital experiences: search systems.
Think back to your latest online purchase — whether it was a stylish pair of shoes or a thoughtful book for a friend. How did you stumble upon exactly what you were looking for? Chances are, you navigated through a sea of options using the search bar! That’s the magic of search systems, quietly shaping our online experiences and making it a breeze to discover the perfect find amidst the digital aisles. In a world teeming with choices, the ability to find what we seek quickly and effortlessly is a testament to the importance of robust and intuitive search systems for the products we love.
In my recent Elasticsearch exploration (check out my primer on its architecture and terminology), we uncovered the engine powering these discoveries. This post delves into search: navigating Elasticsearch queries, comprehending responses, and crafting a basic query to set the stage.
Our goal is to build a simple search query, identify its problems, and improve it with practical examples. Join us in acknowledging the challenges within our current search system and discovering a pathway to refinement in this world of digital aisles.
Starting where we left off, Elasticsearch
Sample Dataset
To demonstrate different ways we can improve search, let’s set up Elasticsearch and load some data into it. For this post, I will use this News dataset I found on Kaggle. The dataset is pretty simple: it contains around 210,000 news articles with their headlines, short descriptions, authors, and some other fields we don’t care much about. We don’t really need all 210,000 documents, so I will load around 10,000 of them into ES and start searching.
These are a few examples of the documents in the dataset —
[
{
"link": "https://www.huffpost.com/entry/new-york-city-board-of-elections-mess_n_60de223ee4b094dd26898361",
"headline": "Why New York City’s Board Of Elections Is A Mess",
"short_description": "“There’s a fundamental problem having partisan boards of elections,” said a New York elections attorney.",
"category": "POLITICS",
"authors": "Daniel Marans",
"country": "IN",
"timestamp": 1689878099
},
....
]
Each document represents a news article. Each article contains a link, a headline, a short_description, a category, authors, a country (random values, added by me), and a timestamp (again random values, added by me).
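For readers who want to reproduce a similar setup, here is a rough sketch of how documents in this shape could be prepared for Elasticsearch's _bulk endpoint, which expects an action line followed by the document itself, one JSON object per line. The helper name, the list of countries, and the timestamp range are my own assumptions for illustration, not the exact script used for this post.

```python
import json
import random

# Assumption: only the two country values visible in the sample documents.
COUNTRIES = ["US", "IN"]

def to_bulk_lines(articles, index="news", seed=0):
    """Build the newline-delimited body that the _bulk API expects."""
    rng = random.Random(seed)
    lines = []
    for article in articles:
        doc = dict(article)
        # Random extras, mirroring the two fields added for this post.
        doc["country"] = rng.choice(COUNTRIES)
        # Assumption: an arbitrary range of epoch seconds.
        doc["timestamp"] = rng.randint(1_689_000_000, 1_700_000_000)
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # document line
    return "\n".join(lines) + "\n"  # a bulk body must end with a newline

articles = [
    {
        "headline": "Why New York City’s Board Of Elections Is A Mess",
        "category": "POLITICS",
        "authors": "Daniel Marans",
    },
]
bulk_body = to_bulk_lines(articles)
print(bulk_body)
```

The resulting string can be sent as the body of a POST _bulk request with the Content-Type header set to application/x-ndjson.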
I added country and timestamp fields to make the examples in the following sections more fun, so let’s begin!
Understanding Elasticsearch Queries
Elasticsearch queries are written in JSON. Instead of diving deep into all the different syntaxes you can use to create search queries, let’s start simple and build from there.
The simplest full-text query is the match query. The idea is simple: you provide some text, and Elasticsearch performs a full-text search for it against a specific field. For example,
GET news/_search
{
"query": {
"match": {
"headline": "robbery"
}
}
}
The above query finds all articles where the word “robbery” appears in the “headline”. These are the results I got back –
{
"_index" : "news",
"_id" : "RzrouIsBC1dvdsZHf2cP",
"_score" : 7.4066687,
"_source" : {
"link" : "https://www.huffpost.com/entry/guard-cat-hailed-as-hero_n_62e9a515e4b00f4cf2352a6f",
"headline" : "Bandit The 'Guard Cat' Hailed As Hero After Thwarting Would-Be Robbery",
"short_description" : "When at least two people tried to break into a Tupelo, Mississippi, home last week, the cat did everything she could to alert its owner.",
"category" : "WEIRD NEWS",
"authors" : "",
"country" : "US",
"timestamp" : 1693070640
}
},
{
"_index" : "news",
"_id" : "WTrouIsBC1dvdsZHp2wd",
"_score" : 7.4066687,
"_source" : {
"link" : "https://www.huffpost.com/entry/san-francisco-news-crew-security-guard-shot-killed_n_61a2a9d8e4b0ae9a42af278a",
"headline" : "News Crew Security Guard Dies After Being Shot While Helping Robbery Coverage",
"short_description" : "Kevin Nishita was shot in the abdomen while doing his job amid an uptick in organized retail crime.",
"category" : "CRIME",
"authors" : "Daisy Nguyen, AP",
"country" : "US",
"timestamp" : 1692480894
}
}
But what if you want to perform a full-text search on multiple fields? You can do that with a multi_match query:
GET news/_search
{
"query": {
"multi_match": {
"query": "robbery",
"fields": ["headline", "short_description"]
}
}
}
This performs a similar operation, but instead of looking at a single field, it now looks at both the headline and the short_description of all the documents and performs a full-text search on them.
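If you send these queries from application code rather than a console, the request body is just a JSON object. A minimal sketch of building the two bodies shown above follows; the helper names match_query and multi_match_query are hypothetical, and how the body is sent (HTTP client, official client library) is left out on purpose.

```python
import json

def match_query(field, text):
    """Body for a full-text search against a single field."""
    return {"query": {"match": {field: text}}}

def multi_match_query(text, fields):
    """Body for a full-text search against several fields at once."""
    return {"query": {"multi_match": {"query": text, "fields": list(fields)}}}

single = match_query("headline", "robbery")
multi = multi_match_query("robbery", ["headline", "short_description"])

# This is the JSON that would go in the body of GET news/_search.
print(json.dumps(multi, indent=2))
```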
Understanding the response
This is a sample response from our last query –
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 29.626675,
"hits" : [
{
"_index" : "news",
"_id" : "RzrouIsBC1dvdsZHf2cP",
"_score" : 29.626675,
"_source" : {
"link" : "https://www.huffpost.com/entry/guard-cat-hailed-as-hero_n_62e9a515e4b00f4cf2352a6f",
"headline" : "Bandit The 'Guard Cat' Hailed As Hero After Thwarting Would-Be Robbery",
"short_description" : "When at least two people tried to break into a Tupelo, Mississippi, home last week, the cat did everything she could to alert its owner.",
"category" : "WEIRD NEWS",
"authors" : "",
"country" : "US",
"timestamp" : 1693070640
}
},
.....
]
}
}
The took and timed_out fields are pretty easy to understand: they represent the time in milliseconds Elasticsearch took to execute the search, and whether the query timed out.
The _shards field tells how many shards were involved in this search operation, how many of them returned successfully, how many failed, and how many were skipped.
The hits field contains the documents returned from the search. Each document is given a score based on how relevant it is to our search. The hits field also contains a total field with the number of documents that matched the query, and max_score, the highest score among them.
Finally, in the nested hits field, we get the relevant documents along with their _id and _score. The documents are sorted by score, highest first.
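To make the structure concrete, here is a small sketch that pulls the interesting parts out of a response like the one above. The response dictionary is a trimmed copy of the sample (only the first hit and only its headline), and the summarize helper is a hypothetical name of mine.

```python
def summarize(response):
    """Extract timing, match count, and (id, score, headline) per hit."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "total": hits["total"]["value"],
        "max_score": hits["max_score"],
        "results": [
            (hit["_id"], hit["_score"], hit["_source"]["headline"])
            for hit in hits["hits"]
        ],
    }

# Trimmed copy of the sample response shown above.
response = {
    "took": 5,
    "timed_out": False,
    "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "max_score": 29.626675,
        "hits": [
            {
                "_index": "news",
                "_id": "RzrouIsBC1dvdsZHf2cP",
                "_score": 29.626675,
                "_source": {
                    "headline": "Bandit The 'Guard Cat' Hailed As Hero "
                                "After Thwarting Would-Be Robbery",
                },
            },
        ],
    },
}

summary = summarize(response)
for doc_id, score, headline in summary["results"]:
    print(f"{score:>10.4f}  {doc_id}  {headline}")
```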
A basic search query
Let’s start building our search query. We can start with a simple query and dissect problems in it –
GET news/_search
{
"query": {
"multi_match": {
"query": "robbery",
"fields": ["headline", "short_description"]
}
}
}
This is a pretty simple query: it finds all the documents where the word “robbery” appears in any of the given fields, i.e. headline or short_description.
It returns a few results, and we can see that each of them contains the word “robbery”.
{
.....
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 8.164355,
"hits" : [
{
"_index" : "news",
"_id" : "hjrouIsBC1dvdsZHgWdm",
"_score" : 8.164355,
"_source" : {
"link" : "https://www.huffpost.com/entry/lady-gaga-dog-walker-reward_n_62d82efee4b000da23fafad7",
"headline" : "$5K Reward For Suspect In Shooting Of Lady Gaga’s Dog Walker",
"short_description" : "One of the men involved in the violent robbery was mistakenly released from custody in April and remains missing.",
"category" : "U.S. NEWS",
"authors" : "STEFANIE DAZIO, AP",
"country" : "IN",
"timestamp" : 1694863246
}
},
{
"_index" : "news",
"_id" : "RzrouIsBC1dvdsZHf2cP",
"_score" : 7.4066687,
"_source" : {
"link" : "https://www.huffpost.com/entry/guard-cat-hailed-as-hero_n_62e9a515e4b00f4cf2352a6f",
"headline" : "Bandit The 'Guard Cat' Hailed As Hero After Thwarting Would-Be Robbery",
"short_description" : "When at least two people tried to break into a Tupelo, Mississippi, home last week, the cat did everything she could to alert its owner.",
"category" : "WEIRD NEWS",
"authors" : "",
"country" : "US",
"timestamp" : 1693070640
}
},
{
"_index" : "news",
"_id" : "WTrouIsBC1dvdsZHp2wd",
"_score" : 7.4066687,
"_source" : {
"link" : "https://www.huffpost.com/entry/san-francisco-news-crew-security-guard-shot-killed_n_61a2a9d8e4b0ae9a42af278a",
"headline" : "News Crew Security Guard Dies After Being Shot While Helping Robbery Coverage",
"short_description" : "Kevin Nishita was shot in the abdomen while doing his job amid an uptick in organized retail crime.",
"category" : "CRIME",
"authors" : "Daisy Nguyen, AP",
"country" : "US",
"timestamp" : 1692480894
}
}
]
}
}
Lexical Search
What we’ve engaged in thus far is referred to as ‘lexical search’. In this type of search, the system looks for documents that contain the literal words of the query. In essence, when a user inputs ‘robbery’, our search query identifies all documents containing the exact term ‘robbery’ (ignoring case, since text fields are lowercased by the default analyzer). While this method may appear intuitive initially, its limitations become apparent quite swiftly, as we will soon discover.
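A toy version of the idea can help build intuition. The sketch below builds a tiny inverted index (term to document ids) over two of the headlines from this post and looks terms up exactly. The tokenizer is a deliberately crude stand-in for an Elasticsearch analyzer, and nothing here reflects how Elasticsearch is implemented internally.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Toy analyzer: lowercase, then split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, term):
    """Exact term lookup after the same normalization used at index time."""
    tokens = tokenize(term)
    return sorted(index.get(tokens[0], set())) if tokens else []

docs = {
    1: "Bandit The 'Guard Cat' Hailed As Hero After Thwarting Would-Be Robbery",
    2: "Man Robbed Of Assault Rifle At Gunpoint Opens Fire With Second Gun",
}
index = build_index(docs)

print(search(index, "Robbery"))  # case does not matter
print(search(index, "robbed"))   # but the word form does
print(search(index, "rob"))      # and a shared root is not enough
```

Each lookup only finds documents that contain that exact token, which is the behaviour the next section runs into.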
Problems in our current search query
Similar words return different results
Let’s start with an example and see what we get when the user searches for “robbed”:
GET news/_search
{
"query": {
"multi_match": {
"query": "robbed",
"fields": ["headline", "short_description"]
}
}
}
These are the results I get back —
{
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 7.9044275,
"hits" : [
{
"_index" : "news",
"_id" : "YTrouIsBC1dvdsZHh2jf",
"_score" : 7.9044275,
"_source" : {
"link" : "https://www.huffpost.com/entry/multiple-guns-robbery-wellston-market-missouri_n_62994cbee4b05fe694f296ad",
"headline" : "Man Robbed Of Assault Rifle At Gunpoint Opens Fire With Second Gun",
"short_description" : "The accused robber was struck multiple times, and two bystanders were wounded in the St. Louis shootout.",
"category" : "CRIME",
"authors" : "Mary Papenfuss",
"country" : "IN",
"timestamp" : 1691458552
}
},
{
"_index" : "news",
"_id" : "YDrouIsBC1dvdsZH73UQ",
"_score" : 7.8303137,
"_source" : {
"link" : "https://www.huffpost.com/entry/michigan-militia-training-video-gretchen-whitmer_n_5f8b6e26c5b6dc2d17f78e0a",
"headline" : "Chilling Training Videos Released Of Militia Men Charged In Michigan Gov. Kidnap Plot",
"short_description" : ""I’m sick of being robbed and enslaved by the state ... they are the enemy. Period," says one suspect in a video.",
"category" : "POLITICS",
"authors" : "Mary Papenfuss",
"country" : "IN",
"timestamp" : 1692613291
}
}
]
}
To keep it simple, these are the headlines of the documents I got back —
1. "Man Robbed Of Assault Rifle At Gunpoint Opens Fire With Second Gun"
2. "Chilling Training Videos Released Of Militia Men Charged In Michigan Gov. Kidnap Plot"
Both of these documents contain the word “robbed” in either the headline or the description. But if the user had searched for “robbery”, we would see a different set of documents in the results:
[
{
"_index" : "news",
"_id" : "hjrouIsBC1dvdsZHgWdm",
"_score" : 8.164355,
"_source" : {
"link" : "https://www.huffpost.com/entry/lady-gaga-dog-walker-reward_n_62d82efee4b000da23fafad7",
"headline" : "$5K Reward For Suspect In Shooting Of Lady Gaga’s Dog Walker",
"short_description" : "One of the men involved in the violent robbery was mistakenly released from custody in April and remains missing.",
"category" : "U.S. NEWS",
"authors" : "STEFANIE DAZIO, AP",
"country" : "IN",
"timestamp" : 1694863246
}
},
{
"_index" : "news",
"_id" : "YTrouIsBC1dvdsZHh2jf",
"_score" : 8.079888,
"_source" : {
"link" : "https://www.huffpost.com/entry/multiple-guns-robbery-wellston-market-missouri_n_62994cbee4b05fe694f296ad",
"headline" : "Man Robbed Of Assault Rifle At Gunpoint Opens Fire With Second Gun",
"short_description" : "The accused robber was struck multiple times, and two bystanders were wounded in the St. Louis shootout.",
"category" : "CRIME",
"authors" : "Mary Papenfuss",
"country" : "IN",
"timestamp" : 1691458552
}
},
{
"_index" : "news",
"_id" : "RzrouIsBC1dvdsZHf2cP",
"_score" : 7.4066687,
"_source" : {
"link" : "https://www.huffpost.com/entry/guard-cat-hailed-as-hero_n_62e9a515e4b00f4cf2352a6f",
"headline" : "Bandit The 'Guard Cat' Hailed As Hero After Thwarting Would-Be Robbery",
"short_description" : "When at least two people tried to break into a Tupelo, Mississippi, home last week, the cat did everything she could to alert its owner.",
"category" : "WEIRD NEWS",
"authors" : "",
"country" : "US",
"timestamp" : 1693070640
}
},
{
"_index" : "news",
"_id" : "WTrouIsBC1dvdsZHp2wd",
"_score" : 7.4066687,
"_source" : {
"link" : "https://www.huffpost.com/entry/san-francisco-news-crew-security-guard-shot-killed_n_61a2a9d8e4b0ae9a42af278a",
"headline" : "News Crew Security Guard Dies After Being Shot While Helping Robbery Coverage",
"short_description" : "Kevin Nishita was shot in the abdomen while doing his job amid an uptick in organized retail crime.",
"category" : "CRIME",
"authors" : "Daisy Nguyen, AP",
"country" : "US",
"timestamp" : 1692480894
}
}
]
1. "$5K Reward For Suspect In Shooting Of Lady Gaga's Dog Walker"
2. "Man Robbed Of Assault Rifle At Gunpoint Opens Fire With Second Gun"
3. "Bandit The 'Guard Cat' Hailed As Hero After Thwarting Would-Be Robbery"
4. "News Crew Security Guard Dies After Being Shot While Helping Robbery Coverage"
So in short, we get different results if the user searches for “robbery” than if the user searches for “robbed”. This is obviously not ideal: if the user searches for either of these (or anything related to “rob”), we should show all documents that contain the different forms of the word “rob” (called “inflected” forms).
Lack of understanding of what the user wants
“The goal of a designer is to listen, observe, understand, sympathize, empathize, synthesize, and glean insights that enable him or her to make the invisible visible.” — Hillman Curtis
We are trying to retrieve documents aligning with the user’s query, a task that extends beyond the user’s input alone. By delving into additional parameters, we gain a much richer understanding of our user’s preferences and needs.
For example, when a user searches for news, their interest likely extends beyond relevance alone; how recent the news article is often crucial. To enhance our search precision, we can fine-tune the scoring mechanism.
Moreover, we also have a country field in our articles. This field signifies the geographical origin of the news, presenting an opportunity to further refine our results. We can use this to boost articles from the user’s country.
Similar words are not returned
Since we are only returning articles that contain exactly the word the user queried for, we are likely missing relevant documents that contain similar words. For example, if I search for “theft”, I get the following articles:
[
{
"_index" : "news",
"_id" : "dzrouIsBC1dvdsZHiGh1",
"_score" : 8.079888,
"_source" : {
"link" : "https://www.huffpost.com/entry/ap-us-church-theft-beheaded-statue_n_629504abe4b0933e7376f2fa",
"headline" : "Angel Statue Beheaded At Church, $2 Million Relic Stolen",
"short_description" : "The New York City church says its stolen 18-carat gold relic was guarded by its own security system and is irreplaceable due to its historical and artistic value.",
"category" : "CRIME",
"authors" : "Michael R. Sisak, AP",
"country" : "IN",
"timestamp" : 1699477455
}
},
{
"_index" : "news",
"_id" : "ATrouIsBC1dvdsZHrG2n",
"_score" : 7.4066687,
"_source" : {
"link" : "https://www.huffpost.com/entry/joseph-sobolewski-charge-dropped-mountain-dew-felony_n_617a15e0e4b0657357447ee2",
"headline" : "Prosecutors Drop Felony Charge Against Man Accused Of 43 Cent Soda Theft",
"short_description" : "Joseph Sobolewski faced up to seven years in prison after paying $2 for a Mountain Dew that cost $2.29 plus tax.",
"category" : "U.S. NEWS",
"authors" : "Nick Visser",
"country" : "IN",
"timestamp" : 1698883200
}
},
{
"_index" : "news",
"_id" : "ZDrouIsBC1dvdsZH73Uq",
"_score" : 7.153779,
"_source" : {
"link" : "https://www.huffpost.com/entry/missing-lemur-found_n_5f8b2c33c5b6dc2d17f76bdb",
"headline" : "'There's A Lemur!' 5-Year-Old Helps Crack San Francisco Zoo Theft Case",
"short_description" : """The arthritic, 21-year-old lemur is "agitated" but safe, zoo staff say.""",
"category" : "U.S. NEWS",
"authors" : "",
"country" : "IN",
"timestamp" : 1698560597
}
}
]
The word “robbery” may have a different meaning than the word “theft”, but it’s still a relevant word, and the user may be interested in seeing articles with the word “robbery” as well (although at a lower relevance score than the documents that contain the exact word the user searched for).
There can be many words similar to “theft”, each with a different level of similarity. For example, “theft” may be more similar to “shoplifting” and less similar to “burglary”. But they can be synonymous in certain contexts and be of some relevance, though not as relevant as the exact word in the query, i.e. “theft”.
Our current search doesn’t consider the similarity between words in the documents and words in the query. If a user searches for “theft”, only the articles containing the word “theft” are returned, whereas we should also return articles containing words similar to “theft” (like “burglary” or “robbery”).
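One naive way to picture "similar words at a lower relevance" is to expand the query by hand. The sketch below builds a bool query whose should clauses search for the original term at full weight and for related terms at reduced weight. The RELATED_TERMS table and its boost values are invented purely for illustration, and this is not the approach this series ends up taking; it only shows the shape of the idea.

```python
import json

# Hypothetical, hand-written table: related term -> boost relative to the original.
RELATED_TERMS = {
    "theft": {"shoplifting": 0.7, "robbery": 0.5, "burglary": 0.5},
}

def expanded_query(term, fields):
    """Search for the term itself plus weaker matches on related terms."""
    clauses = [{"multi_match": {"query": term, "fields": fields, "boost": 1.0}}]
    for related, boost in RELATED_TERMS.get(term, {}).items():
        clauses.append(
            {"multi_match": {"query": related, "fields": fields, "boost": boost}}
        )
    # With only "should" clauses, a document must match at least one of them.
    return {"query": {"bool": {"should": clauses}}}

body = expanded_query("theft", ["headline", "short_description"])
print(json.dumps(body, indent=2))
```

Maintaining such a table by hand does not scale, which is part of the motivation for looking beyond purely lexical matching.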
Typos are ignored
“If the user can’t use it, it doesn’t work.” — Susan Dray
Another issue is that any typo by a user would return empty results. We know that users may accidentally make typos, and we don’t want to return empty results. For example, searching “robbey” on Google News still returns results related to “robbery”.
Different combinations of words have different meanings
Let’s look at an example. Assume the user made this query:
GET news/_search
{
"query": {
"multi_match": {
"query": "new jersey covid virus"
}
}
}
For you and me, it’s obvious the user wants to search for news related to “covid” or “virus” in “New Jersey”. But to our search engine, each of these words is just an independent term, and it has no way to understand that some of them belong together in a specific order (for example, “New” and “Jersey” in “New Jersey”).
Let’s look at the top three results,
{
"_index" : "news",
"_id" : "0jrouIsBC1dvdsZH03FH",
"_score" : 15.199991,
"_source" : {
"link" : "https://www.huffpost.com/entry/covid-new-york-new-jersey-trend_n_60611769c5b6531eed0621da",
"headline" : "Virus Fight Stalls In Early Hot Spots New York, New Jersey",
"short_description" : "New Jersey has been reporting about 647 new cases for every 100,000 residents over the past 14 days. New York has averaged 548.",
"category" : "U.S. NEWS",
"authors" : "Marina Villeneuve and Mike Catalini, AP",
"country" : "US",
"timestamp" : 1697056489
}
},
{
"_index" : "news",
"_id" : "zzrouIsBC1dvdsZH23Ig",
"_score" : 12.708103,
"_source" : {
"link" : "https://www.huffpost.com/entry/new-variants-raise-worry-about-covid-19-virus-reinfections_n_602193e6c5b689330e31dcc4",
"headline" : "New Variants Raise Worry About COVID-19 Virus Reinfections",
"short_description" : "Scientists discovered a new version of the virus in South Africa that’s more contagious and less susceptible to certain treatments.",
"category" : "WORLD NEWS",
"authors" : "Marilynn Marchione, AP`",
"country" : "IN",
"timestamp" : 1693063095
}
},
{
"_index" : "news",
"_id" : "fTrouIsBC1dvdsZH6HQF",
"_score" : 11.707885,
"_source" : {
"link" : "https://www.huffpost.com/entry/new-york-covid-19-religious-gatherings_n_5fbf42b8c5b66bb88c6430ac",
"headline" : "Supreme Court Blocks New York COVID-19 Restrictions On Religious Gatherings",
"short_description" : "It was the first major decision since Justice Amy Coney Barrett joined the nation's highest court.",
"category" : "POLITICS",
"authors" : "Lawrence Hurley, Reuters",
"country" : "IN",
"timestamp" : 1693362371
}
},
If you look carefully at the results above, you’ll notice that the second result, “New Variants Raise Worry About COVID-19 Virus Reinfections” is completely unrelated to New Jersey. In fact, after reading the description, it seems to be more related to COVID-19 infections in South Africa!
This happens because the words “COVID”, “virus”, and “New” all appear in the document, so it gets a high score even though it is not at all relevant to the user’s query. Our search system does not understand that the terms “New” and “Jersey” should be treated as a single term.
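A quick way to see why this happens is to treat the query as a bag of independent terms and count which ones each article contains. The sketch below does that for the first two results, using the headline and description text from above. The tokenizer is a toy approximation of an analyzer, and real scoring weighs terms rather than just counting them.

```python
import re

def tokenize(text):
    # Toy analyzer: lowercase, then split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

def matched_terms(query, text):
    """Query terms that appear anywhere in the text, ignoring order and adjacency."""
    doc_tokens = set(tokenize(text))
    return [term for term in tokenize(query) if term in doc_tokens]

query = "new jersey covid virus"

new_jersey_article = (
    "Virus Fight Stalls In Early Hot Spots New York, New Jersey "
    "New Jersey has been reporting about 647 new cases for every 100,000 "
    "residents over the past 14 days. New York has averaged 548."
)
south_africa_article = (
    "New Variants Raise Worry About COVID-19 Virus Reinfections "
    "Scientists discovered a new version of the virus in South Africa that’s "
    "more contagious and less susceptible to certain treatments."
)

print(matched_terms(query, new_jersey_article))
print(matched_terms(query, south_africa_article))
```

Both articles match three of the four query terms. Nothing in a term-by-term comparison says that “new” only matters here when it is followed by “jersey”.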
Improving our search
Boosting more relevant fields
“Words have a weight so if you are going to say something heavy, make sure to pick the right ones.” — Lang Leav
We can decide to boost certain fields or certain values which might be more useful in understanding what an article is about. For example, the headline of the article might be more meaningful than the description of the article.
Let’s take an example query. Assume the user is trying to search for elections. This would be our Elasticsearch query:
GET news/_search
{
"query": {
"multi_match": {
"query": "elections",
"type": "most_fields",
"fields": ["short_description", "headline"]
}
}
}
These are the results we get back-
{
"_index" : "news",
"_id" : "qDrouIsBC1dvdsZHwW-a",
"_score" : 15.736175,
"_source" : {
"link" : "https://www.huffpost.com/entry/new-york-city-board-of-elections-mess_n_60de223ee4b094dd26898361",
"headline" : "Why New York City’s Board Of Elections Is A Mess",
"short_description" : "“There’s a fundamental problem having partisan boards of elections,” said a New York elections attorney.",
"category" : "POLITICS",
"authors" : "Daniel Marans",
"country" : "IN",
"timestamp" : 1689878099
}
},
{
"_index" : "news",
"_id" : "8zrouIsBC1dvdsZH63Si",
"_score" : 7.729385,
"_source" : {
"link" : "https://www.huffpost.com/entry/20-funniest-tweets-from-women-oct-31-nov-6_n_5fa209fac5b686950033b3e7",
"headline" : "The 20 Funniest Tweets From Women This Week (Oct. 31-Nov. 6)",
"short_description" : ""Hear me out: epidurals, but for elections."",
"category" : "WOMEN",
"authors" : "Caroline Bologna",
"country" : "IN",
"timestamp" : 1694723723
}
},
{
"_index" : "news",
"_id" : "zzrouIsBC1dvdsZH8nVe",
"_score" : 7.353842,
"_source" : {
"link" : "https://www.huffpost.com/entry/childrens-books-elections-voting_l_5f728844c5b6f622a0c368a1",
"headline" : "25 Children's Books That Teach Kids About Elections And Voting",
"short_description" : "Parents can use these stories to educate their little ones about the American political process.",
"category" : "PARENTING",
"authors" : "Caroline Bologna",
"country" : "IN",
"timestamp" : 1697290393
}
},
If you look at the second article, “The 20 Funniest Tweets From Women This Week (Oct. 31-Nov. 6)”, you can see it doesn’t even seem to be about elections. However, because the word “elections” appears in its description, Elasticsearch deemed it a relevant result. Perhaps there’s room for improvement. It makes intuitive sense that articles whose headlines match the user’s query would be more relevant. To achieve this, we can instruct Elasticsearch to boost the headline field, essentially assigning it greater importance than the short_description field in score calculations.
This is pretty simple to do in our query-
GET news/_search
{
"query": {
"multi_match": {
"query": "elections",
"type": "most_fields",
"fields": ["headline^4", "short_description"]
}
}
}
Notice the headline^4 that I put in fields. This simply means that the headline field is boosted by a factor of 4, so a match in the headline counts four times as much as it otherwise would. Let’s look at the results now:
{
"_index" : "news",
"_id" : "qDrouIsBC1dvdsZHwW-a",
"_score" : 37.7977,
"_source" : {
"link" : "https://www.huffpost.com/entry/new-york-city-board-of-elections-mess_n_60de223ee4b094dd26898361",
"headline" : "Why New York City’s Board Of Elections Is A Mess",
"short_description" : "“There’s a fundamental problem having partisan boards of elections,” said a New York elections attorney.",
"category" : "POLITICS",
"authors" : "Daniel Marans",
"country" : "IN",
"timestamp" : 1689878099
}
},
{
"_index" : "news",
"_id" : "zzrouIsBC1dvdsZH8nVe",
"_score" : 29.415367,
"_source" : {
"link" : "https://www.huffpost.com/entry/childrens-books-elections-voting_l_5f728844c5b6f622a0c368a1",
"headline" : "25 Children's Books That Teach Kids About Elections And Voting",
"short_description" : "Parents can use these stories to educate their little ones about the American political process.",
"category" : "PARENTING",
"authors" : "Caroline Bologna",
"country" : "IN",
"timestamp" : 1697290393
}
},
{
"_index" : "news",
"_id" : "_jrouIsBC1dvdsZH3HKZ",
"_score" : 29.415367,
"_source" : {
"link" : "https://www.huffpost.com/entry/shirley-weber-first-black-california-secretary-of-state_n_6014651ec5b6aa4bad33e87b",
"headline" : "Shirley Weber Sworn In As California's First Black Elections Chief",
"short_description" : "She vacates her Assembly seat to be the new secretary of state, replacing Alex Padilla, who last week became the first Latino U.S. senator for California.",
"category" : "POLITICS",
"authors" : "Sarah Ruiz-Grossman",
"country" : "IN",
"timestamp" : 1697300728
}
},
{
"_index" : "news",
"_id" : "NzrouIsBC1dvdsZHnWvd",
"_score" : 26.402336,
"_source" : {
"link" : "https://www.huffpost.com/entry/josh-hawley-democrats-dont-accept-elections_n_61ea1949e4b01440a689bedc",
"headline" : "Sen. Josh Hawley Says, Without Irony, That Democrats Don't Accept Elections They Lose",
"short_description" : "The Missouri Republican led the charge on Jan. 6 to object to Joe *****'s win -- right after he saluted pro-***** protesters gathering at the U.S. Capitol.",
"category" : "POLITICS",
"authors" : "Josephine Harvey",
"country" : "IN",
"timestamp" : 1692046727
}
},
We can see now that all the top results contain the word “elections” in the headline, and thus the returned articles are more relevant.
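The scores above are consistent with a simple model: with type most_fields, each field contributes its own score, a boost multiplies that field's contribution, and the contributions are added up. The sketch below replays that arithmetic. The per-field scores are not printed by Elasticsearch in the output above; I inferred them from the two totals shown for the first article (15.736175 without the boost, 37.7977 with it), so treat them as an assumption.

```python
def most_fields_score(field_scores, boosts):
    """Sum the per-field scores, each multiplied by its boost (default 1)."""
    return sum(score * boosts.get(field, 1.0) for field, score in field_scores.items())

# Per-field scores inferred from the totals in the results above (assumption).
articles = {
    "Board Of Elections": {"headline": 7.353842, "short_description": 8.382333},
    "Funniest Tweets": {"short_description": 7.729385},
    "Children's Books": {"headline": 7.353842},
}

no_boost = {}
headline_boost = {"headline": 4.0}  # the effect of writing "headline^4"

for title, field_scores in articles.items():
    before = most_fields_score(field_scores, no_boost)
    after = most_fields_score(field_scores, headline_boost)
    print(f"{title:<20} before={before:.6f} after={after:.6f}")

ranking = sorted(
    articles,
    key=lambda title: most_fields_score(articles[title], headline_boost),
    reverse=True,
)
print(ranking)
```

The tweets article only matched in its description, so its score does not move, while every headline match is multiplied by four and overtakes it.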
Boosting based on functions
While we have boosted certain fields, we also want to introduce two new types of boosts based on what users want when searching for news.
- We want to boost articles from the user’s country. We don’t simply want to filter based on country since that might lead to irrelevant results appearing at the top, but we also don’t want to ignore it completely. In short, we want to give more weight to articles from the user’s country.
- We want to boost more recent news. We don’t simply want to sort based on recency since that also might lead to irrelevant results appearing at the top, instead, we want to balance recency with relevance.
Let’s see how to do this.
In Elasticsearch, we can use the function_score query to apply custom scoring functions, including boosting. The function_score query allows you to modify the score of documents based on various functions. To put it simply, we can boost certain documents based on conditions.
Let’s start by boosting the user’s country. Let’s assume the user’s country is “US” and plug it into the query when sending it to Elasticsearch. To achieve this, we need to add a function_score block, which allows custom scoring functions to be applied to the results of a query. We can define multiple functions for a given query, specifying conditions on matching the document and the boost value.
We can define a function to boost the user’s country:
{
"filter": {
"term": {
"country.keyword": "US"
}
},
"weight": 2
}
This multiplies the score of articles whose country is “US” by 2.
Next, let’s try to boost recent news. We can do this by using field_value_factor. The field_value_factor function allows us to use a field from a document to influence its score, which is precisely what we want. Let’s see how it looks:
{
"field_value_factor": {
"field": "timestamp",
"factor": 2
}
}
The factor parameter is the number the field’s value is multiplied by before it is applied to the score. With this function, documents with more recent (that is, larger) timestamps will be given higher scores.
Our full query becomes —
GET news/_search
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "covid",
"fields": ["headline^4", "short_description"]
}
},
{
"function_score": {
"query": {
"multi_match": {
"query": "covid",
"fields": ["headline^4", "short_description"]
}
},
"functions": [
{
"filter": {
"term": {
"country.keyword": "US"
}
},
"weight": 2
}, {
"field_value_factor": {
"field": "timestamp",
"factor": 2
}
}
]
}
}
]
}
}
}
Now recent documents and documents from the user’s country will be given a higher score. We can tune this balance by configuring the values for the weight and the factor fields.
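It helps to work through the arithmetic once. When neither score_mode nor boost_mode is set, function_score multiplies the values of the matching functions together and then multiplies the query score by the result. The sketch below mirrors the two functions used above; the query scores 7.4 and 8.1 are made-up numbers, and the helper name is mine.

```python
def function_score(query_score, country, timestamp,
                   user_country="US", weight=2.0, factor=2.0):
    """Mimic function_score with the default score_mode and boost_mode (multiply)."""
    combined = 1.0
    if country == user_country:
        combined *= weight          # the filter + weight function
    combined *= factor * timestamp  # field_value_factor with no modifier
    return query_score * combined

# Hypothetical text-relevance scores; countries and timestamps from sample documents.
us_article = function_score(7.4, "US", 1693070640)
in_article = function_score(8.1, "IN", 1694863246)
print(f"US article: {us_article:.3e}")
print(f"IN article: {in_article:.3e}")

# Relative difference that the timestamp alone makes between these two articles.
recency_gap = (1694863246 - 1693070640) / 1693070640
print(f"timestamp gap: {recency_gap:.4%}")
```

Two things stand out in this sketch. The raw epoch timestamp multiplies every score by a number in the billions, and two timestamps a few weeks apart differ by well under one percent, so the country weight moves the ranking far more than recency does. Note also that in the full query above the function_score clause sits in a bool must next to a plain multi_match, so the two clause scores are added together. If recency should matter more, it may be worth experimenting with the modifier option of field_value_factor or with the decay functions that function_score also offers.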
Fuzzy Queries
Next, let’s fix typos in the search query.
In Elasticsearch, we can perform fuzzy searches to retrieve documents that match a specified term even if there are slight variations in the spelling or characters. To do this, we can simply add a fuzziness field to our query. Our final query becomes —
GET news/_search
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "covi",
"fields": ["headline^4", "short_description"],
"fuzziness": 1
}
},
{
"function_score": {
"query": {
"multi_match": {
"query": "covi",
"fields": ["headline^4", "short_description"],
"fuzziness": 1
}
},
"functions": [
{
"filter": {
"term": {
"country.keyword": "US"
}
},
"weight": 2
}, {
"field_value_factor": {
"field": "timestamp",
"factor": 2
}
}
]
}
}
]
}
}
}
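The fuzziness value is an edit distance: the number of single-character changes needed to turn one term into another. The sketch below computes that distance with a standard dynamic-programming table, counting insertions, deletions, substitutions, and swaps of two adjacent characters as one edit each (Elasticsearch counts such swaps as a single edit by default). It is an illustration of the measure, not of how Elasticsearch finds fuzzy matches internally.

```python
def edit_distance(a, b):
    """Edits (insert, delete, substitute, adjacent swap) needed to turn a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters, e.g. "covdi" -> "covid".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

for typed, intended in [("covi", "covid"), ("robbey", "robbery"), ("covdi", "covid")]:
    print(typed, "->", intended, "=", edit_distance(typed, intended))
```

With "fuzziness": 1, a query for “covi” can therefore match documents containing “covid”, since the two are one edit apart.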
There is a lot more to spelling correction than simply adding fuzziness. Check out this blog post if you want to learn more.
Conclusion
In this blog post, we’ve dived into the nuts and bolts of Elasticsearch, starting with a hands-on look at a sample dataset and the basics of crafting search queries. We’ve demystified Elasticsearch responses and walked through a basic search query, laying the foundation for effective exploration.
As we explored lexical search, we recognized some quirks in our current search approach. To address these challenges, we introduced boosting and fuzziness — handy tools to fine-tune our searches and deal with real-world data complexities.
As we wrap up here, consider this a pit stop on our journey toward search excellence. In the next part, we’ll delve into advanced strategies to overcome specific issues in our current search approach. Brace yourself for the fascinating world of semantic search, where the focus shifts from just matching keywords to understanding the meaning behind them, paving the way for more intuitive and context-aware search experiences. Get ready to take your Elasticsearch adventure to the next level!
Enjoyed the journey through Elasticsearch? Follow me on Medium for more articles. For quicker bites of knowledge (tidbits about what I am reading, cheatsheets, etc.), follow me on LinkedIn, where I post regular short-form content (for example, while reading about Elasticsearch, I discussed how a particular scoring function called tf-idf works in a brief 5-minute post here). Let’s stay connected on this exploration of tech and data!
Mastering Elasticsearch: A Beginner’s Guide to Powerful Searches and Precision — Part 1 was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.