Elasticsearch Indexing - Sample Chapter
Elasticsearch Indexing
Improve search experiences with Elasticsearch's powerful indexing functionality - learn how with this practical Elasticsearch tutorial packed with tips!
Hüseyin Akdoğan
He started learning the Visual Basic language after QuickBasic and developed many applications until 2000, after which he stepped into the world of the Web with PHP. After this, he came across Java! In addition to his counseling and training activities since 2005, he has developed enterprise applications with Java EE technologies. His areas of expertise are JavaServer Faces, the Spring Framework, and big data technologies such as NoSQL and Elasticsearch. Along with these, he is also trying to specialize in other big data technologies. Hüseyin also writes articles on Java and big data technologies and works as a technical reviewer of big data books. He was a reviewer of one of the bestselling books, Mastering Elasticsearch, Second Edition.
Preface
The world that we live in is hungry for speed, efficiency, and accuracy. We want quick results, delivered ever faster, without compromising accuracy. This is exactly why I have written this book. I have penned down my years of experience in this book to give you an insight into how to use Elasticsearch more efficiently in today's big data world.
This book is targeted at experienced developers who have used Elasticsearch before and want to extend their knowledge about how to effectively perform Elasticsearch indexing. While reading this book, you'll explore different topics, all of which connect to efficient indexing and relevant search results in Elasticsearch. We will focus on understanding the document storage strategy and the analysis process in Elasticsearch. This book will help you understand what is going on behind the scenes when you send a document for indexing or make a query. In addition, this book will ensure a correct understanding of the meaning of schemaless by asking the question: is the claim that Elasticsearch stands for the schema-free model always true? After this, you will learn about the analysis process and analyzers. More importantly, this book will elaborate on the relationship between data analysis and relevant search results. By the end of this book, I believe you will be in a position to master and unleash this beast of a technology.
Chapter 4, Analysis and Analyzers, describes analyzers and the analysis process of Elasticsearch: what tokenizers, character filters, and token filters are, how to configure a custom analyzer, and what text normalization is. This chapter also describes the relationship between data analysis and relevant search results.
Chapter 5, Anatomy of an Elasticsearch Cluster, covers techniques to choose the right
number of shards and replicas and describes a node, the shard concept, replicas, and
how shard allocation works. It also explains the architecture of data distribution.
Chapter 6, Improving Indexing Performance, covers how to configure memory, how the JVM garbage collector works, why the garbage collector is so important for performance, and how to start tuning it. It also describes how to control the amount of I/O operations that Elasticsearch uses for segment merging, and covers the store module.
Chapter 7, Snapshot and Restore, covers the Elasticsearch snapshot and restore module,
how to define a snapshot repository, different repository types, the process of
snapshot and restore, and how to configure them. It also describes how the snapshot
process works.
Chapter 8, Improving the User Search Experience, introduces Elasticsearch suggesters,
which allow us to correct spelling mistakes and build efficient autocomplete
mechanisms. It also covers how to improve query relevance by using different
Elasticsearch functionalities such as boosting and synonyms.
Introducing analysis
As mentioned in Chapter 1, Introduction to Efficient Indexing, a huge amount of data is produced at every moment in today's world of information technology, on various platforms such as social media and by medium and large-sized companies that provide services in communication, health, security, and other areas. Moreover, such data is initially in an unstructured form.
We can see that this point of view on big data takes into account three basic needs:
Recording data with high performance
Accessing data with high performance
Analyzing data
Big data solutions are mostly related to these three basic needs. Data should be recorded with high performance so that it can also be accessed with high performance; however, that alone is not enough. To get the real meaning of the data, the data must be analyzed.
It is thanks to data analysis that well-established search engines like Google and social media platforms like Facebook and Twitter are so successful.
Let's consider Google as an example. Would you accept it if Google did not predict that you're looking for Barcelona when you search for the phrase barca, or if it did not offer a Did you mean suggestion when you make a spelling mistake?
To be honest, the answer is absolutely not. If a search engine does not predict what we're looking for, we switch to another search engine that can.
We're talking about subtle analysis, and more than that: the exact value Barca is not the same as the exact value barca. We are talking about understanding a search. For example, TR relates to Turkey, and a search for Jeffrey Jacob Abrams also relates to J.J. Abrams.
This is where the importance of data analysis becomes clear, because this kind of understanding can only be achieved through data analysis.
We will discuss the analysis process in Elasticsearch in the next sections.
Process of analysis
We mentioned in Chapter 1, Introduction to Efficient Indexing, and Chapter 2, What is an Elasticsearch Index, that all of Apache Lucene's data is stored in the inverted index. This means that the data has to be transformed, and the process of transforming data is called analysis. The analysis process relies on two basic pillars: tokenizing and normalizing. The first step of the analysis process is to break the text into tokens, using a tokenizer, after it has been processed by the character filters. Then, these tokens (that is, terms) are normalized to make them easily searchable.
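For instance, you can watch both steps at work with the Analyze API; the following is a minimal sketch, assuming a node running on the default local port, in which the standard analyzer first tokenizes the text and then lowercases the resulting terms:
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Lovers Are IMMORTAL'
The response lists the terms lovers, are, and immortal: the text was broken into tokens on word boundaries and each token was normalized to lowercase.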
Elasticsearch also gives you control at query time, because an analyzer can be specified when you run a query. This means that you can choose which analyzer to use at query time.
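For example, a match query can name the analyzer that should be applied to the query text; a minimal sketch (the index and field names here are only illustrative) could be:
curl -XGET 'localhost:9200/blog/_search?pretty' -d '{
  "query": {
    "match": {
      "content": {
        "query": "Hi guys",
        "analyzer": "whitespace"
      }
    }
  }
}'
Here, the query text is analyzed with the whitespace analyzer regardless of which analyzer was used when the content field was indexed.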
Keep in mind that choosing the correct analyzer is essential
for getting relevant results.
Built-in analyzers
Elasticsearch comes with several analyzers in its standard installation. Some of them are described in the following table:
Analyzer: Description
Standard Analyzer: The default analyzer. It splits text on word boundaries as defined by the Unicode Text Segmentation algorithm, removes most punctuation, and lowercases terms.
Simple Analyzer: Divides text at every non-letter character and lowercases the resulting terms.
Whitespace Analyzer: Divides text at whitespace characters only and does not lowercase terms.
Stop Analyzer: Behaves like the simple analyzer but additionally removes stop words (by default, English stop words).
Pattern Analyzer: Splits text using a regular expression that you configure.
Language Analyzers: A set of analyzers tailored to specific languages (for example, english, turkish, french) that apply language-specific stop words and stemming.
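To get a feel for how their outputs differ, you can run the same text through several of these analyzers with the Analyze API; for example (assuming a running node):
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'The QUICK Brown-Foxes jumped 2 times'
curl -XGET 'localhost:9200/_analyze?analyzer=whitespace&pretty' -d 'The QUICK Brown-Foxes jumped 2 times'
curl -XGET 'localhost:9200/_analyze?analyzer=simple&pretty' -d 'The QUICK Brown-Foxes jumped 2 times'
The standard analyzer produces the, quick, brown, foxes, jumped, 2, and times; the whitespace analyzer keeps The, QUICK, Brown-Foxes, jumped, 2, and times unchanged; and the simple analyzer, which splits at non-letters and lowercases, drops the digit and produces the, quick, brown, foxes, jumped, and times.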
Analyzers fulfill the following three main functions using character filters, a tokenizer, and token filters:
Filtering of characters
Tokenization
Filtering of tokens
Character filters
Character filters preprocess the text before it is passed to the tokenizer in the analysis process. Elasticsearch has built-in character filters. You can also create your own character filters to meet your needs.
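Besides the HTML Strip Char Filter shown next, Elasticsearch also provides a mapping character filter, which replaces configured characters or strings before tokenization; a minimal sketch of a custom analyzer that uses one (the index, filter, and analyzer names are illustrative) might look like this:
curl -XPUT localhost:9200/chars_index -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [ "& => and" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [ "my_mapping" ],
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}'
With this analyzer, a text such as rock & roll is rewritten to rock and roll before it reaches the tokenizer.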
Consider a piece of text in which HTML decimal codes are used instead of the Turkish and Latin accented characters. The original text is Âşıklar ölmez! (Translation: Lovers are immortal!) Let's see what result we get when the encoded text is analyzed with the standard tokenizer:
curl -XGET 'localhost:9200/_analyze?tokenizer=standard&pretty' -d '&#194;&#351;&#305;klar &#246;lmez!'
{
"tokens" : [ {
"token" : "194",
"start_offset" :2,
"end_offset" :5,
"type" : "<NUM>",
"position" : 1
}, {
"token" : "351",
"start_offset" :8,
"end_offset" :11,
"type" : "<NUM>",
"position" : 2
}, {
"token" : "305",
"start_offset" :14,
"end_offset" :17,
"type" : "<NUM>",
"position" : 3
}, {
"token" : "klar",
"start_offset" :18,
"end_offset" :22,
"type" : "<ALPHANUM>",
"position" : 4
}, {
"token" : "246",
"start_offset" :25,
"end_offset" :28,
"type" : "<NUM>",
"position" : 5
}, {
"token" : "lmez",
"start_offset" :29,
"end_offset" :33,
"type" : "<ALPHANUM>",
"position" : 6
} ]
}
As you can see, these results are not useful or user-friendly. Remember, if text is analyzed in this way, documents containing the word Âşıklar will not be returned when we search for the word âşıklar. In this case, we need a filter that decodes the HTML codes back into the characters they represent. The HTML Strip Char Filter performs this job, as shown:
curl -XGET 'localhost:9200/_analyze?tokenizer=standard&char_filters=html_strip&pretty' -d '&#194;&#351;&#305;klar &#246;lmez!'
{
"tokens" : [ {
"token" : "klar",
"start_offset" :0,
"end_offset" :22,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "lmez",
"start_offset" :23,
"end_offset" :33,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
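In everyday use, the character filter would typically be part of an analyzer defined in the index settings instead of being passed on the URL; a minimal sketch (the index and analyzer names are illustrative) is:
curl -XPUT localhost:9200/html_index -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_analyzer": {
          "tokenizer": "standard",
          "char_filter": [ "html_strip" ],
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}'
Any field mapped to html_analyzer will then have HTML markup and character entities stripped and decoded before tokenization.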
Tokenizer
A token is one of the basic concepts of lexical analysis in computer science: a sequence of characters (that is, a string) is turned into a sequence of tokens. For example, the string hello world becomes [hello, world]. Elasticsearch has several tokenizers that are used to divide a string into a stream of terms or tokens. A simple tokenizer may split the string into terms wherever it encounters word boundaries, whitespace, or punctuation.
Elasticsearch has built-in tokenizers. You can combine them with character filters to create custom analyzers. Some tokenizers are described in the following table:
Tokenizer: Description
Standard Tokenizer: Splits text on word boundaries, as defined by the Unicode Text Segmentation algorithm, and removes most punctuation symbols.
Letter Tokenizer: Divides text at every character that is not a letter.
Whitespace Tokenizer: Divides text at whitespace characters.
Pattern Tokenizer: Splits text using a configurable regular expression.
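As with analyzers, the Analyze API accepts a tokenizer parameter, so you can compare tokenizers directly on the same text; for example (assuming a running node):
curl -XGET 'localhost:9200/_analyze?tokenizer=letter&pretty' -d 'info@example.com calls 7/24'
curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&pretty' -d 'info@example.com calls 7/24'
The letter tokenizer splits at every non-letter character and produces info, example, com, and calls, while the whitespace tokenizer produces info@example.com, calls, and 7/24.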
Token filters
Token filters accept a stream of tokens from a tokenizer and can modify, add, or remove tokens. Elasticsearch has built-in token filters. Some token filters are described in the following table:
Token Filter: Description
Normalization Token Filters: Normalize the Unicode representation of tokens. The most commonly used forms are NFC (canonical composition), NFD (canonical decomposition), NFKC (compatibility composition), and NFKD (compatibility decomposition); in Elasticsearch, these forms are provided through the ICU analysis plugin.
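With the ICU analysis plugin installed, a custom analyzer can apply one of these normalization forms through the icu_normalizer token filter; a minimal sketch of such a configuration (the index, filter, and analyzer names are illustrative) is:
curl -XPUT localhost:9200/icu_index -d '{
  "settings": {
    "analysis": {
      "filter": {
        "nfkc_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc"
        }
      },
      "analyzer": {
        "my_normalizer": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "nfkc_normalizer" ]
        }
      }
    }
  }
}'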
The preceding configuration normalizes all tokens into the NFKC normalization form.
If you want more information about ICU, refer to http://site.icu-project.org. If you want to examine the plugin, refer to https://github.com/elastic/elasticsearch-analysis-icu.
"type" : "<ALPHANUM>",
"position" : 14
}, {
"token" : "une",
"start_offset" :73,
"end_offset" :76,
"type" : "<ALPHANUM>",
"position" : 15
}, {
"token" : "situation",
"start_offset" :77,
"end_offset" :86,
"type" : "<ALPHANUM>",
"position" : 16
}, {
"token" : "presente",
"start_offset" :87,
"end_offset" :95,
"type" : "<ALPHANUM>",
"position" : 17
} ]
}
As you can see, even though a user may enter déjà, the filter converts it to deja; likewise, été is converted to ete. The ASCII Folding token filter doesn't require any configuration but, if desired, you can include it directly in a custom analyzer as follows:
curl -XPUT localhost:9200/my_index -d '{
"settings": {
"analysis": {
"analyzer": {
"folding": {
"tokenizer": "standard",
"filter": [ "lowercase", "asciifolding" ]
}
}
}
}
}'
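Once the index has been created with these settings, you can check the analyzer's behavior with the Analyze API (a sketch, assuming the my_index settings above were applied):
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=folding&pretty' -d 'Déjà vu'
The returned terms are deja and vu; the text is lowercased and the accented characters are folded to their ASCII equivalents.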
An Analyzer Pipeline
If we have a good grasp of the analysis process described so far, the pipeline of an analyzer works as follows: the text is first passed through the character filters, the tokenizer then breaks it into tokens, and finally the token filters modify, add, or remove those tokens.
"content": {
"type": "string", "index_analyzer": "whitespace", "search_
analyzer": "standard"
}
}
}
}
}'
{"acknowledged":true}
With the preceding configuration, we defined the simple analyzer for the title field and the whitespace analyzer for the content field. The search analyzer for the content field refers to the standard analyzer.
Now, we will add a document to the blog index as follows:
curl -XPOST localhost:9200/blog/article -d '{
"title": "My boss's job was eliminated",
"content": "Hi guys. My boss's job at the office was eliminated due to budget cuts."
}'
{"_index":"blog","_type":"article","_id":"AU-bQRaEOIfz36vMy16h","_version":1,"created":true}
Now, we will search for the word boss's in the title field:
curl -XGET localhost:9200/blog/_search?pretty -d '{
"query": {
"match": {
"title": "boss's"
}
}
}'
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
As you can see, the simple analyzer broke the term at the apostrophe. Now, let's search for the word guys in the content field to get the same document:
curl -XGET localhost:9200/blog/_search -d '{
"query": {
"match": {
"content": "guys"
}
}
}'
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
We have a document that contains the word guys in the content field, but the document is not returned by the query. Let's see how the Hi guys. sentence is analyzed in the content field using the Analyze API:
curl -XGET 'localhost:9200/blog/_analyze?field=content&text=Hi+guys.&pretty'
{
"tokens": [
{
"token": "Hi",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "guys.",
"start_offset": 3,
"end_offset": 8,
"type": "word",
"position": 2
}
]
}
As you can see, the whitespace analyzer split the text only on whitespace and did not remove the punctuation, so the term guys. (with the trailing dot) was stored in the index, while the standard search analyzer produced the term guys at search time. If we recreate the blog index with the following configuration, both of the preceding queries will return the document:
curl -XDELETE localhost:9200/blog
{"acknowledged":true}
curl -XPUT localhost:9200/blog -d '{
"mappings": {
"article": {
"properties": {
"title": {
"type": "string", "index_analyzer": "simple", "search_analyzer": "simple"
},
"content": {
"type": "string"
}
}
}
}
}'
In the preceding configuration, we defined the simple analyzer for the title field for both indexing and search operations. By default, Elasticsearch applies the standard analyzer to the fields of a document; this is why we did not define an analyzer for the content field. Now the document will be returned to us when we search for the word boss's.
To summarize this example: when we first searched for the word boss's in the title field, Elasticsearch did not return any document to us because we used the simple analyzer for indexing on the title field, and this analyzer divides the text at non-letters. That means boss's was divided at the apostrophe by the simple analyzer. However, the title field used the standard analyzer at search time; remember that we did not define a search analyzer for the title field initially. So, the document was not returned to us because we used two analyzers with different behaviors for indexing and searching. By eliminating these differences, the document was returned to us.
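You can verify this behavior directly with the Analyze API by running the same phrase through both analyzers (assuming a running node):
curl -XGET 'localhost:9200/_analyze?analyzer=simple&pretty' -d "boss's job"
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d "boss's job"
The simple analyzer produces the terms boss, s, and job, whereas the standard analyzer produces boss's and job, so the query term boss's had no matching term in the inverted index.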
Keep in mind that using the same analyzer at index time and at search time is very important so that the terms of the query match the terms in the inverted index.
"pattern":"j2ee|javaee(.*)",
"replacement":"java enterprise edition $1"
}
},
"analyzer" : {
"my_custom_analyzer" : {
"tokenizer" : "standard",
"filter":
["lowercase"],
"char_filter" : ["my_pattern"]
}
}
}
}
}'
{"acknowledged":true}
Summary
In this chapter, we looked at the analysis process and reviewed the building blocks of an analyzer. After this, we learned what character filters, tokenizers, and token filters are, and how to specify different analyzers for separate fields. Finally, we saw how to create a custom analyzer. In the next chapter, you'll discover the anatomy of an Elasticsearch cluster: what a shard is, what a replica shard is, what function a replica shard performs, and so on. In addition, we will examine the questions: how do we configure a cluster correctly, and how do we determine the correct number of shards and replicas? We will also look at some relevant cases related to this topic.