Elasticsearch Indexing - Sample Chapter
Elasticsearch Indexing
Improve search experiences with Elasticsearch's powerful indexing functionality - learn how with this practical Elasticsearch tutorial packed with tips!
Hüseyin Akdoğan
He started learning the Visual Basic language after QuickBasic and developed many applications until 2000, after which he stepped into the world of the Web with PHP. After this, he came across Java! In addition to his counseling and training activities since 2005, he has developed enterprise applications with Java EE technologies. His areas of expertise are JavaServer Faces, the Spring Framework, and big data technologies such as NoSQL and Elasticsearch. Along with these, he is also trying to specialize in other big data technologies. Hüseyin also writes articles on Java and big data technologies and works as a technical reviewer of big data books. He was a reviewer of one of the bestselling books, Mastering Elasticsearch, Second Edition.
Preface
The world that we live in is hungry for speed, efficiency, and accuracy. We want quick results, delivered ever faster, without compromising accuracy. This is exactly why I have written this book. I have penned down my years of experience in this book to give you an insight into how to use Elasticsearch more efficiently in today's big data world.
This book is targeted at experienced developers who have used Elasticsearch before and want to extend their knowledge about how to effectively perform Elasticsearch indexing. While reading this book, you'll explore different topics, all of which connect to efficient indexing and relevant search results in Elasticsearch. We will focus on understanding the document storage strategy and the analysis process in Elasticsearch. This book will help you understand what is going on behind the scenes when you send a document for indexing or make a query. In addition, this book will ensure a correct understanding of the meaning of schemaless by asking the question: is the claim that Elasticsearch stands for the schema-free model always true? After this, you will learn about the analysis process and analyzers. More importantly, this book will elaborate on the relationship between data analysis and relevant search results. By the end of this book, I believe you will be in a position to master and unleash this beast of a technology.
Chapter 4, Analysis and Analyzers, describes analyzers and the analysis process of Elasticsearch: what tokenizers, character filters, and token filters are, how to configure a custom analyzer, and what text normalization is. This chapter also describes the relationship between data analysis and relevant search results.
Chapter 5, Anatomy of an Elasticsearch Cluster, covers techniques to choose the right
number of shards and replicas and describes a node, the shard concept, replicas, and
how shard allocation works. It also explains the architecture of data distribution.
Chapter 6, Improving Indexing Performance, covers how to configure memory, how the JVM garbage collector works, why the garbage collector is so important for performance, and how to start tuning it. It also describes how to control the amount of I/O operations that Elasticsearch uses for segment merging, and covers the store module.
Chapter 7, Snapshot and Restore, covers the Elasticsearch snapshot and restore module,
how to define a snapshot repository, different repository types, the process of
snapshot and restore, and how to configure them. It also describes how the snapshot
process works.
Chapter 8, Improving the User Search Experience, introduces Elasticsearch suggesters,
which allow us to correct spelling mistakes and build efficient autocomplete
mechanisms. It also covers how to improve query relevance by using different
Elasticsearch functionalities such as boosting and synonyms.
Introducing analysis
As mentioned in Chapter 1, Introduction to Efficient Indexing, a huge amount of data is produced at every moment in today's world of information technology, on various platforms such as social media and by medium and large-sized companies that provide services in communication, health, security, and other areas. Moreover, such data is initially in an unstructured form.
We can see that this point of view on big data takes into account three basic needs:
Recording data with high performance
Accessing data with high performance
Analyzing data
Big data solutions are mostly related to these three basic needs. Data should be recorded with high performance so that it can also be accessed with high performance; however, that alone is not enough. To get the real meaning of the data, the data must be analyzed.
It is thanks to data analysis that well-established search engines like Google and social media platforms like Facebook and Twitter are so successful.
Let's consider Google as an example. Would you accept it if Google did not predict that you're looking for Barcelona when you search for the phrase barca, or if it did not offer a Did you mean suggestion when you make a spelling mistake?
To be honest, the answer is absolutely not. If a search engine does not predict what we're looking for, we switch to another search engine that can.
We're talking about subtle analysis, and more than that: the exact value Barca is not the same as the exact value barca. We are talking about understanding a search. For example, TR relates to Turkey, and a search for Jeffrey Jacob Abrams also relates to J.J. Abrams.
This is where the importance of data analysis becomes clear, because this kind of understanding can only be achieved through data analysis.
We will discuss the analysis process in Elasticsearch in the next sections.
Process of analysis
We mentioned in Chapter 1, Introduction to Efficient Indexing, and Chapter 2, What is an Elasticsearch Index, that all of Apache Lucene's data is stored in the inverted index. This means that the data has to be transformed, and the process of transforming data is called analysis. The analysis process relies on two basic pillars: tokenizing and normalizing. The first step of the analysis process is to break the text into tokens, using a tokenizer, after it has been processed by the character filters. Then, these tokens (that is, terms) are normalized to make them easily searchable.
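For instance, you can watch both steps at work with the Analyze API; the following is a minimal sketch, assuming a node running on the default local port, in which the standard analyzer first tokenizes the text and then lowercases the resulting terms:
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Lovers Are IMMORTAL'
The response lists the terms lovers, are, and immortal: the text was broken into tokens on word boundaries and each token was normalized to lowercase.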
Elasticsearch also gives you control at query time, because an analyzer can be specified when you run a query. This means that you can choose which analyzer to use at query time.
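For example, a match query can name the analyzer that should be applied to the query text; a minimal sketch (the index and field names here are only illustrative) could be:
curl -XGET 'localhost:9200/blog/_search?pretty' -d '{
  "query": {
    "match": {
      "content": {
        "query": "Hi guys",
        "analyzer": "whitespace"
      }
    }
  }
}'
Here, the query text is analyzed with the whitespace analyzer regardless of which analyzer was used when the content field was indexed.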
Keep in mind that choosing the correct analyzer is essential
for getting relevant results.
Built-in analyzers
Elasticsearch comes with several analyzers in its standard installation. Some of them are described in the following table:
Analyzer: Description
Standard Analyzer: The default analyzer. It splits text on word boundaries as defined by the Unicode Text Segmentation algorithm, removes most punctuation, and lowercases terms.
Simple Analyzer: Divides text at every non-letter character and lowercases the resulting terms.
Whitespace Analyzer: Divides text at whitespace characters only and does not lowercase terms.
Stop Analyzer: Behaves like the simple analyzer but additionally removes stop words (by default, English stop words).
Pattern Analyzer: Splits text using a regular expression that you configure.
Language Analyzers: A set of analyzers tailored to specific languages (for example, english, turkish, french) that apply language-specific stop words and stemming.
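To get a feel for how their outputs differ, you can run the same text through several of these analyzers with the Analyze API; for example (assuming a running node):
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'The QUICK Brown-Foxes jumped 2 times'
curl -XGET 'localhost:9200/_analyze?analyzer=whitespace&pretty' -d 'The QUICK Brown-Foxes jumped 2 times'
curl -XGET 'localhost:9200/_analyze?analyzer=simple&pretty' -d 'The QUICK Brown-Foxes jumped 2 times'
The standard analyzer produces the, quick, brown, foxes, jumped, 2, and times; the whitespace analyzer keeps The, QUICK, Brown-Foxes, jumped, 2, and times unchanged; and the simple analyzer, which splits at non-letters and lowercases, drops the digit and produces the, quick, brown, foxes, jumped, and times.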
Analyzers fulfill the following three main functions using character filters, a tokenizer, and token filters:
Filtering of characters
Tokenization
Filtering of tokens
Character filters
Character filters preprocess the text before it is passed to the tokenizer in the analysis process. Elasticsearch has built-in character filters. You can also create your own character filters to meet your needs.
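Besides the HTML Strip Char Filter shown next, Elasticsearch also provides a mapping character filter, which replaces configured characters or strings before tokenization; a minimal sketch of a custom analyzer that uses one (the index, filter, and analyzer names are illustrative) might look like this:
curl -XPUT localhost:9200/chars_index -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [ "& => and" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [ "my_mapping" ],
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}'
With this analyzer, a text such as rock & roll is rewritten to rock and roll before it reaches the tokenizer.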
Consider a piece of text in which HTML decimal codes are used instead of the Turkish and Latin accented characters. The original text is Âşıklar ölmez! (Translation: Lovers are immortal!) Let's see what result we get when the encoded text is analyzed with the standard tokenizer:
curl -XGET 'localhost:9200/_analyze?tokenizer=standard&pretty' -d '&#194;&#351;&#305;klar &#246;lmez!'
{
"tokens" : [ {
"token" : "194",
"start_offset" :2,
"end_offset" :5,
"type" : "<NUM>",
"position" : 1
}, {
"token" : "351",
"start_offset" :8,
"end_offset" :11,
"type" : "<NUM>",
"position" : 2
}, {
"token" : "305",
"start_offset" :14,
"end_offset" :17,
"type" : "<NUM>",
"position" : 3
}, {
"token" : "klar",
"start_offset" :18,
"end_offset" :22,
"type" : "<ALPHANUM>",
"position" : 4
}, {
"token" : "246",
"start_offset" :25,
"end_offset" :28,
"type" : "<NUM>",
"position" : 5
}, {
"token" : "lmez",
"start_offset" :29,
"end_offset" :33,
"type" : "<ALPHANUM>",
"position" : 6
} ]
}
As you can see, these results are not useful or user-friendly. Remember, if text is analyzed in this way, documents containing the word Âşıklar will not be returned when we search for the word âşıklar. In this case, we need a filter that decodes the HTML codes back into the characters they represent. The HTML Strip Char Filter performs this job, as shown:
curl -XGET 'localhost:9200/_analyze?tokenizer=standard&char_filters=html_strip&pretty' -d '&#194;&#351;&#305;klar &#246;lmez!'
{
"tokens" : [ {
"token" : "klar",
"start_offset" :0,
"end_offset" :22,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "lmez",
"start_offset" :23,
"end_offset" :33,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
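In everyday use, the character filter would typically be part of an analyzer defined in the index settings instead of being passed on the URL; a minimal sketch (the index and analyzer names are illustrative) is:
curl -XPUT localhost:9200/html_index -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_analyzer": {
          "tokenizer": "standard",
          "char_filter": [ "html_strip" ],
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}'
Any field mapped to html_analyzer will then have HTML markup and character entities stripped and decoded before tokenization.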
Tokenizer
A token is one of the basic concepts of lexical analysis in computer science: a sequence of characters (that is, a string) is turned into a sequence of tokens. For example, the string hello world becomes [hello, world]. Elasticsearch has several tokenizers that are used to divide a string into a stream of terms or tokens. A simple tokenizer may split the string into terms wherever it encounters word boundaries, whitespace, or punctuation.
Elasticsearch has built-in tokenizers. You can combine them with character filters to create custom analyzers. Some tokenizers are described in the following table:
Tokenizer: Description
Standard Tokenizer: Splits text on word boundaries, as defined by the Unicode Text Segmentation algorithm, and removes most punctuation symbols.
Letter Tokenizer: Divides text at every character that is not a letter.
Whitespace Tokenizer: Divides text at whitespace characters.
Pattern Tokenizer: Splits text using a configurable regular expression.
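As with analyzers, the Analyze API accepts a tokenizer parameter, so you can compare tokenizers directly on the same text; for example (assuming a running node):
curl -XGET 'localhost:9200/_analyze?tokenizer=letter&pretty' -d 'info@example.com calls 7/24'
curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&pretty' -d 'info@example.com calls 7/24'
The letter tokenizer splits at every non-letter character and produces info, example, com, and calls, while the whitespace tokenizer produces info@example.com, calls, and 7/24.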
Token filters
Token filters accept a stream of tokens from a tokenizer and can modify, add, or remove tokens. Elasticsearch has built-in token filters. Some token filters are described in the following table:
Token Filter: Description
Normalization Token Filters: Normalize the Unicode representation of tokens. The most commonly used forms are NFC (canonical composition), NFD (canonical decomposition), NFKC (compatibility composition), and NFKD (compatibility decomposition); in Elasticsearch, these forms are provided through the ICU analysis plugin.
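With the ICU analysis plugin installed, a custom analyzer can apply one of these normalization forms through the icu_normalizer token filter; a minimal sketch of such a configuration (the index, filter, and analyzer names are illustrative) is:
curl -XPUT localhost:9200/icu_index -d '{
  "settings": {
    "analysis": {
      "filter": {
        "nfkc_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc"
        }
      },
      "analyzer": {
        "my_normalizer": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "nfkc_normalizer" ]
        }
      }
    }
  }
}'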
The preceding configuration normalizes all tokens into the NFKC normalization form.
If you want more information about ICU, refer to http://site.icu-project.org. If you want to examine the plugin, refer to https://github.com/elastic/elasticsearch-analysis-icu.
"type" : "<ALPHANUM>",
"position" : 14
}, {
"token" : "une",
"start_offset" :73,
"end_offset" :76,
"type" : "<ALPHANUM>",
"position" : 15
}, {
"token" : "situation",
"start_offset" :77,
"end_offset" :86,
"type" : "<ALPHANUM>",
"position" : 16
}, {
"token" : "presente",
"start_offset" :87,
"end_offset" :95,
"type" : "<ALPHANUM>",
"position" : 17
} ]
}
As you can see, even though a user may enter déjà, the filter converts it to deja; likewise, été is converted to ete. The ASCII Folding token filter doesn't require any configuration but, if desired, you can include it directly in a custom analyzer as follows:
curl -XPUT localhost:9200/my_index -d '{
"settings": {
"analysis": {
"analyzer": {
"folding": {
"tokenizer": "standard",
"filter": [ "lowercase", "asciifolding" ]
}
}
}
}
}'
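Once the index has been created with these settings, you can check the analyzer's behavior with the Analyze API (a sketch, assuming the my_index settings above were applied):
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=folding&pretty' -d 'Déjà vu'
The returned terms are deja and vu; the text is lowercased and the accented characters are folded to their ASCII equivalents.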
An Analyzer Pipeline
If we have a good grasp of the analysis process described so far, the pipeline of an analyzer works as follows: the text is first passed through the character filters, the tokenizer then breaks it into tokens, and finally the token filters modify, add, or remove those tokens.
"content": {
"type": "string", "index_analyzer": "whitespace", "search_
analyzer": "standard"
}
}
}
}
}'
{"acknowledged":true}
With the preceding configuration, we defined the simple analyzer for the title field and the whitespace analyzer for the content field. The search analyzer for the content field refers to the standard analyzer.
Now, we will add a document to the blog index as follows:
curl -XPOST localhost:9200/blog/article -d '{
"title": "My boss's job was eliminated",
"content": "Hi guys. My boss's job at the office was eliminated due to budget cuts."
}'
{"_index":"blog","_type":"article","_id":"AU-bQRaEOIfz36vMy16h","_version":1,"created":true}
Now, we will search for the word boss's in the title field:
curl -XGET localhost:9200/blog/_search?pretty -d '{
"query": {
"match": {
"title": "boss's"
}
}
}'
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
As you can see, the simple analyzer broke the term at the apostrophe. Now, let's search for the word guys in the content field to get the same document:
curl -XGET localhost:9200/blog/_search -d '{
"query": {
"match": {
"content": "guys"
}
}
}'
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
We have a document that contains the word guys in the content field, but the document is not returned by the query. Let's see how the Hi guys. sentence is analyzed in the content field using the Analyze API:
curl -XGET 'localhost:9200/blog/_analyze?field=content&text=Hi+guys.&pretty'
{
"tokens": [
{
"token": "Hi",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "guys.",
"start_offset": 3,
"end_offset": 8,
"type": "word",
"position": 2
}
]
}
As you can see, the whitespace analyzer split the text only on whitespace and did not remove the punctuation, so the term guys. (with the trailing dot) was stored in the index, while the standard search analyzer produced the term guys at search time. If we recreate the blog index with the following configuration, both of the preceding queries will return the document:
curl -XDELETE localhost:9200/blog
{"acknowledged":true}
curl -XPUT localhost:9200/blog -d '{
"mappings": {
"article": {
"properties": {
"title": {
"type": "string", "index_analyzer": "simple", "search_analyzer": "simple"
},
"content": {
"type": "string"
}
}
}
}
}'
In the preceding configuration, we defined the simple analyzer for the title field for both indexing and search operations. By default, Elasticsearch applies the standard analyzer to the fields of a document; this is why we did not define an analyzer for the content field. Now the document will be returned to us when we search for the word boss's.
To summarize this example: when we first searched for the word boss's in the title field, Elasticsearch did not return any document to us because we used the simple analyzer for indexing on the title field, and this analyzer divides the text at non-letters. That means boss's was divided at the apostrophe by the simple analyzer. However, the title field used the standard analyzer at search time; remember that we did not define a search analyzer for the title field initially. So, the document was not returned to us because we used two analyzers with different behaviors for indexing and searching. By eliminating these differences, the document was returned to us.
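You can verify this behavior directly with the Analyze API by running the same phrase through both analyzers (assuming a running node):
curl -XGET 'localhost:9200/_analyze?analyzer=simple&pretty' -d "boss's job"
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d "boss's job"
The simple analyzer produces the terms boss, s, and job, whereas the standard analyzer produces boss's and job, so the query term boss's had no matching term in the inverted index.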
Keep in mind that using the same analyzer at index time and at search time is very important so that the terms of the query match the terms in the inverted index.
"pattern":"j2ee|javaee(.*)",
"replacement":"java enterprise edition $1"
}
},
"analyzer" : {
"my_custom_analyzer" : {
"tokenizer" : "standard",
"filter":
["lowercase"],
"char_filter" : ["my_pattern"]
}
}
}
}
}'
{"acknowledged":true}
Summary
In this chapter, we looked at the analysis process and reviewed the building blocks of an analyzer. After this, we learned what character filters, tokenizers, and token filters are, and how to specify different analyzers for separate fields. Finally, we saw how to create a custom analyzer. In the next chapter, you'll discover the anatomy of an Elasticsearch cluster: what a shard is, what a replica shard is, what function a replica shard performs, and so on. In addition, we will examine the questions: how do we configure a cluster correctly, and how do we determine the correct number of shards and replicas? We will also look at some relevant cases related to this topic.