Elasticsearch

Elasticsearch is one of the most popular open-source search technologies; it lets you build and deploy efficient, robust search quickly.

A web crawler crawls across pages, following links as it finds them, in order to build a massive corpus of documents. Every document found by the crawler is indexed: it is parsed and tokenized, and the individual terms are extracted and stored in a data structure called an inverted index. The inverted index is a mapping from a term to the documents where that term is found. It is "inverted" because it goes from the search term back to the document. Based on what a user searches for, every document in the corpus is assigned a relevance score.

Inverted Index - the heart of any search engine. The mapping of terms to their frequencies and the corresponding source documents is the inverted index. Dictionary - the terms and their corresponding frequencies are called the dictionary. Postings - the source documents where a term is found; in search terminology this is called the posting list.
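The structure described above can be sketched in a few lines of Python. This is a toy model, not Elasticsearch internals: real indexes also store term frequencies and positions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it (the posting list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick dog",
}
index = build_inverted_index(docs)
print(sorted(index["quick"]))  # -> [1, 3]: documents containing "quick"
print(sorted(index["dog"]))    # -> [2, 3]
```

Here the keys of `index` form the dictionary, and each value is the posting list for that term.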

Lucene - Elasticsearch is built on top of Lucene. Other technologies also build on Lucene and have their own ecosystems: Solr, which offers distributed indexing, load balancing, replication, centralised configuration, etc.; Nutch, which provides web crawling and index parsing; and CrateDB, an open-source distributed SQL database.

Distributed - It runs on multiple nodes within a cluster and can scale to thousands of nodes, which means that the performance of your search operations can scale linearly with the number of nodes that you add to the cluster, an important consideration as your data grows in size.

Highly available and fault tolerant, because multiple copies of your data are stored within the cluster; every index is replicated.

REST API - used to perform operations such as creating, retrieving, updating, and deleting indices and their corresponding documents. Administrative, search, and analysis operations can all be performed with simple REST calls, passing data in JSON format.

Query DSL - a domain-specific language that allows you to express very complex queries in a simple manner using JSON. Elasticsearch is also schema-less.

During ES setup, set the node name (node.name) and the cluster name (cluster.name); the default cluster name is 'elasticsearch'. You scale the number of machines in a cluster by having multiple nodes join the same cluster, identified by the cluster name. Nodes on the same network automatically discover each other by exchanging messages.

INDEX

The index in Elasticsearch refers to a collection of similar documents. Documents need not be exactly the same. An index within a cluster is uniquely identified by its name, and you can have any number of indices for the same cluster.

Typically, you set up your indices by logically breaking up your data based on what you search within it. In an e-commerce site, for example, you might have your catalog in one index, all your customer information in a second (possibly internal) index, and inventory management in a third. Design your indices around broad, top-level categories. Within one index you can have multiple document types; think of these as a logical partitioning of documents within the index. For example, in a blogging site, blog posts might be one document type and blog comments another.

Document types are user defined and based on the semantics of your application. Typically, all documents which have the same set of fields belong to one type. A document type is a collection of documents with the same characteristics within a larger index.

DOCUMENT

The document is simply a container of text that needs to be searchable. A document in Elasticsearch is expressed using JSON (JavaScript Object Notation), a standard format for wire transfer between a client and a server. All documents in Elasticsearch reside within an index. In fact, every document requires both a document type and an index that it belongs to; the index and the document type together identify a set of documents.

If the number of documents that you want to search is very large, or the size of each document is huge, then your index may be too large to fit on a single node. A single node has other disadvantages as well. It's possible that if you try to serve all your search requests from one node, your search operation will be incredibly slow.

The way to solve both of these problems is to split your index across multiple machines in the cluster, a process called sharding. Each node then holds one shard of the index, which means every node contains only a subset of your index data.

An index in Elasticsearch can be split up across multiple machines, and individual portions of the index on these machines are called shards. A shard can be replicated zero or more times. You can have as many replicas as you think you need. By default, an index in Elasticsearch has five shards and one replica. So every shard has one backup copy.

Each shard is a self-contained Lucene index of its own.
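The arithmetic behind shard counts is worth spelling out. A minimal sketch (helper name is my own) using the defaults described above, five primaries and one replica:

```python
def total_shard_copies(primaries, replicas):
    """Each primary shard is copied `replicas` times, so the cluster
    stores primaries * (1 + replicas) shard copies in total."""
    return primaries * (1 + replicas)

# Defaults described above: 5 primary shards, 1 replica each.
print(total_shard_copies(5, 1))  # -> 10 shard copies spread across the nodes
```

Elasticsearch spreads these copies across nodes so that a primary and its replica never sit on the same node.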

Monitoring Health

Cluster status of Green - all primary shards and their replicas are allocated; the cluster is fully functional

Yellow - functional, but some replicas are not allocated, so redundancy is reduced

Red - some primary shards are not allocated; part of your data is unavailable

Format of a REST API command:

<REST Verb> /<Index>/<Type>/<ID>

curl localhost:9200 - provides the system status of ES

curl -XGET 'localhost:9200/_cat/nodes?v&pretty' - provides the health of the ES nodes (the quotes stop the shell from interpreting the &)

List indices - curl -XGET 'localhost:9200/_cat/indices?v&pretty'

Note: PUT is idempotent; it produces the same result even when run several times. POST is not idempotent, and is used when the server should decide the outcome, e.g. creating a document with an auto-generated ID.
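The idempotency distinction can be modelled with a toy in-memory store (this is an illustration, not the ES client): a PUT-style write with a caller-supplied ID overwrites in place, while a POST-style write with a server-assigned ID creates a new document every time.

```python
import itertools

store = {}
auto_id = itertools.count(1)

def put(doc_id, doc):
    """PUT-style index: caller supplies the ID, so repeats overwrite in place."""
    store[doc_id] = doc

def post(doc):
    """POST-style index: server assigns a fresh ID, so repeats create new docs."""
    store[next(auto_id)] = doc

put("a", {"name": "iPhone 7"})
put("a", {"name": "iPhone 7"})   # idempotent: still one document
print(len(store))                # -> 1

post({"name": "iPhone 7"})
post({"name": "iPhone 7"})       # not idempotent: two new documents
print(len(store))                # -> 3
```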

CRUD OPERATION

Create Index - curl -XPUT 'localhost:9200/products?pretty'

Index and Query Document:

Put Document

curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/products/mobiles/1?pretty' -d'

{

"name": "iPhone 7",

"camera": "12MP",

"storage": "256GB",

"display": "4.7inch",

"battery": "1,960mAh",

"reviews": ["Incredibly happy after having used it for one week", "Best iPhone so far", "Very expensive, stick to Android"]

}

'

In the example above, the '1' in the URL is the document ID, which can be used to reference the document later; '?pretty' is just a query parameter that formats the JSON response.
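How the pieces of that URL compose can be sketched with a small helper (the function name is my own, not part of any client library):

```python
def doc_url(host, index, doc_type, doc_id, pretty=True):
    """Build the document URL used in the curl examples:
    <host>/<index>/<type>/<id>, with ?pretty as an optional query parameter."""
    url = f"{host}/{index}/{doc_type}/{doc_id}"
    return url + "?pretty" if pretty else url

print(doc_url("localhost:9200", "products", "mobiles", 1))
# -> localhost:9200/products/mobiles/1?pretty
```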

(sample movie data from MovieLens: https://grouplens.org/datasets/movielens/)

## Overall Cluster Health

GET /_cat/health?v

## Node Health

GET /_cat/nodes?v

## List Indices

GET /_cat/indices?v

## Create 'sales' Index

PUT /sales

## Add 'order' to 'sales' index

PUT /sales/order/123

{

"orderID":"123",

"orderAmount":"500"

}

## Retrieve document

GET /sales/order/123

## Delete index

DELETE /sales

## List indices

GET /_cat/indices?v

## Bulk Load

curl -H "Content-Type: application/x-ndjson" -XPOST 'localhost:9200/shakespeare/plays/_bulk?pretty&refresh' --data-binary @"shakespeare_6.0.json"

## Query

GET /shakespeare/plays/_search?q=play_name:HAMLET

GET /shakespeare/_search

{

"query": {

"query_string": {

"default_field": "play_name",

"query": "MACBETH"

}

}

}

curl "localhost:9200/_search?q=name:john~1 AND (age:[30 TO 40} OR surname:K*) AND -city"

CREATE INDEX WITH DATE - this is needed to set up Grafana, as the default date field was removed in ES 5

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "date": { "type": "date" }
      }
    }
  }
}

PUT my_index/my_type/1
{ "date": "2015-01-01" }

Creating NEW INDEX with shards and replicas

PUT /new_index

{

"settings":{

"number_of_shards": 10,

"number_of_replicas": 1

}

}

You can use index templates to automatically apply mappings, analyzers, aliases, etc.

Alias rotation example

POST /_aliases
{
  "actions": [
    { "add":    { "alias": "logs_current", "index": "logs_2018_06" } },
    { "remove": { "alias": "logs_current", "index": "logs_2018_05" } },
    { "add":    { "alias": "logs_last_3_months", "index": "logs_2018_06" } },
    { "remove": { "alias": "logs_last_3_months", "index": "logs_2018_03" } }
  ]
}

The first two actions point logs_current at the current month and remove the previous month. The last two maintain logs_last_3_months as a rolling window: the new month is added and the month that fell out of the three-month window is removed.
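The rolling-window logic can be sketched in Python (helper names are my own; the logs_YYYY_MM naming follows the example above):

```python
from datetime import date

def month_index(d):
    """Index name for a month, e.g. logs_2018_06."""
    return f"logs_{d.year}_{d.month:02d}"

def shift_month(d, months):
    """Move a date backward/forward by whole months (day-of-month ignored)."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)

def rotation_actions(today, window=3):
    """Build the _aliases actions for a new month: point logs_current at the
    new index and keep logs_last_3_months covering the last `window` months."""
    current = month_index(today)
    previous = month_index(shift_month(today, -1))
    expired = month_index(shift_month(today, -window))
    return [
        {"add": {"alias": "logs_current", "index": current}},
        {"remove": {"alias": "logs_current", "index": previous}},
        {"add": {"alias": "logs_last_3_months", "index": current}},
        {"remove": {"alias": "logs_last_3_months", "index": expired}},
    ]

print(rotation_actions(date(2018, 6, 1)))
```

The returned list is the `actions` array for a POST to /_aliases; running this once per month keeps both aliases current without clients ever needing to know the concrete index names.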

HARDWARE REQUIREMENT

64GB of RAM per machine is the sweet spot (32GB for the Elasticsearch heap, 32GB left to the OS/disk cache for Lucene).

Don't go above 32GB of heap for Elasticsearch. Elasticsearch is I/O-intensive and memory-intensive, but not CPU-intensive. Don't use NAS. Use medium-to-large machines; one huge box is bad, and too many small boxes are bad too.

HEAP SIZING

The default heap size of ES is 1GB. Allocate half or less of your physical memory to ES; the other half can be used by Lucene for caching. If you are not aggregating on analysed string fields, consider using even less than half for ES. Smaller heaps give faster garbage collection and leave more memory for caching. The heap can be set using

export ES_HEAP_SIZE=10g or

ES_JAVA_OPTS="-Xms10g -Xmx10g" ./bin/elasticsearch
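The sizing rule above can be sketched as a small calculation (the helper name is my own; the 32GB cap exists because the JVM can only use compressed object pointers below roughly that threshold):

```python
def recommended_heap_gb(physical_ram_gb, ceiling_gb=32):
    """Half of physical RAM, capped at ~32GB so the JVM keeps using
    compressed object pointers (the reason for the 32GB rule above)."""
    return min(physical_ram_gb // 2, ceiling_gb)

for ram in (8, 64, 128):
    print(ram, "GB RAM ->", recommended_heap_gb(ram), "GB heap")
```

Note how a 128GB machine still gets a 32GB heap: the extra RAM goes to the OS page cache, which Lucene uses heavily.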

X-PACK for monitoring ES

Install - sudo bin/elasticsearch-plugin install x-pack

Security Disable for X-Pack - sudo vi /etc/elasticsearch/elasticsearch.yml and add xpack.security.enabled: false

The .watcher-history index records watch executions, including all errors happening on ES. To get the logs:

GET .watcher-history*/_search?pretty

{

"sort" : [

{ "result.execution_time": "desc" }

]

}

Running ES on a Cluster

Edit elasticsearch.yml and uncomment cluster.name and node.name. If you want to run multiple instances of ES on the same machine, add node.max_local_storage_nodes: 3

Note: All instances of ES should have the same cluster name and different node names.

When a new node is added to the cluster, shards are automatically redistributed to it. In the event of a node failure, the shards are rebalanced across the remaining nodes automatically.

curl -XGET '<IP>:9200/_cluster/health?pretty' - to get the cluster state

SNAPSHOTS - the backup mechanism provided by ES; backups can be incremental

1. Add a repository - edit elasticsearch.yml and provide the repo path: path.repo: [<path>]

2. Send a PUT command

PUT _snapshot/backup-repo
{
  "type": "fs",
  "settings": {
    "location": "<repo-path>/backup-repo"
  }
}

Once the above steps are followed, snapshot setup is complete. To create a snapshot, use the commands below.

Create a snapshot of all open indices:

PUT _snapshot/backup-repo/snapshot-1

get information about a snapshot:

GET _snapshot/backup-repo/snapshot-1

monitor snapshot progress:

GET _snapshot/backup-repo/snapshot-1/status

restore a snapshot of all indices:

POST /_all/_close # close all indices, so nothing is writing

POST _snapshot/backup-repo/snapshot-1/_restore

Rolling Restart

The main thing to note is that we should disable shard re-allocation before restarting. ES reallocates shards whenever a node goes down; to avoid that churn during a rolling restart, disable shard allocation first.

1. stop indexing new data ( if possible)

2. disable shard allocation

3. shut down one node

4. perform maintenance and restart, confirm it joins the cluster

5. re-enable shard allocation

6. wait for cluster to return to green status

7. repeat step 2-6 for other nodes

8. resume indexing new data
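Steps 2 and 5 are plain cluster-settings calls; a small Python sketch (function name is my own) that builds the transient-settings request bodies:

```python
import json

def allocation_settings(enable):
    """Transient cluster settings body for the rolling restart:
    'none' disables shard allocation, 'all' re-enables it."""
    return {"transient": {"cluster.routing.allocation.enable": enable}}

# Step 2: send this body via PUT _cluster/settings before taking a node down.
print(json.dumps(allocation_settings("none")))
# Step 5: send this body after the node has rejoined the cluster.
print(json.dumps(allocation_settings("all")))
```

Using "transient" (rather than "persistent") means the setting does not survive a full cluster restart, which is usually what you want for a temporary maintenance window.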

Disable shard Allocation

PUT _cluster/settings

{

"transient":{

"cluster.routing.allocation.enable": "none"

}

}

Enable shard Allocation

PUT _cluster/settings

{

"transient":{

"cluster.routing.allocation.enable": "all"

}

}