Elasticsearch
Elasticsearch is one of the most popular open source search technologies; it lets you build and deploy efficient, robust search quickly.
A web crawler crawls across pages, following links as it sees them, in order to create a massive corpus of all documents that exist. Every document found by the crawler is indexed: it is parsed and tokenized, and the individual terms are extracted and stored in a data structure called an inverted index. The inverted index is a mapping from a term to the documents where that term is found. It is "inverted" because it goes from the search term to the document, rather than from the document to its terms. Based on what a user searched for, every document in the corpus is assigned a relevance score.
Inverted Index - the heart of any search engine. The mapping of terms to their frequencies and the corresponding source documents is the inverted index. Dictionary - the terms and their corresponding frequencies are called the dictionary. Postings - the source documents where a term is found; in search terminology this is called the posting list.
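The dictionary/postings structure described above can be sketched in a few lines of Python (an illustrative toy, not how Lucene stores it internally):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = {1: "to be or not to be", 2: "to search is to find"}
index = build_inverted_index(docs)

# The set of terms is the "dictionary"; each term maps to its
# "posting list": the documents it occurs in, with frequencies.
print(index["to"])  # {1: 2, 2: 2}
print(index["be"])  # {1: 2}
```

Given a search term, the engine looks up its posting list directly instead of scanning every document, which is what makes the index "inverted".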
Lucene - ES is built on top of Lucene. Other technologies also use Lucene and have their own ecosystems: Solr, which offers distributed indexing, load balancing, replication, centralised configuration, etc.; Nutch, which provides web crawling and index parsing; CrateDB, an open source distributed SQL database.
Distributed - It runs on multiple nodes within a cluster and can scale to thousands of nodes, which means that the performance of your search operations can scale linearly with the number of nodes that you add to the cluster, an important consideration as your data grows in size.
Highly available and fault tolerant - multiple copies of your data are stored within the cluster; every index is replicated.
REST API - operations such as creating, retrieving, updating, and deleting indices and their corresponding documents, as well as administrative, search, and analysis operations, can all be performed with simple REST calls by passing in data in JSON format.
Query DSL - a domain-specific language that allows you to express very complex queries in a very simple manner using JSON. Elasticsearch is also schema-less: documents can be indexed without defining a mapping up front.
During ES setup, set node.name (the hostname) and cluster.name (the default cluster name is 'elasticsearch'). The way you scale the number of machines in a cluster is to have multiple nodes join the same cluster by specifying the cluster name. Nodes on the same network will automatically discover each other by exchanging messages.
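A minimal elasticsearch.yml sketch for two nodes joining the same cluster (the cluster and node names are illustrative):

```yaml
# /etc/elasticsearch/elasticsearch.yml on the first machine
cluster.name: my-search-cluster   # must be identical on every node
node.name: node-1                 # must be unique per node

# On the second machine only node.name changes:
# cluster.name: my-search-cluster
# node.name: node-2
```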
INDEX
The index in Elasticsearch refers to a collection of similar documents. Documents need not be exactly the same. An index within a cluster is uniquely identified by its name, and you can have any number of indices for the same cluster.
Typically, the way you'd set up your indices is to logically break up your data based on what you search within it. Say in an E-commerce site, you have your catalog in one index. You might have all your customer information in a different index; the customer information might be internal. Inventory management for your E-commerce site might be a third index. Design your indices based on broad top-level categories. Within one index, you can have multiple document types; think of these as a logical partitioning of documents within the index. For example, on a blogging site, blog posts might be one type of document and blog comments a different type.
Document types are user defined. They're based on the semantics of your website. Typically, all documents which have the same set of fields belong to one type. A document type is a collection of documents with the same characteristics within a larger index
DOCUMENT
The document is simply a container of text that needs to be searchable. A document in Elasticsearch is expressed using JSON, the JavaScript object notation, which is a standard format for wire transfer from a client to a server or vice versa. All documents in Elasticsearch reside within an index. In fact, every document requires a document type, as well as an index that it belongs to. The index and the document type are together used to identify a set of documents
If the number of documents that you want to search is very large, or the size of each document is huge, then your index may be too large to fit on a single node. A single node has other disadvantages as well. It's possible that if you try to serve all your search requests from one node, your search operation will be incredibly slow.
The way to solve both of these problems is to split your index across multiple machines in the cluster, a process called sharding your data. The contents of the index are split across multiple nodes, each of which holds one shard of the index; every node then contains only a subset of your index data.
An index in Elasticsearch can be split up across multiple machines, and individual portions of the index on these machines are called shards. A shard can be replicated zero or more times. You can have as many replicas as you think you need. By default, an index in Elasticsearch has five shards and one replica. So every shard has one backup copy.
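The shard arithmetic follows directly: the cluster must host primaries × (1 + replicas) shards in total. A quick sketch:

```python
def total_shards(primary_shards, replicas):
    # Each primary shard is copied once per configured replica.
    return primary_shards * (1 + replicas)

# The default of five primary shards with one replica each:
print(total_shards(5, 1))  # 10
```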
Each shard is a self-contained Lucene index of its own.
Monitoring Health
Cluster status of Green - all primary shards and replicas are allocated and functional
Yellow - functional, but one or more replicas are not allocated
Red - not fully functional: one or more primary shards are not allocated
Format of a REST API command
<REST Verb> <Index> <Type> <ID>
curl localhost:9200 - provides the system status of ES
curl -XGET 'localhost:9200/_cat/nodes?v&pretty' - provides the health of the ES nodes
List indices - curl -XGET 'localhost:9200/_cat/indices?v&pretty'
Note: PUT is idempotent; it returns the same result even if run several times. POST is not idempotent: it is typically used to create documents with auto-generated IDs, or for updates.
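For example (index and type names are illustrative), a PUT to a fixed document ID can be re-run safely, while each POST to the type endpoint creates a new document with an auto-generated ID:

```
PUT /products/mobiles/1        # re-running overwrites document 1 (idempotent)
{ "name": "iPhone 7" }

POST /products/mobiles         # each call creates a new document with a new ID
{ "name": "iPhone 7" }
```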
CRUD OPERATION
Create Index - curl -XPUT 'localhost:9200/products?pretty'
Index and Query Document:
Put Document
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/products/mobiles/1?pretty' -d'
{
"name": "iPhone 7",
"camera": "12MP",
"storage": "256GB",
"display": "4.7inch",
"battery": "1,960mAh",
"reviews": ["Incredibly happy after having used it for one week", "Best iPhone so far", "Very expensive, stick to Android"]
}
'
In the above example, the '1' in the URL is the document ID, which can be used to reference the document; '?pretty' is a query parameter that pretty-prints the JSON response.
(sample movie data available from MovieLens: https://grouplens.org/datasets/movielens/)
## Overall Cluster Health
GET /_cat/health?v
## Node Health
GET /_cat/nodes?v
## List Indices
GET /_cat/indices?v
## Create 'sales' Index
PUT /sales
## Add 'order' to 'sales' index
PUT /sales/order/123
{
"orderID":"123",
"orderAmount":"500"
}
## Retrieve document
GET /sales/order/123
## Delete index
DELETE /sales
## List indices
GET /_cat/indices?v
## Load Bulk Data
curl -H "Content-Type: application/x-ndjson" -XPOST 'localhost:9200/shakespeare/plays/_bulk?pretty&refresh' --data-binary @"shakespeare_6.0.json"
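The bulk file itself is newline-delimited JSON: each document is a pair of lines, an action line naming the index/type/ID followed by the document source. A minimal hand-written example (the field values are illustrative):

```
{"index":{"_index":"shakespeare","_type":"plays","_id":"1"}}
{"play_name":"HAMLET","speaker":"HAMLET","text_entry":"To be, or not to be"}
{"index":{"_index":"shakespeare","_type":"plays","_id":"2"}}
{"play_name":"MACBETH","speaker":"MACBETH","text_entry":"Is this a dagger which I see before me"}
```

The file must end with a newline, which is why `--data-binary` (not `-d`) is used above.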
## Query
GET /shakespeare/plays/_search?q=play_name:HAMLET
GET /shakespeare/_search
{
"query": {
"query_string": {
"default_field": "play_name",
"query": "MACBETH"
}
}
}
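The same search can also be expressed with a match query, the more common form in the Query DSL:

```
GET /shakespeare/_search
{
  "query": {
    "match": { "play_name": "MACBETH" }
  }
}
```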
curl "localhost:9200/_search?q=name:john~1 AND (age:[30 TO 40} OR surname:K*) AND -city"
(a fuzzy match, a range with an exclusive upper bound, a wildcard, and a negation)
CREATE INDEX WITH DATE - this is needed to set up Grafana, as the default date field was removed in ES 5
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "date": { "type": "date" }
      }
    }
  }
}
PUT my_index/my_type/1
{ "date": "2015-01-01" }
Creating NEW INDEX with shards and replicas
PUT /new_index
{
"settings":{
"number_of_shards": 10,
"number_of_replicas": 1
}
}
You can use index templates to automatically apply mappings, settings, aliases, etc. to newly created indices whose names match a pattern.
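A sketch of an index template (names and patterns are illustrative; the exact API differs between ES versions, e.g. index_patterns was introduced in ES 6):

```
PUT /_template/logs_template
{
  "index_patterns": ["logs_*"],
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "aliases": { "logs_all": {} }
}
```

Any index whose name matches logs_* (e.g. logs_2018_07) will pick up these settings and aliases on creation.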
Alias rotation example - point logs_current at the new month and rotate the oldest month out of a rolling three-month alias:
POST /_aliases
{
  "actions": [
    { "add":    { "alias": "logs_current",       "index": "logs_2018_06" } },
    { "remove": { "alias": "logs_current",       "index": "logs_2018_05" } },
    { "add":    { "alias": "logs_last_3_months", "index": "logs_2018_06" } },
    { "remove": { "alias": "logs_last_3_months", "index": "logs_2018_03" } }
  ]
}
(adds the current month to both aliases, removes the previous month from logs_current, and drops the month that fell outside the three-month window)
HARDWARE REQUIREMENTS
64 GB per machine is the sweet spot (32 GB for Elasticsearch, 32 GB for the OS/disk cache used by Lucene).
Don't go above 32 GB of heap for Elasticsearch. Elasticsearch is I/O intensive and memory intensive, but not CPU intensive. Don't use NAS. Use medium-to-large machines; too big is bad, and too many small boxes is bad too.
HEAP SIZING
The default heap size of ES is 1 GB. Half or less of your physical memory should be allocated to ES; the other half can be used by Lucene for caching. If you are not aggregating on analysed string fields, consider using less than half for ES. Smaller heaps result in faster garbage collection and more memory for caching. The heap can be set using
export ES_HEAP_SIZE=10g or
ES_JAVA_OPTS="-Xms10g -Xmx10g" ./bin/elasticsearch
X-PACK for monitoring ES
Install - sudo bin/elasticsearch-plugin install x-pack
Disable X-Pack security - sudo vi /etc/elasticsearch/elasticsearch.yml and add xpack.security.enabled: false
The watcher history index keeps a record of all watch executions (including errors) on ES. To get the logs:
GET .watcher-history*/_search?pretty
{
"sort" : [
{ "result.execution_time": "desc" }
]
}
Running ES on Cluster
Edit elasticsearch.yml and uncomment cluster.name and node.name. If you want to run multiple instances of ES on the same machine, add node.max_local_storage_nodes: 3
Note: All instances of ES should have the same cluster name and different node names.
When a new node is added to the cluster, shards are automatically redistributed to it. In the event of a node failure, shards are rebalanced across the remaining nodes automatically.
curl -XGET '<IP>:9200/_cluster/health?pretty' - to get the cluster state
SNAPSHOTS - backup mechanism provided by ES, can be incremental backup
1. Add a repository - edit elasticsearch.yml and add the repository path: path.repo: [<path>]
2. Send a PUT command
PUT _snapshot/backup-repo
{
  "type": "fs",
  "settings": {
    "location": "<repo-path>/backup-repo"
  }
}
Once the above steps are complete, the snapshot setup is done. To create a snapshot, use the commands below.
Create a snapshot of all open indices:
PUT _snapshot/backup-repo/snapshot-1
get information about a snapshot:
GET _snapshot/backup-repo/snapshot-1
monitor snapshot progress:
GET _snapshot/backup-repo/snapshot-1/status
restore a snapshot of all indices:
POST /_all/_close # close all indices, so no one is writing
POST _snapshot/backup-repo/snapshot-1/_restore
Rolling Restart
The main thing to note is that shard reallocation should be disabled during a restart. ES will reallocate shards if a node goes down; to prevent this during a rolling restart, disable shard allocation first.
1. stop indexing new data ( if possible)
2. disable shard allocation
3. shut down one node
4. perform maintenance and restart, confirm it joins the cluster
5. re-enable shard allocation
6. wait for cluster to return to green status
7. repeat steps 2-6 for the other nodes
8. resume indexing new data
Disable shard Allocation
PUT _cluster/settings
{
"transient":{
"cluster.routing.allocation.enable": "none"
}
}
Enable shard Allocation
PUT _cluster/settings
{
"transient":{
"cluster.routing.allocation.enable": "all"
}
}