XML indexing using ElasticSearch

Download a sample of MEDLINE XML data: https://goo.gl/LjByLf

See an example of 2 MEDLINE documents.

Configure FileBeat (/etc/filebeat/filebeat.yml)

filebeat.prospectors:

# Each - is a prospector. Most options can be set at the prospector level, so
# you can use different prospectors for various configurations.
# Below are the prospector specific configurations.

- input_type: log

  # Enable this prospector configuration.
  enabled: true

  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - ~/MEDLINE/*.xml
  document_type: message

  ### Multiline options

  # Multiline can be used for log messages spanning multiple lines. This is common
  # for Java Stack Traces or C-Line Continuation.

  # The regexp pattern that has to be matched. The example pattern matches all
  # lines starting with <PubmedArticle>.
  multiline.pattern: '^[\s]*<PubmedArticle>'

  # Defines if the pattern set under pattern should be negated or not. Default is false.
  multiline.negate: true

  # Match can be set to "after" or "before". It is used to define if lines should be
  # appended to a pattern that was (not) matched before or after, or as long as a
  # pattern is not matched, based on negate.
  # Note: "after" is the equivalent of "previous" and "before" is the equivalent of
  # "next" in Logstash.
  multiline.match: after

#================================ Outputs =====================================

# Configure what outputs to use when sending the data collected by the beat.
# Multiple outputs may be used.

#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ["localhost:5044"]

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ["/etc/pki/tls/certs/logstash-forwarder.crt"]

  # Certificate for SSL client authentication
  #ssl.certificate: "/etc/pki/client/cert.pem"

  # Client Certificate Key
  #ssl.key: "/etc/pki/client/cert.key"
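The multiline options are easier to follow with a small simulation. The sketch below is not Filebeat code; it only imitates, in Python, what negate: true combined with match: after is meant to do: a line matching the pattern starts a new event, and every non-matching line is appended to the event before it. The function name group_multiline and the sample lines are made up for illustration.

```python
import re

# Same regular expression as multiline.pattern in the Filebeat configuration.
PATTERN = re.compile(r"^[\s]*<PubmedArticle>")


def group_multiline(lines, pattern=PATTERN):
    """Imitate negate: true / match: after.

    A line that matches the pattern starts a new event; a line that does
    NOT match is appended to the event that precedes it.
    """
    events = []
    current = []
    for line in lines:
        # A matching line closes the event collected so far.
        if pattern.search(line) and current:
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events


# Hypothetical input: two very small records spread over several lines.
sample = [
    "<PubmedArticle>",
    "  <MedlineCitation><PMID>1</PMID></MedlineCitation>",
    "</PubmedArticle>",
    "  <PubmedArticle>",
    "  <MedlineCitation><PMID>2</PMID></MedlineCitation>",
    "</PubmedArticle>",
]
events = group_multiline(sample)
print(len(events))
```

In this sketch, any lines that come before the first <PubmedArticle> (an XML declaration or a wrapping root element, for instance) end up in an event of their own, and trailing lines are attached to the last article. The real pipeline may need to tolerate such events.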

Configure Logstash (/etc/logstash/conf.d/MEDLINE.conf)

input {
  beats {
    port => "5044"
  }
}

# Extract the fields of each <PubmedArticle> event with XPath expressions.
filter {
  xml {
    source => "message"
    store_xml => false
    xpath => [
      "/PubmedArticle/MedlineCitation/PMID/text()", "[identifier][value]",
      "/PubmedArticle/MedlineCitation/Article/Language/text()", "[article][lang]",
      "/PubmedArticle/MedlineCitation/Article/Journal/Title/text()", "[journal][title]",
      "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/Volume/text()", "[journal][volume]",
      "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/Issue/text()", "[journal][issue]",
      "/PubmedArticle/MedlineCitation/Article/Pagination/text()", "[article][pagination]",
      "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year/text()", "[journal][year_pub]",
      "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Month/text()", "[journal][month_pub]",
      "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Day/text()", "[journal][day_pub]",
      "/PubmedArticle/MedlineCitation/Article/ArticleTitle/text()", "[article][title]",
      "/PubmedArticle/MedlineCitation/Article/Abstract/AbstractText/text()", "[article][abstract]",
      "/PubmedArticle/MedlineCitation/Article/AuthorList/Author/LastName/text()", "[author][lastname]",
      "/PubmedArticle/MedlineCitation/Article/AuthorList/Author/ForeName/text()", "[author][firstname]",
      "/PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName/text()", "[mesh][headings]"
    ]
  }

  # Add constant fields; they are used below to build the document id.
  mutate {
    add_field => {
      "[identifier][type]" => "pmid"
      "collection" => "medline"
    }
  }
}

output {
  #stdout { codec => rubydebug }
  elasticsearch {
    hosts => [ "localhost:9200" ]
    index => "publications-en"
    document_id => "%{collection}-%{[identifier][value]}"
  }
}
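To check the XPath mappings before running the full pipeline, the extraction can be tried offline on a single event. The following is a minimal sketch using Python's standard xml.etree.ElementTree, not the Logstash xml filter itself. It covers only four of the mappings, writes the paths relative to the <PubmedArticle> root, and assumes that each extracted field holds a list of values. The sample record is invented, and extract_fields and document_id are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# A subset of the XPath -> field mappings from the Logstash filter,
# written relative to the <PubmedArticle> root element.
MAPPINGS = {
    "MedlineCitation/PMID": ("identifier", "value"),
    "MedlineCitation/Article/ArticleTitle": ("article", "title"),
    "MedlineCitation/Article/AuthorList/Author/LastName": ("author", "lastname"),
    "MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName": ("mesh", "headings"),
}

# Invented record, for illustration only (not real MEDLINE data).
SAMPLE = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <ArticleTitle>An invented title</ArticleTitle>
      <AuthorList>
        <Author><LastName>Doe</LastName></Author>
        <Author><LastName>Roe</LastName></Author>
      </AuthorList>
    </Article>
    <MeshHeadingList>
      <MeshHeading><DescriptorName>Neoplasms</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>
""".strip()


def extract_fields(event_xml):
    """Build a nested document from one <PubmedArticle> event."""
    root = ET.fromstring(event_xml)
    # Constant fields, mirroring the add_field block of the pipeline.
    doc = {"collection": "medline", "identifier": {"type": "pmid"}}
    for path, (group, name) in MAPPINGS.items():
        values = [el.text for el in root.findall(path) if el.text]
        if values:
            doc.setdefault(group, {})[name] = values
    return doc


def document_id(doc):
    """Mirror the document_id pattern: collection, a dash, then the PMID."""
    return "{}-{}".format(doc["collection"], ",".join(doc["identifier"]["value"]))


doc = extract_fields(SAMPLE)
print(document_id(doc))
```

Because the document id is derived from the collection name and the PMID, indexing the same record again should overwrite the existing document instead of creating a duplicate.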

Start elasticsearch

$sudo service elasticsearch start

Create an index named 'publications-en'

curl -XPUT 'localhost:9200/publications-en?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "index_filter": {
          "type": "common_grams",
          "common_words": "_english_"
        },
        "search_filter": {
          "type": "common_grams",
          "common_words": "_english_",
          "query_mode": true
        }
      },
      "analyzer": {
        "index_grams": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "index_filter" ]
        },
        "search_grams": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "search_filter" ]
        }
      }
    }
  }
}
'
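The index_filter above relies on the common_grams token filter. As a rough illustration of the idea, and not of Elasticsearch's actual implementation, the sketch below emits each token and additionally a bigram whenever a token or its successor is a common word. The two-word set stands in for the much longer _english_ stop-word list, query_mode is not modelled, and the function name common_grams_index is hypothetical.

```python
def common_grams_index(tokens, common_words):
    """Index-time common grams: keep every token, and add a bigram
    (joined with "_") for each adjacent pair that involves a common word."""
    output = []
    for i, token in enumerate(tokens):
        output.append(token)
        if i + 1 < len(tokens):
            following = tokens[i + 1]
            if token in common_words or following in common_words:
                output.append(token + "_" + following)
    return output


# Tiny stand-in for the "_english_" stop-word list (assumption).
COMMON = {"the", "is"}

grams = common_grams_index(["the", "quick", "fox", "is", "brown"], COMMON)
print(grams)
```

Keeping the bigrams in the index lets phrase queries that contain very frequent words match on the rarer combined term instead of on the frequent word alone.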

You can also define a template that will be applied for the newly created index.

Start logstash

$su

$cd /usr/share/logstash

$bin/logstash -f /etc/logstash/conf.d/medline-pipeline.conf --config.reload.automatic

Start filebeat

$su

$cd /usr/share/filebeat

$bin/filebeat -e -c /etc/filebeat/filebeat-medline.yml -d "publish"

Check indices

http://localhost:9200/_cat/indices?v

health status index           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   publications-en RJO3nDP5TqizPeYqF_3knA   5   1        178            0        1mb            1mb

If the 'publications-en' index exists, try to run a query to retrieve some results

A basic query in ES

http://localhost:9200/_search?q=query&pretty=true&size=50

Search for documents containing the phrase 'public health' in the article title

http://localhost:9200/_search?q=article.title:%22public+health%22&pretty=true&size=50

Search for documents for which the author's last name is "Duggan"

http://localhost:9200/_search?q=author.lastname:Duggan&pretty=true&size=50

Search for documents whose MeSH headings mention 'Neoplasms'

http://localhost:9200/_search?q=mesh.headings:neoplasms&pretty=true&size=50

List documents in the MEDLINE collection

http://localhost:9200/_search?q=collection:medline&pretty=true&size=50
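In the query-string syntax a field is separated from its value by a colon (field:value), and the whole expression has to be URL-encoded. The sketch below builds such URLs with Python's standard urllib.parse; it only constructs the strings and does not contact Elasticsearch. The helper name search_url and its phrase flag are assumptions made for this example.

```python
from urllib.parse import urlencode

BASE = "http://localhost:9200/_search"


def search_url(field, value, size=50, phrase=False):
    """Build a URI search URL of the form ?q=field:value.

    With phrase=True the value is wrapped in double quotes so that it is
    searched as a phrase rather than as separate words.
    """
    term = '"{}"'.format(value) if phrase else value
    query = "{}:{}".format(field, term)
    return BASE + "?" + urlencode({"q": query, "pretty": "true", "size": size})


print(search_url("author.lastname", "Duggan"))
print(search_url("article.title", "public health", phrase=True))
```

The colon and the quotes appear percent-encoded (%3A and %22) in the generated URLs; Elasticsearch decodes them before parsing the query.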

A Shell script to refresh the pipeline FileBeat -> Logstash -> indexing with ES

---

#!/bin/bash

# remove Filebeat's data registry so that the files are read again
rm -rf /usr/share/filebeat/bin/data/registry

# remove the index created above
curl -XDELETE 'localhost:9200/publications-en?pretty'

# run filebeat
/usr/share/filebeat/bin/filebeat -e -c /etc/filebeat/filebeat-medline.yml -d "publish"

Returns the results with keywords highlighted in the article title:

curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d'{ "query" : {"match": { "article.title": "DNA" }}, "highlight" : { "fields" : { "article.title" : {}}}}'

Multi_match: match keywords in multiple fields

curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d'{ "query" : { "multi_match": { "query": "pharmacology", "fields": ["article.title", "article.abstract"] } }, "highlight" : { "fields" : { "article.title": {}, "article.abstract" : {}}}}'
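The two request bodies above differ only in the query clause, so they can be generated from one helper. This is a minimal sketch that builds the JSON body in Python and prints it; sending it to localhost:9200 is left out. The function name highlight_query is hypothetical.

```python
import json


def highlight_query(text, fields):
    """Build a search body that matches `text` in `fields` and asks for
    highlighting on the same fields.

    One field yields a match query; several fields yield a multi_match query.
    """
    fields = list(fields)
    if len(fields) == 1:
        query = {"match": {fields[0]: text}}
    else:
        query = {"multi_match": {"query": text, "fields": fields}}
    return {"query": query, "highlight": {"fields": {f: {} for f in fields}}}


body = highlight_query("pharmacology", ["article.title", "article.abstract"])
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with the -d option exactly as in the commands above.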

References:

https://www.elastic.co/guide/en/beats/filebeat/current/multiline-examples.html

https://stackoverflow.com/questions/24552512/multiline-pattern-for-logstash

https://www.elastic.co/guide/en/logstash/current/plugins-codecs-multiline.html#plugins-codecs-multiline-negate

https://www.elastic.co/guide/en/elasticsearch/guide/master/common-grams.html
