Download a sample of MEDLINE XML data: https://goo.gl/LjByLf
See an example of 2 MEDLINE documents.
Configure Filebeat (/etc/filebeat/filebeat-medline.yml)
filebeat.prospectors:

# Each - is a prospector. Most options can be set at the prospector level, so
# you can use different prospectors for various configurations.
# Below are the prospector specific configurations.

- input_type: log

  # Enable this prospector configuration.
  enabled: true

  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - ~/MEDLINE/*.xml

  document_type: message

  ### Multiline options

  # Multiline can be used for log messages spanning multiple lines. This is common
  # for Java stack traces or C line continuation.

  # The regexp pattern that has to be matched. This pattern matches all lines
  # starting with <PubmedArticle> (optionally preceded by whitespace).
  multiline.pattern: '^[\s]*<PubmedArticle>'

  # Defines if the pattern set under pattern should be negated or not. Default is false.
  multiline.negate: true

  # Match can be set to "after" or "before". It defines whether lines are appended to
  # the pattern that was (not) matched before or after, or for as long as the pattern
  # is not matched, depending on negate.
  # Note: "after" is the equivalent of "previous" and "before" is the equivalent of
  # "next" in Logstash.
  multiline.match: after

#================================ Outputs =====================================

# Configure what outputs to use when sending the data collected by the beat.
# Multiple outputs may be used.

#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ["localhost:5044"]

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ["/etc/pki/tls/certs/logstash-forwarder.crt"]

  # Certificate for SSL client authentication
  #ssl.certificate: "/etc/pki/client/cert.pem"

  # Client Certificate Key
  #ssl.key: "/etc/pki/client/cert.key"
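The effect of the multiline options (negate: true, match: after) can be illustrated with a small Python sketch. This is only an approximation of the grouping rule, not Filebeat's own code; the function name group_events and the sample lines are made up for the illustration.

```python
import re

# Same regular expression as multiline.pattern in the Filebeat configuration.
START = re.compile(r'^[\s]*<PubmedArticle>')


def group_events(lines):
    """Approximate multiline.negate: true with multiline.match: after.

    A line that matches START opens a new event; every line that does
    not match is appended to the event that was opened before it.
    """
    events, current = [], []
    for line in lines:
        if START.search(line) and current:
            # A new article starts: close the event collected so far.
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events


# Made-up input lines (hypothetical PMIDs), two articles.
sample = [
    "<PubmedArticle>",
    "  <MedlineCitation><PMID>1</PMID></MedlineCitation>",
    "</PubmedArticle>",
    "  <PubmedArticle>",
    "  <MedlineCitation><PMID>2</PMID></MedlineCitation>",
    "</PubmedArticle>",
]
events = group_events(sample)
for number, event in enumerate(events, start=1):
    print("event", number)
    print(event)
```

Each element of events corresponds to what Logstash would receive in the "message" field of one event.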
Configure Logstash (/etc/logstash/conf.d/medline-pipeline.conf)
input {
  beats {
    port => "5044"
  }
}
# The filter parses each multiline event (one <PubmedArticle> element) and
# copies the values selected by the XPath expressions into event fields.
filter {
  xml {
    source => "message"
    store_xml => false
    xpath => [
      "/PubmedArticle/MedlineCitation/PMID/text()", "[identifier][value]",
      "/PubmedArticle/MedlineCitation/Article/Language/text()", "[article][lang]",
      "/PubmedArticle/MedlineCitation/Article/Journal/Title/text()", "[journal][title]",
      "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/Volume/text()", "[journal][volume]",
      "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/Issue/text()", "[journal][issue]",
      "/PubmedArticle/MedlineCitation/Article/Pagination/text()", "[article][pagination]",
      "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year/text()", "[journal][year_pub]",
      "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Month/text()", "[journal][month_pub]",
      "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Day/text()", "[journal][day_pub]",
      "/PubmedArticle/MedlineCitation/Article/ArticleTitle/text()", "[article][title]",
      "/PubmedArticle/MedlineCitation/Article/Abstract/AbstractText/text()", "[article][abstract]",
      "/PubmedArticle/MedlineCitation/Article/AuthorList/Author/LastName/text()", "[author][lastname]",
      "/PubmedArticle/MedlineCitation/Article/AuthorList/Author/ForeName/text()", "[author][firstname]",
      "/PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName/text()", "[mesh][headings]"
    ]
  }
  # Static fields are added with the mutate filter.
  mutate {
    add_field => {
      "[identifier][type]" => "pmid"
      "collection" => "medline"
    }
  }
}
output {
  #stdout { codec => rubydebug }
  elasticsearch {
    hosts => [ "localhost:9200" ]
    index => "publications-en"
    document_id => "%{collection}-%{[identifier][value]}"
  }
}
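To check the XPath expressions against a record before running the pipeline, the extraction can be reproduced offline. The following Python sketch uses the standard library's ElementTree on a made-up record; it is an approximation, not the Logstash xml filter itself. The function name extract, the subset of fields and the sample values are assumptions for the illustration, and el.text is used in place of text().

```python
import xml.etree.ElementTree as ET

# A subset of the XPath-to-field pairs from the Logstash filter.
# Paths are written relative to the <PubmedArticle> root element.
FIELDS = {
    "MedlineCitation/PMID": ("identifier", "value"),
    "MedlineCitation/Article/ArticleTitle": ("article", "title"),
    "MedlineCitation/Article/AuthorList/Author/LastName": ("author", "lastname"),
    "MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName": ("mesh", "headings"),
}


def extract(xml_text):
    """Return a nested dict with one list of values per matched path."""
    root = ET.fromstring(xml_text)
    doc = {}
    for path, (group, name) in FIELDS.items():
        values = [el.text for el in root.findall(path) if el.text]
        if values:  # a path without matches creates no field
            doc.setdefault(group, {})[name] = values
    return doc


# Made-up record, only to exercise the paths.
sample = """<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <ArticleTitle>An example title</ArticleTitle>
      <AuthorList>
        <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
        <Author><LastName>Roe</LastName><ForeName>John</ForeName></Author>
      </AuthorList>
    </Article>
    <MeshHeadingList>
      <MeshHeading><DescriptorName>Neoplasms</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>"""

doc = extract(sample)
doc["collection"] = "medline"
# Mirrors document_id => "%{collection}-%{[identifier][value]}" for a single PMID.
doc_id = "{}-{}".format(doc["collection"], doc["identifier"]["value"][0])
print(doc_id)
print(doc)
```

Values are kept as lists because a path such as Author/LastName can match several elements in one article.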
Start elasticsearch
$ sudo service elasticsearch start
Create an index named 'publications-en'
curl -XPUT 'localhost:9200/publications-en?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "index_filter": {
          "type": "common_grams",
          "common_words": "_english_"
        },
        "search_filter": {
          "type": "common_grams",
          "common_words": "_english_",
          "query_mode": true
        }
      },
      "analyzer": {
        "index_grams": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "index_filter" ]
        },
        "search_grams": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "search_filter" ]
        }
      }
    }
  }
}
'
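Once the index exists, the custom analyzers can be inspected through the _analyze API. The Python sketch below builds such a request with the standard library; it assumes Elasticsearch is reachable at localhost:9200 and that the index was created as above. The helper names build_analyze_request and analyze, and the sample sentence, are made up for the illustration.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="index_grams",
                          index="publications-en",
                          host="http://localhost:9200"):
    """Build (without sending) a POST request for <index>/_analyze."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url="{}/{}/_analyze".format(host, index),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, **kwargs):
    """Send the request and return the token strings (needs a running node)."""
    request = build_analyze_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        return [token["token"] for token in json.load(response)["tokens"]]


# Building the request does not touch the network.
req = build_analyze_request("the quick and brown fox")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# With Elasticsearch running, compare the two analyzers, for example:
#   analyze("the quick and brown fox")
#   analyze("the quick and brown fox", analyzer="search_grams")
```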
You can also define an index template so that these settings are applied automatically to newly created indices.
Start logstash
$ su
$ cd /usr/share/logstash
$ bin/logstash -f /etc/logstash/conf.d/medline-pipeline.conf --config.reload.automatic
Start filebeat
$ su
$ cd /usr/share/filebeat
$ bin/filebeat -e -c /etc/filebeat/filebeat-medline.yml -d "publish"
Check indices
http://localhost:9200/_cat/indices?v
health status index           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   publications-en RJO3nDP5TqizPeYqF_3knA   5   1        178            0        1mb            1mb
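The same check can be scripted. The Python sketch below parses the text returned by _cat/indices?v into dictionaries, using the output shown above as sample input. The function name parse_cat_indices is made up, and the whitespace split assumes that every column has a value, which may not hold for closed indices.

```python
def parse_cat_indices(text):
    """Turn the tabular output of _cat/indices?v into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


# Output copied from the check above.
sample = """health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open publications-en RJO3nDP5TqizPeYqF_3knA 5 1 178 0 1mb 1mb"""

rows = parse_cat_indices(sample)
row = next(r for r in rows if r["index"] == "publications-en")
print(row["health"], row["docs.count"])
```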
If the 'publications-en' index exists, try to run a query to retrieve some results
A basic query in ES
http://localhost:9200/_search?q=query&pretty=true&size=50
Search for documents containing the phrase 'public health' in the article title (in the q parameter, field and value are separated by a colon)
http://localhost:9200/_search?q=article.title:"public+health"&pretty=true&size=50
Search for documents for which the author's lastname is "Duggan"
http://localhost:9200/_search?q=author.lastname:Duggan&pretty=true&size=50
Search for documents whose MeSH headings mention 'Neoplasms'
http://localhost:9200/_search?q=mesh.headings:neoplasms&pretty=true&size=50
List documents in the MEDLINE collection
http://localhost:9200/_search?q=collection:medline&pretty=true&size=50
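When these URLs are generated from a program rather than typed into a browser, the query string should be percent-encoded. A minimal Python sketch, assuming the same host and the field names produced by the Logstash filter; the function name search_url is made up for the illustration.

```python
from urllib.parse import urlencode


def search_url(query, host="http://localhost:9200", size=50):
    """Build a URI search URL; query uses the field:value syntax."""
    params = {"q": query, "pretty": "true", "size": size}
    return "{}/_search?{}".format(host, urlencode(params))


print(search_url("collection:medline"))
print(search_url('article.title:"public health"'))
print(search_url("author.lastname:Duggan"))
```

urlencode escapes the colon and the quotes and turns spaces into '+', so the phrase query reaches Elasticsearch unchanged.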
A shell script to refresh the pipeline Filebeat -> Logstash -> indexing with ES
---
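#!/bin/bash

# Remove Filebeat's data registry so that the files are read again from the start
rm -rf /usr/share/filebeat/bin/data/registry

# Remove the index that holds the MEDLINE documents
# (this also drops its analysis settings; re-create the index before re-indexing if you need them)
curl -XDELETE 'localhost:9200/publications-en?pretty'

# Run Filebeat
/usr/share/filebeat/bin/filebeat -e -c /etc/filebeat/filebeat-medline.yml -d "publish"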
Returns the results with keywords highlighted in the article title:
curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "article.title": "DNA" } },
  "highlight": { "fields": { "article.title": {} } }
}
'
Multi_match: match keywords in multiple fields
curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "multi_match": {
      "query": "pharmacology",
      "fields": ["article.title", "article.abstract"]
    }
  },
  "highlight": {
    "fields": { "article.title": {}, "article.abstract": {} }
  }
}
'
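The highlighted fragments come back in a "highlight" object next to each hit. The Python sketch below builds the same multi_match body and reads the fragments from a response. The function names multi_match_body and highlights are made up, and the response shown is a hand-written stand-in with a hypothetical document, not real output.

```python
import json


def multi_match_body(keywords, fields):
    """Build a multi_match query that highlights every searched field."""
    return {
        "query": {"multi_match": {"query": keywords, "fields": list(fields)}},
        "highlight": {"fields": {field: {} for field in fields}},
    }


def highlights(response):
    """Yield (document id, field, fragment) for each highlighted fragment."""
    for hit in response["hits"]["hits"]:
        for field, fragments in hit.get("highlight", {}).items():
            for fragment in fragments:
                yield hit["_id"], field, fragment


body = multi_match_body("pharmacology", ["article.title", "article.abstract"])
print(json.dumps(body, indent=2))

# Hand-written stand-in for a search response (hypothetical document).
fake_response = {
    "hits": {
        "hits": [
            {
                "_id": "medline-1",
                "highlight": {"article.title": ["Clinical <em>pharmacology</em>"]},
            }
        ]
    }
}
found = list(highlights(fake_response))
print(found)
```

By default Elasticsearch wraps the matched terms in <em> tags, which is what the stand-in fragment imitates.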
References:
https://www.elastic.co/guide/en/beats/filebeat/current/multiline-examples.html
https://stackoverflow.com/questions/24552512/multiline-pattern-for-logstash
https://www.elastic.co/guide/en/elasticsearch/guide/master/common-grams.html