Optimizing Document Transfer to ElasticSearch Using MongoDB River

Post date: Jun 2, 2013 8:05:50 AM

If your data is already stored in MongoDB and you want to benefit from the full text search capabilities of Elastic Search, you can use the MongoDB River Plugin located at https://github.com/richardwilly98/elasticsearch-river-mongodb to bulk transfer data between these two systems. The architecture consists of a MongoDB replica set with 3 nodes and an Elastic Search cluster that also contains 3 nodes. These 2 clusters can be launched locally. In order to transfer data between them, you need to install the aforementioned plugin and then perform the following steps:

1) Create the mapping configuration for your "_river" by specifying the MongoDB database, collection and credentials, as well as the name of the Elastic Search index that is to be created for indexing the MongoDB documents. By default the bulk size used by the river indexer is 100 documents per second. This setting works on a best effort basis: in some seconds fewer documents than the configured number may be processed, depending on the actual number of documents available and on their size. Such a mongo_transfer_config.json configuration would look like this:

{
  "type": "mongodb",
  "mongodb": {
    "servers": [
      { "host": "localhost", "port": 27111 },
      { "host": "localhost", "port": 27112 },
      { "host": "localhost", "port": 27113 }
    ],
    "credentials": [
      { "db": "admin", "user": "adminUser", "password": "adminPassword" }
    ],
    "db": "library",
    "collection": "books"
  },
  "index": {
    "name": "mongo_books"
  }
}
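For reference, the river reads documents from the library.books collection through the replica set oplog. A document to be transferred could be inserted from the shell like this (the fields title, author and year are purely illustrative and are not required by the plugin):

mongo --port 27111 library --eval 'db.books.insert({ title: "Example Book", author: "Jane Doe", year: 2013 })'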

2) Create the _river entry (here named mongolink) using the following command:

curl -XPUT 'localhost:9200/_river/mongolink/_meta' -d @../config/mongo_transfer_config.json
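If you want to confirm that the configuration was stored as intended, the _meta document can simply be fetched back with a plain document GET (nothing specific to the plugin):

curl -XGET 'localhost:9200/_river/mongolink/_meta?pretty'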

3) Verify that the _river has been created successfully:

curl -XGET 'localhost:9200/_river/mongolink/_status?pretty'
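Since rivers are stored as regular documents in the _river index, all rivers registered on the cluster can also be listed with a standard search:

curl -XGET 'localhost:9200/_river/_search?pretty'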

4) Verify how many MongoDB documents have been indexed into Elastic Search (in the index configured previously via the _river meta information):

curl -XGET 'localhost:9200/mongo_books/_count?pretty'
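Beyond the raw count, a quick search can confirm that the documents are actually searchable; the field name title below is only an example and depends on the shape of your MongoDB documents:

curl -XGET 'localhost:9200/mongo_books/_search?q=title:example&pretty'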

5) For 1 million documents you might find that processing everything takes a long time, for example more than one hour. To speed it up, you need to drop the _river and recreate it with a larger bulk size value (for example 500, as in the configuration below; the drop and recreate commands are shown after it):

{
  "type": "mongodb",
  "mongodb": {
    "servers": [
      { "host": "localhost", "port": 27111 },
      { "host": "localhost", "port": 27112 },
      { "host": "localhost", "port": 27113 }
    ],
    "credentials": [
      { "db": "admin", "user": "adminUser", "password": "adminPassword" }
    ],
    "db": "library",
    "collection": "books"
  },
  "index": {
    "name": "mongo_books",
    "bulk_size": 500
  }
}
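Assuming the river is still named mongolink and the updated configuration has been saved to the same file as before, dropping and recreating it looks like this:

curl -XDELETE 'localhost:9200/_river/mongolink'

curl -XPUT 'localhost:9200/_river/mongolink/_meta' -d @../config/mongo_transfer_config.json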

In the Elastic Search log you can see how many documents are actually processed per second; it may be fewer than the configured number. A value that appears more often than others can usually be observed, and this value gives you the practical limit for optimization at this level. Beyond it, the response time gains from increasing the bulk size become insignificant, which is a sign that other options should be considered (such as increasing hardware resources or other applicable Elastic Search level tuning).
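If you prefer not to dig through the logs, a rough throughput estimate can also be obtained by polling the count endpoint from step 4 and comparing consecutive values; this is only an illustrative sketch:

# Rough indexing throughput: print the document count every 10 seconds
while true; do
  echo "$(date +%T) $(curl -s 'localhost:9200/mongo_books/_count')"
  sleep 10
done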