DRMC

Stretch Armstrong -- because 'DSpace multi-million record cloud-based stress test' takes too long to say

At OhioLINK we've built a federation of DSpace instances across the state called the Digital Resource Commons, and we hope to expand our offerings beyond the academic library community. As part of that research, we are performing a multi-million record stress test of DSpace software using Amazon's Elastic Compute Cloud (EC2). This page holds some of our early results.

Batch Submission
We noticed longer batch submission times as our production repository grew beyond 200,000 items, and knew we needed to find a solution if we hoped to scale to the 10-20 million record level. (Of course, we are also concerned with the impact of repository size on browse and search functionality.) Amazon's cloud computing web services gave us a perfect platform for rapidly building and testing multiple virtual machine configurations.

The graph below shows the results of several batch submissions performed on an Amazon Machine Image created with a stock installation of DSpace 1.5.2. The image runs on a default (Small) EC2 machine instance: 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit*), 160 GB of instance storage, and a 32-bit Linux OS.

This process was primarily designed to confirm the results of the ROAD Project test mentioned by Stewart Lewis. In Stewart's scenario, the entire 300,000 record submission took place at one time. We wanted to see whether the problem exists even when the submission is broken up into several smaller blocks and takes place over a period of days. To reduce complexity in the data set, we replicated a single electronic journal article in PDF and JPG format, and submitted it along with its descriptive metadata in batches of 10, 20, or 30 thousand.
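
For readers who want to build a similar data set, here is a minimal sketch of replicating one item into a batch. It assumes the items are laid out in the DSpace Simple Archive Format (one directory per item holding dublin_core.xml, a contents file listing the bitstreams, and the files themselves). The function name, the template directory, and the item_NNNNNN naming are illustrative assumptions, not the script we used.

```shell
#!/bin/sh
# Replicate one template item into COUNT numbered item directories.
# Usage: make_batch TEMPLATE_DIR DEST_DIR COUNT
#   TEMPLATE_DIR (hypothetical) holds dublin_core.xml, contents,
#   and the bitstream files for a single item.
make_batch() {
  template=$1
  dest=$2
  count=$3
  mkdir -p "$dest" || return 1
  i=1
  while [ "$i" -le "$count" ]; do
    # Zero-padded directory names keep the items in listing order.
    item_dir=$(printf '%s/item_%06d' "$dest" "$i")
    mkdir "$item_dir" || return 1
    cp "$template"/* "$item_dir"/ || return 1
    i=$((i + 1))
  done
}
```

Calling make_batch ./template ./batch_10k 10000 would produce a 10,000 item source directory that can be handed to the batch importer.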

<graph>

(*One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.)

As you can see, the slope of this line is constant across multiple loads and multiple days. "Houston, we have a problem."

Increasing Heap Space
Raising the Java heap limit by editing the dsrun file in the [dspace_install]/bin directory does not increase load speed; in fact, our numbers show a slight decrease. However, the batch process is far less likely to stop with an out-of-memory error, which greatly improves its ability to handle large runs of data at one time.
#Allow user to specify java options through JAVA_OPTS variable
if [ "$JAVA_OPTS" = "" ]; then
   #Default Java to use 256MB of memory
   JAVA_OPTS=-Xmx256m
fi
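
Because the script only applies its default when JAVA_OPTS is empty, the heap can also be raised for a single run from the environment instead of editing the file. The sketch below shows that approach. The helper name and the 1024 MB value are illustrative assumptions, not settings from our tests.

```shell
#!/bin/sh
# Build a -Xmx option from a heap size given in megabytes.
heap_opts() {
  echo "-Xmx${1}m"
}

# 1024 MB is an illustrative value, not a figure from our tests.
JAVA_OPTS=$(heap_opts 1024)
export JAVA_OPTS

# dsrun keeps a non-empty JAVA_OPTS instead of applying its 256 MB default.
echo "dsrun will start Java with: $JAVA_OPTS"
```

Any dsrun command started from the same shell afterwards picks up the larger heap.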





ItemImport Performance
Obviously, if we are going to successfully test multiple millions of records, we need either to fix the batch submission process or to find a way to place items in DSpace without using it.

    Fixing ItemImport - Performance Report
    We ran a batch submission of 4,328 items five times in a row and watched the 'most expensive' methods. All of them stayed in roughly the same range except org.dspace.core.Context.commit(), whose average time rose from 124 ms in the first run to 291, 386, 664, and 670 ms in the following four.
    
 Method  Calls  Cumulative Time (ms)  Method Time (ms)  Avg. Cum. Time (ms)  Avg. Method Time (ms)  Catches  Exception Exits
 org.dspace.content.Bundle.createBitstream(java.io.InputStream)  20,507  1,734,168  1,734,168  85  85  0  0
 org.dspace.core.Context.commit()  4,328  535,418  535,418  124  124  0  0
 org.dspace.content.InstallItem.installItem(org.dspace.core.Context, org.dspace.content.InProgressSubmission, java.lang.String)  4,328  236,297  236,297  55  55  0  0
 org.dspace.content.Item.createBundle(java.lang.String)  17,312  190,970  190,970  11  11  0  0
 javax.xml.parsers.DocumentBuilder.parse(java.io.File)  4,328  142,118  142,118  33  33  0  0
 org.dspace.content.WorkspaceItem.create(org.dspace.core.Context, org.dspace.content.Collection, boolean)  4,328  59,249  59,249  14  14  0  0
 java.io.PrintStream.println(java.lang.String)  117,443  54,145  54,145  0  0  0  0
 java.io.BufferedReader.readLine()  24,835  42,925  42,925  2  2  0  0
 org.dspace.content.Bitstream.update()  20,507  25,250  25,250  1  1  0  0
 org.apache.xpath.XPathAPI.selectNodeList(org.w3c.dom.Node, java.lang.String)  8,656  18,744  18,744  2  2  0  0
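
To put the commit() growth in perspective, the rough arithmetic below multiplies the number of commit() calls by the average time per call for each of the five runs. It assumes every run made the same 4,328 calls as the first. The function name is illustrative.

```shell
#!/bin/sh
# Estimated total time spent in Context.commit() for one run:
# calls per run multiplied by the average time per call (ms).
commit_total_ms() {
  echo $(( $1 * $2 ))
}

# Assumption: each of the five runs made 4,328 commit() calls.
calls=4328
run=1
for avg_ms in 124 291 386 664 670; do
  total=$(commit_total_ms "$calls" "$avg_ms")
  echo "run $run: ${avg_ms} ms/commit, about $((total / 1000)) s in commit()"
  run=$((run + 1))
done
```

For the first run this gives about 536 seconds, in line with the 535,418 ms cumulative time reported above. By the fifth run the estimate is about 2,899 seconds, or roughly 48 minutes spent in commit() alone for the same 4,328 items.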

    Garbage Collection efficiencies
    <graph>

    Bypassing ItemImport
    (SRB/iRODS, database replacement, multiple assetstores, Hadoop, etc.)

CPU Utilization - Amazon EC2 Instance Types
Two tests reveal information about the CPU utilization of the DSpace submission process on multi-core machine instances. There is plenty of headroom for a High-CPU Medium instance to handle additional import streams without slowing down the process.


Small Instance

1.7 GB memory
1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit)
160 GB instance storage (150 GB plus 10 GB root partition)
32-bit platform
I/O Performance: Moderate
Price: $0.10 per instance hour



High-CPU Medium Instance

1.7 GB of memory
5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each)
350 GB of instance storage
32-bit platform
I/O Performance: Moderate
Price: $0.20 per instance hour
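
One way to use that headroom is to start several import streams at once. This is a minimal sketch that runs one import command per source directory in parallel and waits for all of them. The import command is a hypothetical wrapper script around the batch importer, and we have not established here whether concurrent imports into the same repository are safe or whether the database becomes the bottleneck.

```shell
#!/bin/sh
# Run one import command per source directory, all in parallel,
# then wait for every stream to finish.
# Usage: run_streams IMPORT_CMD DIR...
#   IMPORT_CMD is a hypothetical wrapper that imports one directory.
run_streams() {
  import_cmd=$1
  shift
  pids=""
  for src in "$@"; do
    "$import_cmd" "$src" &
    pids="$pids $!"
  done
  # Count the streams that exited with a non-zero status.
  failures=0
  for pid in $pids; do
    wait "$pid" || failures=$((failures + 1))
  done
  return "$failures"
}
```

On a two-core instance, run_streams ./import_one.sh batch_a batch_b would start two streams, where import_one.sh stands in for whatever command performs a single import.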



More Power!
As with any software development issue, there are two obvious solutions: fine tune the code, or throw more hardware at the problem. Performing an identical batch submission on these two machine types shows there is a noticeable benefit to be gained from beefing up the system.


Swapping PostgreSQL for Oracle
Since DSpace allows for changing the database from Postgres to Oracle by changing a few configuration file parameters, and since Oracle 11g is AWS compatible, it makes sense to swap the back end to Oracle and run some tests using the Oracle SQL Tuner and the Automatic Tuning Optimizer.
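
As a rough sketch of that configuration change, the function below rewrites the database settings in a copy of dspace.cfg. The db.name, db.url, and db.driver keys are given as we recall them from the stock file, so check them against your own copy. The connection URL (host, port, service name) is a placeholder assumption, and credentials and any build-time steps are not covered.

```shell
#!/bin/sh
# Rewrite the database settings in a copy of dspace.cfg.
# Usage: use_oracle CFG_IN CFG_OUT
# The URL below (host, port, service name) is a placeholder assumption.
use_oracle() {
  sed \
    -e 's|^db\.name *=.*|db.name = oracle|' \
    -e 's|^db\.url *=.*|db.url = jdbc:oracle:thin:@//localhost:1521/xe|' \
    -e 's|^db\.driver *=.*|db.driver = oracle.jdbc.OracleDriver|' \
    "$1" > "$2"
}
```

Writing to a separate output file leaves the PostgreSQL configuration intact, so the two back ends can be compared by swapping files.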
<data goes here>


Search and Browse timings
<Selenium test data>


Cloud Computing Expenditures
- pay-as-you-go computing
For those interested in the costs we've encountered during this process, here's a slightly redacted billing page. You will notice that during the last week we performed nearly 240 hours of testing, during which time we created and destroyed more than a dozen machines.
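
As a back-of-the-envelope check, the sketch below prices 240 instance hours at the two hourly rates listed earlier. It assumes the 240 hours are instance hours and ignores storage and bandwidth charges, so the real bill will differ. The function names are illustrative.

```shell
#!/bin/sh
# Instance-hour cost in whole cents: hours multiplied by the hourly rate.
instance_cost_cents() {
  echo $(( $1 * $2 ))
}

# Print an amount in cents as dollars and cents.
format_dollars() {
  printf '$%d.%02d\n' $(( $1 / 100 )) $(( $1 % 100 ))
}

hours=240
# Lower bound: every hour on a Small instance (10 cents per hour).
format_dollars "$(instance_cost_cents "$hours" 10)"
# Upper bound: every hour on a High-CPU Medium instance (20 cents per hour).
format_dollars "$(instance_cost_cents "$hours" 20)"
```

Under those assumptions the week of testing falls between $24 and $48 in instance charges, depending on the mix of instance types.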


Making your own DSpace 1.5.2 instance in the Cloud
The OhioLINK Amazon Machine Image is currently private. However, the DSpace Foundation is testing a pre-built and pre-configured virtual application that promises to run on VMware, Xen, Parallels, Virtual Iron, Microsoft Virtualization, and Amazon EC2.