Batch Submission

We have noticed longer batch submission times as our production repository grew beyond 200,000 items, and we knew we needed to find a solution if we hoped to scale to the 10-20 million record level. (Of course, we are also concerned with the impact of repository size on browse and search functionality.) Amazon's Cloud Computing web services gave us a perfect platform for rapidly building and testing multiple virtual machine configurations.

The graph below shows the results of several batch submissions performed on an Amazon Machine Image created with a stock installation of DSpace 1.5.2. It uses a default EC2 machine instance with 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit*), 160 GB of instance storage, and a 32-bit Linux OS.

(*One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.)

This process was primarily designed to confirm the results of the ROAD Project test mentioned by Stuart Lewis. In Stuart's scenario, the entire 300,000 record submission took place at one time. We wanted to see if the problem exists even when the submission is broken up into several smaller blocks and takes place over a period of days. To reduce complexity in the data set, we replicated a single electronic journal article in PDF and JPG format, and submitted it along with its descriptive metadata in batches of 10, 20, or 30 thousand.

As you can see, the slope of this line is constant across multiple loads and multiple days. "Houston, we have a problem."

Increasing Heap Space

Increasing the Java heap size by editing the dsrun file in the [dspace_install]/bin directory does not increase load speed; in fact, our numbers show a slight decrease. However, the ability of the batch process to continue operating without an out-of-memory error increases, greatly improving the ability of the batch submission process to handle large runs of data at one time.
#Allow user to specify java options through JAVA_OPTS variable
if [ "$JAVA_OPTS" = "" ]; then
  #Default Java to use 256MB of memory
  JAVA_OPTS=-Xmx256m
fi

ItemImport Performance

Obviously, if we are going to successfully test multiple millions of records, we either need to fix the batch submission process or find a way to place items in DSpace without using it.

Fixing ItemImport - Performance Report

We ran a batch submission of 4,328 items five times in a row and watched the 'most expensive' methods. All of these remained in roughly the same general range except org.dspace.core.Context.commit(), whose average rose from 124 ms in the first run to 291, 386, 664, and 670 ms in the following runs.
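To put the Context.commit() figures in perspective, the short sketch below expresses each run's average commit time as a multiple of the first run. It uses only the five averages reported above; the names commit_avg_ms and commit_growth are our own illustrative choices and are not part of DSpace or of the profiler output.

```python
# Average Context.commit() time in ms for the five consecutive
# 4,328-item batch runs, as reported in the performance report.
commit_avg_ms = [124, 291, 386, 664, 670]


def commit_growth(averages):
    """Return each run's average as a multiple of the first run's average."""
    baseline = averages[0]
    return [round(avg / baseline, 2) for avg in averages]


growth = commit_growth(commit_avg_ms)
for run, (avg, factor) in enumerate(zip(commit_avg_ms, growth), start=1):
    print(f"run {run}: {avg} ms average commit, {factor}x the first run")
```

By the fifth run the average commit takes roughly 5.4 times as long as it did in the first run, while the other profiled methods stayed in about the same range.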
<graph>

Bypassing ItemImport - (SRB/iRODS, database replacement, multiple assetstores, Hadoop, etc.)

CPU Utilization - Amazon EC2 Instance Types

Two tests reveal information about the CPU utilization of the DSpace submission process on multi-core machine instances. There is plenty of headroom for a High-CPU Medium instance to handle additional import streams without slowing down the process.

Small Instance 1.7 GB memory
High-CPU Medium Instance 1.7 GB of memory

More Power!

Like any software development issue, there are two obvious solutions: fine-tune the code, or throw more hardware at the problem. Performing an identical batch submission on these two machine types shows there is noticeable benefit to be gained from beefing up the system.

Swapping PostgreSQL for Oracle

Since DSpace allows for changing the database from Postgres to Oracle by changing a few configuration file parameters, and since Oracle 11g is AWS compatible, it makes sense to swap the back end to Oracle and run some tests using the Oracle SQL Tuner and the Automatic Tuning Optimizer.

<data goes here>

Search and Browse timings

<Selenium test data>

Cloud Computing Expenditures - pay-as-you-go computing

For those interested in the costs we've encountered during this process, here's a slightly redacted billing page. You will notice that during the last week we performed nearly 240 hours of testing, during which time we created and destroyed more than a dozen machines.

Making your own DSpace 1.5.2 instance in the Cloud

The OhioLINK Amazon Machine Image is currently private. However, the DSpace Foundation is testing a pre-built and pre-configured virtual application that promises to run on VMware, Xen, Parallels, Virtual Iron, Microsoft Virtualization, and Amazon EC2. Download the VM here.

Testing Bibliography

Here is a list of research we are attempting to build upon. If you know of any other articles/reports we should consider, please contact John Davison (john@ohiolink.edu).
http://blog.stuartlewis.com/2009/01/19/dspace-at-a-third-of-a-million-items/
http://www.dspace.org/images/stories/ist2008_paper_submitted1.pdf
http://www.digitalpreservation.gov/partners/aiht/high/DSpace_AIHT.pdf
http://www.dlib.org/dlib/may09/marill/05marill.html
http://fedora.fiz-karlsruhe.de/docs/Wiki.jsp?page=Main






