Homework 3:

Parallelizing Genome Assembly

Introduction

This assignment is an introduction to writing distributed memory programs for applications with irregular communication patterns. Here, we'll be implementing a distributed hash table using UPC++. We'll use our distributed hash table to evaluate one stage of a de novo genome assembly pipeline.

Due Date: Wednesday, April 3

For Remote Students

Dear remote students, we are thrilled to be a part of your parallel computing learning experience and to share these resources with you! To avoid confusion, please note that the assignment instructions, deadlines, and other assignment details posted here were designed for the local students. You should check with your local instruction team about submission, deadlines, job-running details, etc. and utilize Moodle for questions. With that in mind, the problem statement, source code, and references should still help you get started (just beware of institution-specific instructions). Best of luck and we hope you enjoy the assignment!

Background

Here is some background on de novo genome assembly. Strictly speaking, you don't need to understand all of this to complete the assignment, but it is interesting. DNA assembly is the determination of the precise order of the nucleotides in a DNA molecule. A DNA molecule consists of four different bases, namely adenine (A), guanine (G), cytosine (C), and thymine (T). For the purposes of this assignment, we consider a DNA molecule to be a single DNA strand, e.g. CTAGGAGCT (although in reality the number of bases in an actual DNA strand is much larger -- on the order of billions). De novo genome assembly is the assembly of strands of DNA into new genomes, without the use of a reference genome.

Unfortunately, we cannot read a whole DNA strand in one go, so researchers have invented alternative methods to reconstruct a genome. One such method is called shotgun sequencing. In this method, many copies of the original DNA strand are made, and each copy is then fragmented randomly into short contiguous pieces; we cannot control where the copies are split. Next, we read each small fragment; each result is called a short read. Finally, we use all the short reads to reconstruct the original DNA strand. Figure 1 shows the process.

Figure 1: Shotgun sequencing

source: http://people.mpi-inf.mpg.de/~sven/images/assembly.png

The short reads returned from shotgun sequencing are significantly shorter than the actual DNA strand from which they were generated, so we must somehow combine them to create one long strand. In addition, these short reads include sequencing errors, which makes assembling these fragments more difficult. There are methods to preprocess the short reads and remove the errors; however, these methods are outside of the scope of this homework assignment and we refer the interested reader to [1].

The output of this first preprocessing stage of the de novo genome assembly pipeline is a set of unique DNA sequence fragments of length k, which we call k-mers. K-mers represent error-free DNA segments. Each k-mer is associated with a forward and backward extension. A k-mer's backward extension is the base that precedes the k-mer in the DNA sequence, and its forward extension is the base that follows the k-mer in the DNA sequence.

Given a set of unique k-mers, we can build a special graph which is a compact representation of the connectivity among these k-mers. This special type of graph is called a de Bruijn graph, and in general a de Bruijn graph is used to represent overlaps between sequences of symbols.

In our particular case, the vertices of our de Bruijn graph are k-mers, and an edge connects two k-mers that overlap by k-1 bases. Since DNA is linear, each vertex in our de Bruijn graph is guaranteed to have at most two neighbors. Additionally, each vertex in the de Bruijn graph is unique, since the k-mers are unique. An example of such a de Bruijn graph is shown in Figure 2, which illustrates a graph with k = 3. In this particular graph, nodes ATC and TCT are connected with an edge because they overlap in 2 bases (TC).

Figure 2: A de Bruijn graph with k = 3. We can identify three connected components (contigs) and six start nodes: GAT, TGA, AAT, TGC, AAC, CCG. The contigs represented in this graph via connected components are: GATCTGA, AACCG and AATGC.

After building the de Bruijn graph, we can traverse it and find connected components called contigs. Note that these connected components should be linear chains due to the nature of DNA. Contigs are error-free DNA sequences significantly longer than the original reads. In later stages of the assembly pipeline, contigs will be linked together by leveraging information from the original reads to eventually produce a set of scaffolds which constitute the final result. In this assignment, we will be focusing on the parallel construction and traversal of the de Bruijn graph of k-mers (with a few assumptions and simplifications) and we refer the interested reader to [1] for more information on other stages of the pipeline.

Problem Statement

The input to our problem is a set of unique k-mers, which are sequences of DNA bases of length k. Each k-mer is associated with a backward extension, which is the base that precedes the k-mer in the DNA sequence, and a forward extension, which is the base that follows the k-mer in the DNA sequence.

Our k-mers are guaranteed to be unique and to overlap one another by exactly k-1 bases. Our goal is to traverse these k-mers to produce a series of contigs, which are longer contiguous segments of DNA. For a particular k-mer, we can determine the next k-mer in the sequence by taking the last k-1 bases and appending the forward extension. In Python slicing notation, this is next_kmer = kmer[1:] + forward_extension.
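
For concreteness, here's the same computation as a minimal C++ sketch on a plain character string. This is illustrative only -- the starter code actually stores k-mers in a packed binary form (pkmer_t), so the real implementation looks different, but the logic is the same.

#include <string>

// Compute the next k-mer: drop the first base and append the forward extension.
// Sketch on a plain std::string; the starter code uses packed k-mers instead.
std::string next_kmer(const std::string &kmer, char forward_ext) {
    return kmer.substr(1) + forward_ext;
}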

Some k-mers are special, in that they have the forward or backward extension F (which is not one of the regular A, T, C, or G bases we'd expect in DNA). The special base F indicates that the k-mer begins or ends a contig. We call k-mers with the backward extension F start k-mers, since they are the k-mers which start a contig. Similarly, k-mers with the forward extension F are end k-mers, since they end a contig.

K-Mer Traversal Algorithm

To generate contigs, we will begin by inserting all the k-mers into a hash table. While we are inserting the k-mers, we will check for any special start k-mers with backward extension F and append them to a list of start k-mers. After building the hash table and identifying our start k-mers, we can generate a contig for each start k-mer by searching for the next k-mer in the sequence until we find an end k-mer with forward extension F, which signifies the end of the contig.

Here's what this looks like in Python pseudocode.

start_kmers = list()
kmer_hash = dict()

kmers = load_kmer_dataset()  # placeholder: read the k-mer dataset from a file

# Build hash table, collect start k-mers.
for kmer in kmers:
    # Suppose kmer.kmer is the k-mer string and
    # kmer is an object representing the k-mer,
    # including the forward and backward extension.
    kmer_hash[kmer.kmer] = kmer
    if kmer.backward_ext == 'F':
        start_kmers.append(kmer)

contigs = list()

# Traverse contigs, each starting with a start k-mer.
for start_kmer in start_kmers:
    contig = list()
    contig.append(start_kmer)
    while contig[-1].forward_ext != 'F':
        next_kmer = contig[-1].next_kmer()
        contig.append(kmer_hash[next_kmer])
    contigs.append(contig)

Your Job

In the starter code, you will find a serial implementation of the contig generation stage of the de novo genome assembly pipeline. Your job is to parallelize this code using UPC++.

Starter Code

You can download the starter code here: https://cs.berkeley.edu/~brock/cs267/hw/cs267_hw3_2019.tar.gz

Here's a rundown of the files you'll find in the starter code tarball. kmer_hash.cpp and hash_map.hpp are the main files you'll be modifying. You won't necessarily need to understand how the other source files work, just how to use the kmer_pair and pkmer_t data structures.

cs267_hw3_2018 (a folder)
  |----Makefile         - Builds the kmer_hash binary, which does contig generation.
  |----kmer_hash.cpp    - The main kmer hash file.  You'll need to modify this a bit.
  |----hash_map.hpp     - The hash table.  This is the main data structure you'll be parallelizing.
  |----kmer_t.hpp       - Data structure which defines a kmer_pair, which is a k-mer and its extensions.
  |----pkmer_t.hpp      - Data structure which contains a k-mer as a packed binary string.
  |----packing.hpp      - Code that packs a k-mer character string into a binary string.
  |----read_kmers.hpp   - Code which reads k-mers from a file.

Implementing a Hash Table

In hash_map.hpp, you'll find a serial implementation of a hash table using open addressing. If this sounds unfamiliar to you, go ahead and skim the Wikipedia pages on hash tables and open addressing. A hash table uses a hash function to decide at which location in an array to store an item, which allows fast random access by key. When two keys produce the same hash value, there is a collision, since we cannot store two items at the same location in the array. One way to resolve these collisions is open addressing, which probes the array with some deterministic probing strategy until a free spot is found. The starter code uses open addressing with linear probing, which means we simply try the next slot in the array until we find a free one for our item.
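
To make the idea concrete, here is a minimal serial sketch of open addressing with linear probing. This is illustrative only and is not the starter code's exact implementation; the slot layout and the (trivial) hash function here are made up.

#include <cstdint>
#include <vector>

// A toy open-addressing table with linear probing (illustrative sketch).
struct SlotTable {
    std::vector<std::uint64_t> keys, values;
    std::vector<bool> used;   // which slots are occupied
    size_t size;

    explicit SlotTable(size_t n) : keys(n), values(n), used(n, false), size(n) {}

    bool insert(std::uint64_t key, std::uint64_t value) {
        std::uint64_t hash = key;  // stand-in for a real hash function
        for (size_t probe = 0; probe < size; probe++) {
            size_t slot = (hash + probe) % size;  // linear probing: try the next slot
            if (!used[slot]) {
                used[slot] = true;
                keys[slot] = key;
                values[slot] = value;
                return true;
            }
        }
        return false;  // table is full
    }

    bool find(std::uint64_t key, std::uint64_t &value_out) const {
        std::uint64_t hash = key;
        for (size_t probe = 0; probe < size; probe++) {
            size_t slot = (hash + probe) % size;
            if (!used[slot]) return false;  // reached an empty slot: key not present
            if (keys[slot] == key) {
                value_out = values[slot];
                return true;
            }
        }
        return false;
    }
};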

In Distributed Memory

In order to parallelize the code, you'll need to modify the hash map's data and used data members so that they refer to distributed objects, and then modify the insert() and find() methods accordingly. A common way to implement a distributed array in UPC++ is to create a C++ vector of upcxx::global_ptrs, each pointing to an array allocated in the shared segment of one rank. You can then view the distributed array as one logically contiguous array and use index arithmetic to read and write the appropriate location in shared memory.
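
As a rough illustration of that pattern, here is a sketch of how such a distributed array might be set up and written to. This is a sketch only, not the required design; local_size and the index arithmetic are made-up examples, and synchronization and error handling are omitted.

#include <upcxx/upcxx.hpp>
#include <vector>

int main() {
    upcxx::init();

    size_t local_size = 1024;  // slots owned by each rank (example value)

    // Each rank allocates its slice of the array in its shared segment...
    upcxx::global_ptr<int> my_slice = upcxx::new_array<int>(local_size);

    // ...and every rank collects global pointers to all slices.
    std::vector<upcxx::global_ptr<int>> slices(upcxx::rank_n());
    for (int r = 0; r < upcxx::rank_n(); r++) {
        slices[r] = upcxx::broadcast(my_slice, r).wait();
    }

    // To write to logical global index i, arithmetic picks the owning rank
    // and the offset within that rank's slice.
    size_t i = 12345 % (local_size * upcxx::rank_n());
    upcxx::rput(42, slices[i / local_size] + (i % local_size)).wait();

    upcxx::barrier();
    upcxx::finalize();
    return 0;
}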

Note:

It is possible to write a trivial distributed hash table in UPC++ by using remote procedure calls to insert() or find() an item in a local std::unordered_map or std::map. Do not submit this as your parallel solution, or you will lose credit.

Building and Running the Code

Note that in order to compile and run the code, you'll first need to load the appropriate UPC++ modules.

module swap PrgEnv-intel PrgEnv-gnu
module load upcxx

You should then be able to build and run the starter code. Note that we now recommend using salloc to run a single command (as shown below) rather than holding an interactive session open, since this has a low turnaround time and avoids burning compute hours unnecessarily.

[demmel@cori10 cs267_hw3_2018]$ make
CC kmer_hash.cpp -o kmer_hash `upcxx-meta PPFLAGS` `upcxx-meta LDFLAGS` `upcxx-meta LIBFLAGS`
[demmel@cori10 cs267_hw3_2018]$ salloc -N 1 -A mp309 -t 10:00 -q debug --qos=interactive -C haswell srun -N 1 -n 1 ./kmer_hash [my_dataset]

[my_dataset] is the dataset of k-mers you'd like to assemble contigs for. You can find k-mer datasets in /project/projectdirs/mp309/cs267-spr2018/hw3-datasets.

Here's a rundown of the various datasets.

  |----human-chr14-synthetic.txt  - 51-mers, a large dataset based on human chromosome 14.
  |----large.txt                  - 51-mers, a slightly smaller dataset.
  |----test.txt                   - 19-mers, a smaller dataset for testing.
  |----smaller (a folder, smaller datasets only for debugging)
       |----small.txt             - 19-mers, a small dataset for testing (1000 contigs)
       |----little.txt            - 19-mers, an even smaller dataset for testing (100 contigs)
       |----verysmall.txt         - 19-mers, a very small dataset for testing (10 contigs)
       |----tiny.txt              - 19-mers, a tiny dataset for testing (1 contig)

Recompiling for Different Sizes of K-Mers

Note that to run on the larger datasets, human-chr14-synthetic.txt and large.txt, you'll need to recompile your code to handle 51-mers instead of 19-mers. Simply modify the KMER_LEN macro at the top of packing.hpp, then recompile. You'll need to do the same thing in reverse if you later want to switch back to 19-mers. If you try to run your code on a dataset with the wrong kind of k-mer, it will automatically exit with an error telling you to modify packing.hpp and recompile.
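
For example, assuming the macro is a plain #define (the exact line in packing.hpp may differ), the change for the larger datasets would look something like this:

// At the top of packing.hpp (illustrative): switch from 19-mers to 51-mers,
// then run make again. Change it back to 19 for the smaller datasets.
#define KMER_LEN 51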

Testing Correctness

You'll need to test that your parallel code is correct. To do this, run your parallel code with the optional test argument, passed after the dataset name (as in the example below). This will cause each process to print its generated contigs to a file test_[rank].dat, where [rank] is the process's rank. To compare, combine and sort the output files, then compare the result to the reference solutions located in the same directories as the input files.

[demmel@cori10 cs267_hw3_2018]$ salloc -N 1 -A mp309 -t 10:00 -q debug --qos=interactive -C haswell srun -N 1 -n 32 ./kmer_hash /global/project/projectdirs/mp309/cs267-spr2018/hw3-datasets/test.txt test
[demmel@cori10 cs267_hw3_2018]$ cat test*.dat | sort > my_solution.txt
[demmel@cori10 cs267_hw3_2018]$ diff my_solution.txt /global/project/projectdirs/mp309/cs267-spr2018/hw3-datasets/test_solution.txt

If diff prints output describing differences between the files, there's an issue with your code; if it prints nothing, your code is correct. You can also use tools like md5sum or shasum to check whether your solution is correct. Note that you should remove your output files (rm test*.dat) between test runs.

Optimizing File I/O

File I/O between the project directory (where we've stashed the datasets) and the compute nodes can be quite slow, particularly for large files. We recommend you (1) copy the datasets to a folder in your scratch space and (2) set the folder with your datasets to be striped. Striping will spread out the files in your directory so that different parts are located on different physical hard disks (in Lustre file system lingo, these are called "OSTs", or object storage targets). This can sometimes slow down serial I/O slightly, but will significantly increase I/O performance when reading one file from multiple nodes.

[demmel@cori10 cs267_hw3_2018]$ cd $SCRATCH
[demmel@cori10 demmel]$ mkdir my_datasets
[demmel@cori10 demmel]$ lfs setstripe my_datasets -c 72 -s 8M
[demmel@cori10 demmel]$ cd my_datasets
[demmel@cori10 my_datasets]$ cp /project/projectdirs/mp309/cs267-spr2018/hw3-datasets/* .

You can then call kmer_hash with the datasets in $SCRATCH/my_datasets. Note that this is optional and will not improve your timed performance, since we don't time file I/O in kmer_hash; however, your runs will finish faster. If you like, you can read more about file system performance here.

Scaling Experiments

Once your parallel code is working and optimized, run scaling experiments to see how good your performance is. Use the test.txt, large.txt, and human-chr14-synthetic.txt datasets.

  1. Run single-node experiments, using 1, 2, 4, 8, 16, 32, and 64 processes. Graph your parallel program's runtime across the different datasets as you increase the number of processes. Also graph your strong scaling efficiency (the standard formula is given after this list). Are there any odd jumps or trends in your graphs? Did some datasets perform better than others? Include these graphs and an analysis in your report.
  2. Run multi-node experiments, using 1, 2, 4, and 8 nodes with 32 ranks per node. Graph your parallel program's runtime across the different datasets as you increase the number of nodes. Also graph your strong scaling efficiency. Are there any odd jumps or trends in your graphs? Did some datasets perform better than others? Include these graphs and an analysis in your report.
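
For reference (this is the standard definition, not specific to this assignment): if T_1 is the runtime on one process (or one node, for the multi-node experiments) and T_p is the runtime on p processes (or nodes), strong scaling efficiency is

  efficiency(p) = T_1 / (p * T_p)

i.e. the speedup divided by p; a perfectly scaling program has efficiency 1.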

Note

Note that when running larger datasets on a single node, you may need to increase your UPC++ shared memory segment size. This allows you to allocate more shared arrays and objects. You can do this by setting the UPCXX_SEGMENT_MB environment variable to a larger value. You might also need to increase GASNET_MAX_SEGSIZE.

Submission

Your submission should consist of a tarball with your parallel solution to the contig generation problem as well as a report PDF and a text file containing your team members. Your submission should have a filename of the format LastName1_LastName2_hw3.tar.gz. Here's an example of what we might expect when opening your submission.

Demmel_Yelick_Buluc_hw3 (a folder)
   |--source files
   |--Makefile
   |--report.pdf
   |--members.txt
   |--bonus (a folder, optional--only include if you have code for the extra credit parts)

Report

The report should consist of the following points.

  1. Graphs and discussion of the scaling experiments listed above.
  2. Describe your implementation--how is your hash table organized? What data structures do you use?
  3. Any optimizations you tried and how they impacted performance.
  4. A discussion of how using UPC++ influenced your design choices. How might you have implemented this if you were using MPI? If you were using OpenMP?


Extra Credit

Include an Extra Credit section in your report if you complete any extra credit items. If any of the extra credit problems you complete have associated code, include that code in a bonus folder in your submission tarball.

Short Answer - Contig F Extensions

What if we did not have the base F marking the beginning and end of contigs? Suppose instead we had a random base as an extension whose corresponding k-mer did not exist in the dataset. How would this affect the contig generation algorithm? How would it affect your implementation?

Short Answer - Load Balancing

Suppose start k-mers are unevenly distributed throughout the dataset, and this creates a load balancing problem. How might you deal with this?

Short Answer - Parallelizability of Graph Traversal

Suppose that the underlying de Bruijn graph consists of p connected components (p is the number of processors) whose lengths (numbers of vertices) follow a power-law distribution. How well does your parallel code perform on this input? Can you do better (algorithmically)? What if the graph consists of a single connected component?

Experiment - KNL

Recompile your code for Cori Phase II, which has Intel Xeon Phi KNL processors, and run your scaling experiments again. How does the performance compare to that on Cori Phase I, which has Intel Haswell processors? How might your results be influenced by the CPUs or the networking hardware?

Experiment - RPC Hash Table

It's possible to implement a very trivial hash table in UPC++ which uses remote procedure calls to trigger an insert() or find() operation in a local std::unordered_map on a remote process. Implement this. How do the results compare to the hash table that you implemented? Discuss why the performance is or is not different. Does the way you're using the network affect your results? Any differences on Cori Phase II, which has Intel Xeon Phi KNL processors?
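
For reference, the RPC-based approach might be sketched roughly as follows. This is a sketch only; the class and member names, the key-to-owner mapping, and the string value type are made up for illustration.

#include <upcxx/upcxx.hpp>
#include <functional>
#include <string>
#include <unordered_map>

// Sketch of the RPC-based hash table described above: every rank owns a local
// std::unordered_map, and insert/find are RPCs to the owning rank.
struct RpcHashMap {
    // dist_object must be constructed collectively (by every rank).
    upcxx::dist_object<std::unordered_map<std::string, std::string>> local_map;

    RpcHashMap() : local_map(std::unordered_map<std::string, std::string>()) {}

    static int owner(const std::string &key) {
        return std::hash<std::string>{}(key) % upcxx::rank_n();
    }

    upcxx::future<> insert(const std::string &key, const std::string &val) {
        return upcxx::rpc(owner(key),
            [](upcxx::dist_object<std::unordered_map<std::string, std::string>> &map,
               const std::string &k, const std::string &v) { (*map)[k] = v; },
            local_map, key, val);
    }

    upcxx::future<std::string> find(const std::string &key) {
        return upcxx::rpc(owner(key),
            [](upcxx::dist_object<std::unordered_map<std::string, std::string>> &map,
               const std::string &k) { return (*map)[k]; },
            local_map, key);
    }
};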

Resources

UPC++

Genome Assembly

  • For general information about Evangelos Georganas' work on parallel genome assembly, upon which this homework is based, the first few chapters of his thesis are excellent.
  • For information about the production version of this phase of the genome assembly pipeline, see this SC14 paper.

Parallel Programming Languages

  • You can read a high-level overview of UPC++ in this extended abstract.
  • Titanium, UPC, and UPC++ are all parallel languages developed here in Berkeley.
  • I find this early paper on Titanium quite interesting, as there are a lot of pioneering features for such an early project. Smart pointers, in distributed memory, in 1998!

References

[1] Jarrod A. Chapman, Isaac Ho, Sirisha Sunkara, Shujun Luo, Gary P. Schroth, and Daniel S. Rokhsar. Meraculous: De novo genome assembly with short paired-end reads. PLoS ONE, 6(8):e23501, 08 2011.