Documentation

Shell commands

Most computational biologists who use analysis pipelines use shell commands, so LoamStream makes these particularly easy using the cmd method, which creates a job to be submitted to a shell (usually Bash).

cmd"echo 'Hello, World!'"

Using triple double quotes, we can use double quotes in commands:

cmd"""echo "Hello, World!" """

Triple double quotes also allow line breaks, which will be removed before the shell sees the command, so the following has the same effect as the previous one:

cmd"""
echo
"Hello, World!"
"""

Loam has variables, which can be inserted into a command before the shell sees it:

val year: Int = 2017
val greeting: String = "Hello"
cmd"""echo "$greeting, World, in the year $year!" """

LoamStream can infer types from values. For example, it understands that 2017 is an Int and that "Hello" is a String, so it infers that the variables year and greeting are an Int and a String respectively, and the type annotations can be omitted:

val year = 2017
val greeting = "Hello"
cmd"""echo "$greeting, World, in the year $year!" """

If you want the shell to see a dollar sign ($), use double dollar ($$):

cmd"""echo "Your home directory is $$HOME" """
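
The same escape rule applies to Scala's standard s interpolator, so it can be tried without LoamStream. The sketch below uses plain Scala only; the object and value names are illustrative and not part of Loam.

```scala
object DollarEscapeSketch {
  val greeting = "Hello"

  // $greeting is substituted by Scala; $$ collapses to one literal dollar sign,
  // so the text $HOME is left for the shell to expand.
  val command: String = s"""echo "$greeting, your home directory is $$HOME" """
}
```

Here DollarEscapeSketch.command contains the literal text $HOME, which Bash would expand only when the command runs.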

Stores

What is a store? It could be a file located by a file path, or it could be some other kind of storage located by a URI. Create a reference to a store using the store method:

val variants = store
val moreVariants = store
val phenotypes = store

To specify a file path, use the at method. If it is an input file (i.e. not created by LoamStream), also add the asInput method:

val variants = store.at("data/variants.txt").asInput
val associations = store.at("results/summary.txt")

If a store is inserted in a command string, LoamStream does two useful things:

  1. It inserts the file path or URI, properly escaped for Bash, in the command line. If no path or URI is known, a temporary path is created.
  2. It marks the store as an input or output of that command: if the store has been marked as an input with the asInput method, or if the store has been used before in a command, then it is considered an input to the command where it is being inserted. Otherwise, it is considered an output.

This allows LoamStream to decide automatically which commands need to run before which other commands, and which commands can run in parallel, based on the input and output stores of each command.

For example, this script

val phaser = "formit"
val imputer = "impute42"
val data = store.at("dataDir/myData.vcf").asInput
val phased = store
val imputed = store.at("outputDir/imputed.vcf")
cmd"$phaser < $data > $phased"
cmd"""$imputer -in $phased -out $imputed"""

will make LoamStream issue the following commands, in this order (assuming the store phased is assigned /tmp/loamstream/file23.vcf):

formit < dataDir/myData.vcf > /tmp/loamstream/file23.vcf
impute42 -in /tmp/loamstream/file23.vcf -out outputDir/imputed.vcf
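
The ordering rule can be sketched in plain Scala. This is a minimal illustration, not LoamStream's scheduler: the Command class, the waves function, and the third command named stats are hypothetical and exist only to show how input and output stores determine order and parallelism.

```scala
object DependencySketch {
  // Hypothetical model: a command plus the names of the stores it reads and writes.
  final case class Command(name: String, inputs: Set[String], outputs: Set[String])

  // Group commands into waves. A command joins a wave once all of its inputs
  // are available, so commands in the same wave could run in parallel.
  def waves(commands: List[Command], available: Set[String]): List[List[Command]] = {
    val (ready, waiting) = commands.partition(_.inputs.subsetOf(available))
    if (ready.isEmpty) Nil // nothing left, or the remaining inputs are never produced
    else ready :: waves(waiting, available ++ ready.flatMap(_.outputs))
  }

  val phase  = Command("formit", Set("data"), Set("phased"))
  val impute = Command("impute42", Set("phased"), Set("imputed"))
  val stats  = Command("stats", Set("data"), Set("summary")) // hypothetical extra command

  // The declaration order does not matter; only the stores do.
  val plan: List[List[String]] =
    waves(List(impute, phase, stats), available = Set("data")).map(_.map(_.name))
}
```

In this sketch, formit and stats both depend only on the input store and land in the first wave, while impute42 has to wait for the phased store.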

Input and output files can also be specified explicitly using the in and out methods:

val baseName = "bigDiverseSample"
val data = store.at(baseName + ".data").asInput
val masks = store.at(baseName + ".masks").asInput
val clusters = store.at(baseName + ".clusters")
val log = store.at(baseName + ".log")
cmd"clustah $baseName".in(data, masks).out(clusters, log)

This will run the following command

clustah bigDiverseSample

while assuming that the input files are bigDiverseSample.data and bigDiverseSample.masks and the output files are bigDiverseSample.log and bigDiverseSample.clusters.

Adjusting paths inserted in command line

Some apps require paths given as command line arguments to be different from the actual paths. Most commonly, a ".gz" suffix is omitted in the command line. This can be achieved in LoamStream as follows:

val raw = store.at("data.vcf.gz").asInput
val filtered = store.at("data.filtered.vcf.gz").asInput
val rawWithoutGz = raw - ".gz"
val filteredWithoutGz = filtered - ".gz"
cmd"filterx -in $rawWithoutGz -out $filteredWithoutGz"

This issues the following command

filterx -in data.vcf -out data.filtered.vcf

while assuming that the file names are actually data.vcf.gz and data.filtered.vcf.gz.
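
The effect of the minus operator on the inserted path can be imitated in plain Scala. This is a sketch under assumptions: PathStore is a hypothetical stand-in for a store, and the rule that the suffix is dropped only when the path actually ends with it is a guess, not documented LoamStream behavior.

```scala
object SuffixSketch {
  // Hypothetical stand-in for a store that only knows its path.
  final case class PathStore(path: String) {
    // Assumption: the suffix is removed only if the path ends with it.
    def -(suffix: String): String = path.stripSuffix(suffix)
  }

  val raw = PathStore("data.vcf.gz")
  val filtered = PathStore("data.filtered.vcf.gz")

  val rawWithoutGz = raw - ".gz"
  val filteredWithoutGz = filtered - ".gz"

  // The command line sees the shortened paths; the stores keep the real ones.
  val commandLine = s"filterx -in $rawWithoutGz -out $filteredWithoutGz"
}
```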

Scattering and Gathering

To exploit parallel computing capacity, data is commonly broken down into parts (scattered), which are processed separately, and the results are then joined (gathered) into a single store again. We could define each store and command by hand, but LoamStream provides a simpler way.

In the following example, the input is one large VCF file, upon which we unleash one hundred commands, each processing a part of the input and producing an output file, and then a final command merges the hundred output files into one.

val data = store.at("yuuugeDataset.vcf").asInput
val nShards = 100
val shardSize = 10000000
val shardeds = (0 until nShards).map { iShard =>
  val start = iShard * shardSize + 1
  val end = (iShard + 1) * shardSize
  val sharded = store.at(s"sharded.$start-$end.vcf")
  cmd"transformer -in $data -range $start-$end -out $sharded"
  sharded
}
val gathered = store.at("yuugeResults.vcf")
cmd"vcf-concat $shardeds > $gathered"

The until method creates a sequence of numbers from 0 to 99, and then for each number, a sharded store and a command are created. For example, the first sharded store is named sharded.1-10000000.vcf, and the first command looks like this:

transformer -in yuuugeDataset.vcf -range 1-10000000 -out sharded.1-10000000.vcf

The variable shardeds is a sequence of stores. Inserting it into a command line inserts the file paths separated by blanks.
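
The range arithmetic can be checked in plain Scala without LoamStream. The sketch below reuses nShards and shardSize from the example; the object name and the ranges and shardPaths values are illustrative, and joining with a single space is only an imitation of how a sequence of stores is inserted.

```scala
object ShardRangeSketch {
  val nShards = 100
  val shardSize = 10000000

  // 1-based, inclusive (start, end) pair for each shard index 0 to 99.
  val ranges: Seq[(Int, Int)] = (0 until nShards).map { iShard =>
    (iShard * shardSize + 1, (iShard + 1) * shardSize)
  }

  // File names built the same way as in the script, joined by spaces.
  val shardPaths: String =
    ranges.map { case (start, end) => s"sharded.$start-$end.vcf" }.mkString(" ")
}
```

The first range is 1 to 10000000 and the last is 990000001 to 1000000000, which still fits in an Int.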

Running via Uger

To run commands on Uger, wrap them in a uger { ... } or ugerWith { ... } block. Commands wrapped in uger { ... } will run with default Uger settings (1 CPU and 1 GB of RAM requested, a maximum run time of 2 hours, and the "broad" queue):

uger {
  cmd"foo --in $bar --out $baz"

  for(i <- 1 to 10) {
    cmd"blah --foo $i"
  }
}

Here, all 11 commands declared inside the uger { ... } block will run on Uger with the same (default) settings.

To run commands on Uger with custom, per-command settings, use ugerWith(...) { ... }, for example:

ugerWith(cores = 4, mem = 6, maxRunTime = 8) {
  cmd"foo --in $bar --out $baz"

  for(i <- 1 to 10) {
    cmd"blah --foo $i"
  }
}

Here, the commands declared inside the ugerWith block will request 4 cores per job and 6 GB of RAM per core from Uger, and will run for at most 8 hours before being killed by Uger. The parameters cores, mem, and maxRunTime are all optional. If they are omitted, default values of 1, 1, and 2, respectively, will be used (the same defaults as for uger { ... }).
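
Because memory is requested per core, the total per job is the product of cores and mem. The following plain-Scala sketch shows the defaults and that arithmetic; Settings and totalMemGb are hypothetical names for illustration, not LoamStream types.

```scala
object UgerSettingsSketch {
  // Hypothetical settings record; the defaults are the ones stated in the text.
  final case class Settings(cores: Int = 1, mem: Int = 1, maxRunTime: Int = 2) {
    // mem is in GB per core, so the total request grows with the core count.
    def totalMemGb: Int = cores * mem
  }

  val defaults = Settings()                                 // 1 core, 1 GB, 2 hours
  val custom = Settings(cores = 4, mem = 6, maxRunTime = 8) // 4 * 6 = 24 GB in total
  val partial = Settings(cores = 4)                         // mem and maxRunTime keep defaults
}
```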

Loam is really Scala

Ok, we admit it: Loam is really Scala. More precisely, each Loam script is a fragment of Scala code, which is compiled by embedding it into an object definition that extends a special trait and contains a bunch of imports and implicit variables to enable the concise core Loam commands and to dispense with unnecessary clutter.

You don't know Scala? No worries. Loam is just Loam, easy to use.

You do know Scala? Awesome, use any Scala feature to make your pipeline definitions more powerful.
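
For readers who know Scala, here is a minimal sketch of how syntax like cmd"..." can be made legal with a custom string interpolator. It is not LoamStream's actual implementation: ShellJob, CmdStringContext, and the rule that each line break becomes a single space are assumptions made for illustration.

```scala
object CmdInterpolatorSketch {
  // Hypothetical stand-in for a job; it only records the command line.
  final case class ShellJob(commandLine: String)

  // Adding a method named cmd to StringContext is what makes cmd"..." compile.
  implicit class CmdStringContext(sc: StringContext) {
    def cmd(args: Any*): ShellJob = {
      // Interleave the literal parts with the inserted values.
      val raw = sc.parts
        .zipAll(args.map(_.toString), "", "")
        .map { case (part, arg) => part + arg }
        .mkString
      // Assumption: a line break and the blanks around it collapse to one space.
      ShellJob(raw.trim.split("\\s*\\n\\s*").mkString(" "))
    }
  }

  val year = 2017
  val job: ShellJob = cmd"""
    echo
    "Hello, World, in the year $year!"
  """
}
```

In this sketch, job.commandLine is a single line with the value of year already inserted.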