Configuration

loamstream.conf

LoamStream (LS) has various configuration options, most of which are exposed through a config file rather than directly on the command line. The file's location is specified with the --conf command-line parameter, for example --conf path/to/foo.conf. The file is expected to be in HOCON format with a LoamStream-specific structure. It can be named anything you want, but the convention, which this documentation follows, is to name it loamstream.conf.

Getting Started

To get started running jobs locally, you can forgo a loamstream.conf file and use the defaults. To do anything more (use Uger, control how many times jobs are restarted if they fail, run Hail jobs at Google, etc.), you'll need to provide a loamstream.conf file.
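
As a minimal sketch, a loamstream.conf that changes only the retry limit could look like the following. The value 3 is illustrative, and every key that is left out keeps its default.

```hocon
// Minimal loamstream.conf: override one execution setting, keep all other defaults
loamstream {
  execution {
    maxRunsPerJob = 3 // run each job at most 3 times (illustrative value)
  }
}
```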

Example loamstream.conf

loamstream {

  execution {
    maxRunsPerJob = 42
    outputDir = path/to/output/dir
    maxWaitTimeForOutputs = "1 minute"
    outputPollingFrequencyInHz = 42.42
  }

  uger {
    workDir = /path/to/uger/work/dir
    maxNumJobs = 2000
    defaultCores = 2 //per job
    defaultMemoryPerCore = 4 //in GB
    defaultMaxRunTime = 3 //in hours
  }

  googlecloud {
    hail {
      //gs:// URIs are quoted: in an unquoted HOCON value, "//" starts a comment
      jar = "gs://uri/of/hail/jar"
      zip = "gs://uri/of/hail/zip/file"
      scriptDir = path/to/dir/for/generated/hail/driver/scripts
    }
    gcloudBinary = /path/to/gcloud
    gsutilBinary = path/to/gsutil
    projectId = "some-project-id"
    clusterId = "some-cluster-id-to-create"
    credentialsFile = /path/to/credentials/file
    zone = "..."
    masterMachineType = "..."
    masterBootDiskSize = 42 //in GB
    numWorkers = 42
    workerMachineType = "..."
    workerBootDiskSize = 42 //in GB
    numPreemptibleWorkers = 42
    preemptibleWorkerBootDiskSize = 42 //in GB
    imageVersion = "1.2.3.4.5.6"
    scopes = "..."
    properties = "..."
    initializationActions = "..."
  }

  python {
    binary = path/to/python/binary
    scriptDir = dir/for/generated/python/scripts
  }

  r {
    binary = path/to/R/binary
    scriptDir = dir/for/generated/R/scripts
  }
}

All of the keys under loamstream - execution, uger, googlecloud, python, and r - are optional, but omitting them will prevent some things from running. Commands won't be run on Uger without a valid uger { ... } block, for example.

Reference
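
The keys in this section are written as dotted paths. In HOCON, a dotted path is shorthand for nested objects, so the two forms sketched here set the same value (the value 3 is illustrative, and a real file would use only one of the two forms):

```hocon
// Nested form, as used in the example loamstream.conf
loamstream {
  execution {
    maxRunsPerJob = 3
  }
}

// Equivalent dotted-path form
loamstream.execution.maxRunsPerJob = 3
```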

loamstream.execution.maxRunsPerJob:

The maximum number of times to run a job. If a job fails and this value is greater than 1, the job will be restarted until it either succeeds or has been run this many times.
Type: Int
Optional; default is 4

loamstream.execution.outputDir:

Directory to store jobs' standard output and standard error streams.
Type: Path
Optional; default is "<current directory>/job-outputs"

loamstream.execution.maxWaitTimeForOutputs:

The maximum amount of time to wait for a job's outputs to appear before declaring the job a failure. This is particularly useful when running jobs on Uger, where files written by jobs running on other machines won't be immediately visible to the machine running LS.
Type: HOCON duration
Optional; default is 30 seconds
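
A HOCON duration is a number followed by a unit. The sketch below shows a few equivalent spellings, assuming LS reads this key with the standard HOCON duration parser. The values are illustrative, and only one assignment would appear in a real file.

```hocon
loamstream {
  execution {
    // Wait up to two minutes for outputs to become visible,
    // for example on a slow shared file system
    maxWaitTimeForOutputs = "2 minutes"

    // Other spellings of the same duration:
    // maxWaitTimeForOutputs = "120 seconds"
    // maxWaitTimeForOutputs = 120s
  }
}
```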

loamstream.execution.outputPollingFrequencyInHz:

The frequency at which to poll the file system when waiting for job outputs to appear.
Type: Double
Optional; default is 0.1 (one poll every 10 seconds)
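
The polling interval is the reciprocal of this value: 0.1 Hz means one check every 10 seconds, and 0.5 Hz means one check every 2 seconds. A sketch that polls more often than the default (illustrative values):

```hocon
loamstream {
  execution {
    maxWaitTimeForOutputs = "1 minute"
    outputPollingFrequencyInHz = 0.5 // 1 / 0.5 Hz = one check every 2 seconds
  }
}
```

With these settings, and assuming polls are evenly spaced, LS would check for a missing output roughly 30 times (60 seconds x 0.5 Hz) before declaring the job a failure.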

loamstream.uger.workDir:

The directory in which Uger-related files (shell scripts generated by LS, job output streams temporarily used by Uger) will be written.
Type: Path
Required

loamstream.uger.maxNumJobs:

The maximum number of Uger jobs to run concurrently.
Type: Int
Optional; default is 2000

loamstream.uger.defaultCores:

The number of CPU cores to request for each job, if not otherwise specified.
Type: Int
Optional; default is 1

loamstream.uger.defaultMemoryPerCore:

The amount of memory to request per CPU core, in GB.
Type: Int
Optional; default is 1

loamstream.uger.defaultMaxRunTime:

The maximum time, in hours, Uger should allow a job to run before killing it.
Type: Int
Optional; default is 2
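
Only workDir is required in the uger block. The sketch below sets it and overrides the per-job resource defaults. The path is the placeholder from the example loamstream.conf, and the numbers are illustrative rather than recommendations.

```hocon
loamstream {
  uger {
    workDir = /path/to/uger/work/dir // required
    defaultCores = 2                 // cores requested per job
    defaultMemoryPerCore = 4         // GB per core
    defaultMaxRunTime = 6            // hours before Uger kills the job
  }
}
```

Because memory is requested per core, a job with these settings would presumably request 2 x 4 = 8 GB in total, and raising defaultCores raises the total memory request along with it.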

loamstream.googlecloud.hail.jar:

The URI of the Hail jar file to run at Google. Must be accessible via a gs:// URI from a Google Cloud Storage bucket.
Type: URI (must be gs://)
Required

loamstream.googlecloud.hail.zip:

The URI of the Hail zip file to run at Google. Must be accessible via a gs:// URI from a Google Cloud Storage bucket.
Type: URI (must be gs://)
Required

loamstream.googlecloud.hail.scriptDir:

The directory in which to write Python driver scripts generated by LS for Hail jobs.
Type: Path
Optional; default is the current directory.

loamstream.googlecloud.gcloudBinary:

The path to the gcloud binary from the Google Cloud SDK.
Type: Path
Required (though see /humgen/diabetes/users/dig/loamstream/google-cloud-sdk/bin on the Broad FS.)

loamstream.googlecloud.gsutilBinary:

The path to the gsutil binary from the Google Cloud SDK.
Type: Path
Required (though see /humgen/diabetes/users/dig/loamstream/google-cloud-sdk/bin on the Broad FS.)

loamstream.googlecloud.projectId:

The ID of a Google Cloud project to use. Must be defined before running any jobs. This is used for billing purposes, and must be associated with a Broad cost object.
Type: String
Required

loamstream.googlecloud.clusterId:

The name of the Google Cloud Dataproc cluster to create when running Hail jobs. LS will create and destroy this cluster, so it shouldn't be one that already exists.
Type: String
Required (see Google's naming conventions; cluster ids must contain only lowercase letters and dashes.)

loamstream.googlecloud.credentialsFile:

Path to a JSON file containing default credentials when running Hail jobs at Google.
Type: Path
Required (though see /humgen/diabetes/users/dig/google_credential.json on the Broad FS.)

Note: See the Hail docs for details of and recommendations for Google Cloud dataproc cluster setup.
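
Taken together, the required keys above give the smallest googlecloud block that should be accepted. This is a sketch that reuses the placeholder values from the example loamstream.conf; none of them are real paths or IDs.

```hocon
loamstream {
  googlecloud {
    hail {
      // Quoted, since an unquoted "//" would start a HOCON comment
      jar = "gs://uri/of/hail/jar"
      zip = "gs://uri/of/hail/zip/file"
    }
    gcloudBinary = /path/to/gcloud
    gsutilBinary = /path/to/gsutil
    projectId = "some-project-id"
    clusterId = "some-cluster-id-to-create" // lowercase letters and dashes only
    credentialsFile = /path/to/credentials/file
  }
}
```

All remaining googlecloud keys are optional and fall back to the defaults listed below.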

loamstream.googlecloud.zone:

Google Cloud zone to create a cluster in.
Type: String
Optional; default is "us-central1-b"

loamstream.googlecloud.masterMachineType:

Machine type for the master machine in the dataproc cluster to be created by LS. Must be one of the machine types enumerated by Google, which can be viewed in the Google Cloud web UI.
Type: String
Optional, default is "n1-standard-1"

loamstream.googlecloud.masterBootDiskSize:

Size of the boot disk, in GB, of the master machine in the dataproc cluster to be created by LS.
Type: Int
Optional, default is 20.

loamstream.googlecloud.numWorkers:

Number of worker nodes in the dataproc cluster to be created by LS.
Type: Int
Optional, default is 2.

loamstream.googlecloud.workerMachineType:

Machine type for the worker machines in the dataproc cluster to be created by LS. Must be one of the machine types enumerated by Google, which can be viewed in the Google Cloud web UI.
Type: String
Optional, default is "n1-standard-1"

loamstream.googlecloud.workerBootDiskSize:

Size of the boot disk, in GB, of each worker machine in the dataproc cluster to be created by LS.
Type: Int
Optional, default is 20.

loamstream.googlecloud.numPreemptibleWorkers:

Number of preemptible worker nodes in the dataproc cluster to be created by LS.
Type: Int
Optional, default is 0.

loamstream.googlecloud.preemptibleWorkerBootDiskSize:

Size of the boot disk, in GB, of each preemptible worker machine in the dataproc cluster to be created by LS.
Type: Int
Optional, default is 20.
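
The sizing keys above can be combined to describe a larger cluster than the default. The sketch below is illustrative only: the machine type n1-standard-8 and all of the counts and disk sizes are example values, not recommendations from the LS or Hail documentation.

```hocon
loamstream {
  googlecloud {
    // Required keys (hail, gcloudBinary, gsutilBinary, projectId,
    // clusterId, credentialsFile) omitted here for brevity

    masterMachineType = "n1-standard-8"
    masterBootDiskSize = 100           // GB

    numWorkers = 4
    workerMachineType = "n1-standard-8"
    workerBootDiskSize = 50            // GB

    numPreemptibleWorkers = 8
    preemptibleWorkerBootDiskSize = 50 // GB
  }
}
```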

loamstream.googlecloud.imageVersion:

Version, in <major>.<minor>.<patch> format, of the VM image to use for machines that are part of the dataproc cluster to be created by LS.
Type: String
Optional, default is "1.1.49"

loamstream.googlecloud.scopes:

Google Cloud auth scope
Type: String (a URI)
Optional, default is "https://www.googleapis.com/auth/cloud-platform"

loamstream.googlecloud.properties:

Extra Spark properties to be specified when running Hail jobs.
Type: String
Optional, default is
"spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,spark:spark.driver.memory=45g,spark:spark.driver.maxResultSize=30g,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,hdfs:dfs.replication=1"

loamstream.googlecloud.initializationActions:

URI of a shell script to run on each machine in the dataproc cluster created by LS, after that machine is created.
Type: String (gs:// URI)
Optional, default is "gs://loamstream/hail/hail-init.sh"

loamstream.python.binary:

Path to the Python interpreter binary, to be used when specifying embedded Python scripts in .loam files.
Type: Path
Optional, default is "python". (Whatever is on the user's path is used.)

loamstream.python.scriptDir:

Path of the directory into which embedded Python scripts specified in .loam files will be written.
Type: Path
Optional, default is the current working directory.

loamstream.r.binary:

Path to the R binary, to be used when specifying embedded R scripts in .loam files.
Type: Path
Optional, default is "R". (Whatever is on the user's path is used.)

loamstream.r.scriptDir:

Path of the directory into which embedded R scripts specified in .loam files will be written.
Type: Path
Optional, default is the current working directory.
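
A sketch of the python and r blocks that pins specific interpreters and keeps generated scripts out of the working directory. All paths are hypothetical placeholders.

```hocon
loamstream {
  python {
    binary = /path/to/python/binary     // instead of whatever "python" is on the path
    scriptDir = generated/python/scripts
  }
  r {
    binary = /path/to/R/binary          // instead of whatever "R" is on the path
    scriptDir = generated/R/scripts
  }
}
```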
