LoamStream has various configuration options, most of which are exposed through a config file rather than directly on the command line. The file's location is specified with the --conf command-line parameter, for example --conf path/to/foo.conf. The file is expected to be in HOCON format with a LoamStream-specific structure. It can have any name, but the convention is loamstream.conf, which is the name this documentation uses.
To get started running jobs locally, you can forgo a loamstream.conf file and use the defaults. To do anything more (use Uger, control how many times failed jobs are restarted, run Hail jobs at Google, etc.) you'll need to provide a loamstream.conf file.
loamstream {
  execution {
    maxRunsPerJob = 42
    outputDir = path/to/output/dir
    maxWaitTimeForOutputs = "1 minute"
    outputPollingFrequencyInHz = 42.42
  }
  uger {
    workDir = /path/to/uger/work/dir
    maxNumJobs = 2000
    defaultCores = 2 //per job
    defaultMemoryPerCore = 4 //in GB
    defaultMaxRunTime = 3 //in hours
  }
  googlecloud {
    hail {
      //gs:// URIs must be quoted: unquoted HOCON strings cannot contain ':' or '//'
      jar = "gs://uri/of/hail/jar"
      zip = "gs://uri/of/hail/zip/file"
      scriptDir = path/to/dir/for/generated/hail/driver/scripts
    }
    gcloudBinary = /path/to/gcloud
    gsutilBinary = /path/to/gsutil
    projectId = "some-project-id"
    clusterId = "some-cluster-id-to-create"
    credentialsFile = /path/to/credentials/file
    zone = "..."
    masterMachineType = "..."
    masterBootDiskSize = 42 //in GB
    numWorkers = 42
    workerMachineType = "..."
    workerBootDiskSize = 42 //in GB
    numPreemptibleWorkers = 42
    preemptibleWorkerBootDiskSize = 42 //in GB
    imageVersion = "1.2.3"
    scopes = "..."
    properties = "..."
    initializationActions = "..."
  }
  python {
    binary = path/to/python/binary
    scriptDir = dir/for/generated/python/scripts
  }
  r {
    binary = path/to/R/binary
    scriptDir = dir/for/generated/R/scripts
  }
}
All of the keys under loamstream (execution, uger, googlecloud, python, and r) are optional, but omitting a block can prevent the features that depend on it from running. For example, commands won't be run on Uger without a valid uger { ... } block.
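As a minimal sketch, a loamstream.conf that only tunes local execution could omit every other block. HOCON also accepts dotted paths as keys, so the dotted names used in this documentation can be written directly. The value shown is illustrative, not a recommendation.

```hocon
# Illustrative value only; every omitted key keeps its default.
loamstream {
  execution {
    maxRunsPerJob = 2
  }
}

# Equivalent form, using a dotted path as the key:
# loamstream.execution.maxRunsPerJob = 2
```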
loamstream.execution.maxRunsPerJob:
- The maximum number of times to run a job. If a job fails and this value is > 1, the job will be restarted until it succeeds or has been run this many times.
- Type: Int
- Optional; default is 4
loamstream.execution.outputDir:
- Directory to store jobs' standard output and standard error streams.
- Type: Path
- Optional; default is "<current directory>/job-outputs"
loamstream.execution.maxWaitTimeForOutputs:
- The maximum amount of time to wait for a job's outputs to appear before declaring the job a failure. This is particularly useful when running jobs on Uger, where files written by jobs running on other machines won't be immediately visible to the machine running LoamStream (LS).
- Type: HOCON duration
- Optional; default is 30 seconds
loamstream.execution.outputPollingFrequencyInHz:
- The frequency at which to poll the file system when waiting for job outputs to appear.
- Type: Double
- Optional; default is 0.1 (one poll every 10 seconds)
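The execution keys above can be combined as in the following sketch. All values and the output path are hypothetical. The polling frequency is given in Hz, so 0.5 means one poll every 2 seconds; with a 2-minute wait that allows roughly 60 polls before a job is declared a failure, assuming polls are evenly spaced.

```hocon
loamstream {
  execution {
    maxRunsPerJob = 3                    // run each job at most 3 times in total
    outputDir = "job-outputs"            // hypothetical relative path
    maxWaitTimeForOutputs = "2 minutes"  // HOCON duration; "120s" is another valid spelling
    outputPollingFrequencyInHz = 0.5     // 0.5 Hz = one poll every 2 seconds
  }
}
```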
loamstream.uger.workDir:
- The directory in which Uger-related files (shell scripts generated by LS, job output streams temporarily used by Uger) will be written.
- Type: Path
- Required
loamstream.uger.maxNumJobs:
- The maximum number of Uger jobs to run concurrently.
- Type: Int
- Optional; default is 2000
loamstream.uger.defaultCores:
- The number of CPU cores to request for each job, if not otherwise specified.
- Type: Int
- Optional; default is 1
loamstream.uger.defaultMemoryPerCore:
- The amount of memory to request per CPU core, in GB.
- Type: Int
- Optional; default is 1
loamstream.uger.defaultMaxRunTime:
- The maximum time, in hours, Uger should allow a job to run before killing it.
- Type: Int
- Optional; default is 2
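A uger block might look like the sketch below. Only workDir is required; the other values are illustrative overrides. The per-job memory figure in the comment assumes the total request is simply cores multiplied by memory per core.

```hocon
loamstream {
  uger {
    workDir = "/path/to/uger/work/dir"  // required
    defaultCores = 2                    // per job
    defaultMemoryPerCore = 4            // in GB, so 2 * 4 = 8 GB per job under the assumption above
    defaultMaxRunTime = 6               // in hours
    // maxNumJobs is omitted, so it keeps its default of 2000
  }
}
```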
loamstream.googlecloud.hail.jar:
- The URI of the Hail jar file to run at Google. Must be accessible via a gs:// URI from a Google Cloud Storage bucket.
- Type: URI (must be gs://)
- Required
loamstream.googlecloud.hail.zip:
- The URI of the Hail zip file to run at Google. Must be accessible via a gs:// URI from a Google Cloud Storage bucket.
- Type: URI (must be gs://)
- Required
loamstream.googlecloud.hail.scriptDir:
- The directory in which to write Python driver scripts generated by LS for Hail jobs.
- Type: Path
- Optional; default is the current directory.
loamstream.googlecloud.gcloudBinary:
- The path to the gcloud binary from the Google Cloud SDK.
- Type: Path
- Required (though see /humgen/diabetes/users/dig/loamstream/google-cloud-sdk/bin on the Broad FS.)
loamstream.googlecloud.gsutilBinary:
- The path to the gsutil binary from the Google Cloud SDK.
- Type: Path
- Required (though see /humgen/diabetes/users/dig/loamstream/google-cloud-sdk/bin on the Broad FS.)
loamstream.googlecloud.projectId:
- The ID of a Google Cloud project to use. Must be defined before running any jobs. This is used for billing purposes, and must be associated with a Broad cost object.
- Type: String
- Required
loamstream.googlecloud.clusterId:
- The name of the Google Cloud Dataproc cluster to create when running Hail jobs. LS will create and destroy this cluster, so it shouldn't be one that already exists.
- Type: String
- Required (see Google's naming conventions; cluster ids must contain only lowercase letters and dashes.)
loamstream.googlecloud.credentialsFile:
- Path to a JSON file containing default credentials when running Hail jobs at Google.
- Type: Path
- Required (though see /humgen/diabetes/users/dig/google_credential.json on the Broad FS.)
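As a sketch, a googlecloud block containing only the keys marked Required above could look like this. The values are the same placeholders used in the sample file, not real locations. The gs:// URIs are quoted because unquoted HOCON strings cannot contain ':' or '//' (the latter starts a comment).

```hocon
loamstream {
  googlecloud {
    hail {
      jar = "gs://uri/of/hail/jar"       // placeholder URI; must be quoted
      zip = "gs://uri/of/hail/zip/file"  // placeholder URI; must be quoted
    }
    gcloudBinary = "/path/to/gcloud"
    gsutilBinary = "/path/to/gsutil"
    projectId = "some-project-id"
    clusterId = "some-cluster-id-to-create"
    credentialsFile = "/path/to/credentials/file"
  }
}
```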
Note: See the Hail docs for details and recommendations on Google Cloud Dataproc cluster setup.
loamstream.googlecloud.zone:
- Google Cloud zone to create a cluster in.
- Type: String
- Optional; default is "us-central1-b"
loamstream.googlecloud.masterMachineType:
- Machine type for the master machine in the Dataproc cluster to be created by LS. Must be one of the machine types enumerated by Google, and viewable via the Google Cloud web UI.
- Type: String
- Optional, default is "n1-standard-1"
loamstream.googlecloud.masterBootDiskSize:
- Size of the boot disk, in GB, of the master machine in the dataproc cluster to be created by LS.
- Type: Int
- Optional, default is 20.
loamstream.googlecloud.numWorkers:
- Number of worker nodes in the dataproc cluster to be created by LS.
- Type: Int
- Optional, default is 2.
loamstream.googlecloud.workerMachineType:
- Machine type for the worker machines in the Dataproc cluster to be created by LS. Must be one of the machine types enumerated by Google, and viewable via the Google Cloud web UI.
- Type: String
- Optional, default is "n1-standard-1"
loamstream.googlecloud.workerBootDiskSize:
- Size of the boot disk, in GB, of each worker machine in the dataproc cluster to be created by LS.
- Type: Int
- Optional, default is 20.
loamstream.googlecloud.numPreemptibleWorkers:
- Number of preemptible worker nodes in the dataproc cluster to be created by LS.
- Type: Int
- Optional, default is 0.
loamstream.googlecloud.preemptibleWorkerBootDiskSize:
- Size of the boot disk, in GB, of each preemptible worker machine in the dataproc cluster to be created by LS.
- Type: Int
- Optional, default is 20.
loamstream.googlecloud.imageVersion:
- Version, in <major>.<minor>.<patch> format, of the VM image to use for machines that are part of the dataproc cluster to be created by LS.
- Type: String
- Optional, default is "1.1.49"
loamstream.googlecloud.scopes:
- The Google Cloud auth scope to use.
- Type: String (a URI)
- Optional, default is "https://www.googleapis.com/auth/cloud-platform"
loamstream.googlecloud.properties:
- Extra Spark properties to be specified when running Hail jobs.
- Type: String
- Optional, default is "spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,spark:spark.driver.memory=45g,spark:spark.driver.maxResultSize=30g,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,hdfs:dfs.replication=1"
loamstream.googlecloud.initializationActions:
- URI of a shell script to run on each machine in the Dataproc cluster created by LS after the machine is created.
- Type: String (gs:// URI)
- Optional, default is "gs://loamstream/hail/hail-init.sh"
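The optional cluster settings can be overridden selectively, as in this sketch. The numbers and the machine type are assumed examples rather than recommendations, and the required googlecloud keys are left out for brevity.

```hocon
loamstream {
  googlecloud {
    // Required keys (hail.jar, hail.zip, projectId, ...) omitted for brevity.
    numWorkers = 4
    workerMachineType = "n1-highmem-8"  // assumed example; must be a machine type offered by Google
    workerBootDiskSize = 50             // in GB
    numPreemptibleWorkers = 2
    // zone, imageVersion, scopes, etc. are omitted and keep their defaults
  }
}
```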
loamstream.python.binary:
- Path to the Python interpreter binary, to be used when specifying embedded Python scripts in .loam files.
- Type: Path
- Optional, default is "python". (Whatever is on the user's path is used.)
loamstream.python.scriptDir:
- Path of the directory into which embedded Python scripts specified in .loam files will be written.
- Type: Path
- Optional, default is the current working directory.
loamstream.r.binary:
- Path to the R binary, to be used when specifying embedded R scripts in .loam files.
- Type: Path
- Optional, default is "R". (Whatever is on the user's path is used.)
loamstream.r.scriptDir:
- Path of the directory into which embedded R scripts specified in .loam files will be written.
- Type: Path
- Optional, default is the current working directory.
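For example, to pin specific interpreters and keep generated scripts out of the working directory, the python and r blocks might be filled in as follows. All paths are hypothetical.

```hocon
loamstream {
  python {
    binary = "/usr/local/bin/python"  // hypothetical path; default is "python" from the user's path
    scriptDir = "generated/python"    // hypothetical directory
  }
  r {
    binary = "/usr/local/bin/R"       // hypothetical path; default is "R" from the user's path
    scriptDir = "generated/R"         // hypothetical directory
  }
}
```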