2019-10-04 APR

Cloud for Computational Genomics


NB: We're in 5E030 today. Fifth floor, east tower, 'round the corner from the breakroom outside the elevator.

Eric, it's your show. I'll start the GoToMeeting and call in from Utah. I'll ask Geeta to help with communications.

Since Jonas is out, I propose we radically deviate from the past few weeks and antagonize him sufficiently* to motivate a discussion about the realities of porting our own analyses to a serverless or server-lite model. I'll use the terms "workflow" and "pipeline" interchangeably here; both just mean a series of dependent analysis steps that together produce a complete analysis (e.g. read alignment or FastQC).

I made some slides on this, which should be publicly available here: https://docs.google.com/presentation/d/1LByAomELPmoN744qEsiMepQk2zxonZHZ7TAzhEB0YP4/edit#slide=id.p1



* well played, that's the spirit :-D !! (Jonas)

Agenda:

  • All: B.Y.O.C. - bring your own code. What program or workflow do you use all the time that you think belongs in the cloud? What workflows do you think will be hard to port to the cloud?
      • Is the code under version control?
      • Is the code public?
      • Are the required inputs (e.g. reference genomes, dbSNP, BED filter files) available from public-facing repositories?
      • Do you think this code is easy to port to the cloud (yes/no/unsure), and why?
      • What language is the code written in? How do users interact with it (CLI, API, hard-coded values)?
      • Ben: Not an endorsement, but FWIW, lots of folks have ported (WDL) workflows to FireCloud/Terra.
        • We went through a WDL example on the call, because it's what I've used and can easily explain. The abstraction I lay out below (where our workflow is decomposed into Inputs, Parameters, Execution, Environment, and Outputs) is roughly how WDL / CWL / Snakemake / Nextflow all work; a minimal WDL sketch follows this agenda item. Theory-wise, understanding the abstraction is important, but application-wise, WDL / CWL can get you up and running quickly!
          • Ben: IMO, using Snakemake and Nextflow to execute WDL/CWL-spec'd workflows is more tractable for most bioinformaticians.
            • BTW -- Johannes Köster is coming to visit the NIH campus in early May (keep an eye on bioinformatics/data science lists for announcements!)
              • DM Ben if you want to set up a meeting.
            • Evan Floden will likely also be coming to visit sometime soon!
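    Sketch, for reference: a minimal WDL task showing the five-part decomposition. This is illustrative, not production code -- the bwa command, file names, and container tag are assumptions, and a real task would also need to localize the BWA index files alongside the FASTA.

        version 1.0

        task align_reads {
          input {
            # Inputs: the data the task operates on
            File reads_fastq
            File reference_fasta
            # Parameters: tunable values, optionally with defaults
            Int threads = 4
          }

          # Execution: the script we would otherwise run by hand
          command <<<
            bwa mem -t ~{threads} ~{reference_fasta} ~{reads_fastq} > aligned.sam
          >>>

          # Environment: a container image pins the software stack, so none of
          # our cluster's module system has to be replicated on the cloud node
          runtime {
            docker: "biocontainers/bwa:v0.7.17_cv1"  # assumed tag; pin your own
            cpu: threads
          }

          # Outputs: the files the engine keeps once the task finishes
          output {
            File aligned_sam = "aligned.sam"
          }
        }

    Locally these five parts are implicit (files on a shared filesystem, modules loaded in your shell); WDL makes each one explicit, which is what lets an engine like Cromwell run the same task on a laptop, a cluster, or a cloud backend.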
  • Eric: Basics and important considerations when porting your code to the cloud. Let's dissect our workflows and consider how their components fit into the cloud model.
      • Let's abstract our analyses so that they have five parts: Inputs, Parameters, Environment, Execution, Outputs
        • What does this look like in a local compute environment?
        • What does this look like in the cloud?
        • How do we package our environment and execution (i.e. script) to run elsewhere?
      • Modularity
        • When building analyses, it is easier to work with modules (bite-sized pieces) rather than monolithic code chunks. How can we break our code up to make it easier to port to the cloud?
        • How can we break up our environment so that we can run our code without having to fully replicate our cluster environment in the cloud? (A modularity sketch follows below.)
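    Sketch, for reference: modularity in the same WDL terms (tool commands and container tags are again assumptions). Each bite-sized step becomes its own task with its own minimal container, so we never have to rebuild the full cluster environment in a single image.

        version 1.0

        task run_fastqc {
          input { File reads_fastq }
          command <<<
            fastqc --outdir . ~{reads_fastq}
          >>>
          # Only FastQC lives in this image; nothing else comes along
          runtime { docker: "biocontainers/fastqc:v0.11.9_cv8" }  # assumed tag
          output {
            # FastQC writes <basename>_fastqc.html and .zip next to the input
            Array[File] reports = glob("*_fastqc*")
          }
        }

        task sort_alignments {
          input { File sam }
          command <<<
            samtools sort -o sorted.bam ~{sam}
          >>>
          # A separate, samtools-only image for the next step
          runtime { docker: "biocontainers/samtools:v1.9-4-deb_cv1" }  # assumed tag
          output { File sorted_bam = "sorted.bam" }
        }

    Because each task declares only what it needs, porting becomes containerizing one tool at a time instead of snapshotting the whole cluster.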
  • Ben: Inre the question at the top of (Eric's) slide 16: what if you embedded something like (not necessarily this exact thing) https://github.com/ncbi/ngs into the first tool (the mapper) in your workflow? You could then stream data directly from a variety of locations into your compute node. In a perfect world, you could do this in parallel across a number of cloud environments and then integrate the results for something like joint variant calling or TPM-based RNA-seq. (Eric's response: this is super cool!!! How easy is it to hit raw REST endpoints, and can the API also work on local files?)
    • The REST situation is going to change a lot with the migration of files to cloud storage. That said, inre local + remote, check out magicBLAST: it can operate on SRA accessions and local FASTQ with ease (so can HISAT2). A streaming sketch follows below.
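    Sketch, for reference: a streaming task along these lines (hypothetical task and image names; the -sra, -db, -num_threads, and -out flags are from magicBLAST's CLI). The reads are pulled straight from SRA into the aligner, so no FASTQ is ever localized.

        version 1.0

        task magicblast_sra {
          input {
            String sra_accession   # an SRR run ID; no local FASTQ needed
            File blastdb_tar       # pre-built BLAST db (makeblastdb), tarred
            String db_name
            Int threads = 4
          }
          command <<<
            tar xf ~{blastdb_tar}
            # -sra streams reads directly from NCBI into the aligner
            magicblast -sra ~{sra_accession} -db ~{db_name} \
              -num_threads ~{threads} -out hits.sam
          >>>
          runtime { docker: "ncbi/magicblast" }  # assumed image; pin a tag in practice
          output { File hits_sam = "hits.sam" }
        }

    For Eric's local-files question: magicBLAST also accepts local reads (-query, with -infmt fastq), so one task could cover both cases.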