2019-10-04 APR

Cloud for Computational Genomics


NB: We're in 5E030 today. Fifth floor, east tower, 'round the corner from the breakroom outside the elevator.

Eric, it's your show. I'll start the GoToMeeting and call in from Utah. I'll ask Geeta to help with communications.

Since Jonas is out, I propose we radically deviate from the past few weeks and antagonize him sufficiently* to motivate a discussion about the realities of porting our own analyses to a serverless or server-lite model. I'll use the terms "workflow" and "pipeline" interchangeably here; both just mean a series of dependent analysis steps that together produce a complete analysis (e.g. read alignment or FastQC).

I made some slides on this, which should be publicly available here: https://docs.google.com/presentation/d/1LByAomELPmoN744qEsiMepQk2zxonZHZ7TAzhEB0YP4/edit#slide=id.p1



* well played, that's the spirit :-D !! (Jonas)

Agenda:

  • All: B.Y.O.C. - bring your own code. What program or workflow do you use all the time that you think belongs in the cloud? What workflows do you think will be hard to port to the cloud?
      • Is the code under version control?
      • Is the code public?
      • Are the required inputs (e.g. reference genomes, dbSNP, BED filter files) available from public-facing repositories?
      • Do you think this code is easy to port to the cloud (yes/no/unsure), and why?
      • What language is the code written in? How do users interact with it (CLI, API, hard-coded values)?
      • Ben: Not an endorsement, but FWIW, lots of folks have ported (WDL) workflows to FireCloud/Terra.
        • We went through a WDL example on the call, because it's what I've used and can easily explain. The abstraction I lay out below (where our workflow is decomposed into Inputs, Parameters, Execution, Environment, and Outputs) is roughly how WDL / CWL / Snakemake / Nextflow all work; a minimal WDL sketch follows this agenda item. Theory-wise, understanding the abstraction is important, but application-wise, WDL / CWL can get you up and running quickly!
          • Ben: IMO, using Snakemake and Nextflow to execute WDL/CWL-spec'd workflows is more tractable for most bioinformaticians.
            • BTW -- Johannes Köster is coming to visit the NIH campus in early May (keep an eye on bioinformatics/data science lists for announcements!)
              • DM Ben if you want to set up a meeting.
            • Evan Floden will likely also be coming to visit sometime soon!
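    Sketch, for reference: a minimal WDL task showing the five-part decomposition. This is illustrative, not production code -- the bwa command, file names, and container tag are assumptions, and a real task would also need to localize the BWA index files alongside the FASTA.

        version 1.0

        task align_reads {
          input {
            # Inputs: the data the task operates on
            File reads_fastq
            File reference_fasta
            # Parameters: tunable values, optionally with defaults
            Int threads = 4
          }

          # Execution: the script we would otherwise run by hand
          command <<<
            bwa mem -t ~{threads} ~{reference_fasta} ~{reads_fastq} > aligned.sam
          >>>

          # Environment: a container image pins the software stack, so none of
          # our cluster's module system has to be replicated on the cloud node
          runtime {
            docker: "biocontainers/bwa:v0.7.17_cv1"  # assumed tag; pin your own
            cpu: threads
          }

          # Outputs: the files the engine keeps once the task finishes
          output {
            File aligned_sam = "aligned.sam"
          }
        }

    Locally these five parts are implicit (files on a shared filesystem, modules loaded in your shell); WDL makes each one explicit, which is what lets an engine like Cromwell run the same task on a laptop, a cluster, or a cloud backend.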
  • Eric: Basics and important considerations when porting your code to the cloud. Let's dissect our workflows and consider how their components fit into the cloud model.
      • Let's abstract our analyses so that they have five parts: Inputs, Parameters, Environment, Execution, Outputs
        • What does this look like in a local compute environment?
        • What does this look like in the cloud?
        • How do we package our environment and execution (i.e. script) to run elsewhere?
      • Modularity
        • When building analyses, it is easier to work with modules (bite-sized pieces) rather than monolithic code chunks. How can we break our code up to make it easier to port to the cloud?
        • How can we break up our environment so that we can run our code without having to fully replicate our cluster environment in the cloud? (A modularity sketch follows below.)
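    Sketch, for reference: modularity in the same WDL terms (tool commands and container tags are again assumptions). Each bite-sized step becomes its own task with its own minimal container, so we never have to rebuild the full cluster environment in a single image.

        version 1.0

        task run_fastqc {
          input { File reads_fastq }
          command <<<
            fastqc --outdir . ~{reads_fastq}
          >>>
          # Only FastQC lives in this image; nothing else comes along
          runtime { docker: "biocontainers/fastqc:v0.11.9_cv8" }  # assumed tag
          output {
            # FastQC writes <basename>_fastqc.html and .zip next to the input
            Array[File] reports = glob("*_fastqc*")
          }
        }

        task sort_alignments {
          input { File sam }
          command <<<
            samtools sort -o sorted.bam ~{sam}
          >>>
          # A separate, samtools-only image for the next step
          runtime { docker: "biocontainers/samtools:v1.9-4-deb_cv1" }  # assumed tag
          output { File sorted_bam = "sorted.bam" }
        }

    Because each task declares only what it needs, porting becomes containerizing one tool at a time instead of snapshotting the whole cluster.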
  • Ben: Inre the question at the top of (Eric's) slide 16: what if you embedded something like (not necessarily this exact thing) https://github.com/ncbi/ngs into the first tool (the mapper) in your workflow? You could then stream data directly from a variety of locations into your compute node. In a perfect world, you could do this in parallel across a number of cloud environments and then integrate the results for something like joint variant calling or TPM-based RNA-seq. (Eric's response: this is super cool!!! How easy is it to hit raw REST endpoints, and can the API also work on local files?)
    • The REST situation is going to change a lot with the migration of files to cloud storage. That said, inre local + remote, check out magicBLAST: it can operate on SRA accessions and local FASTQ with ease (so can HISAT2). A streaming sketch follows below.
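    Sketch, for reference: a streaming task along these lines (hypothetical task and image names; the -sra, -db, -num_threads, and -out flags are from magicBLAST's CLI). The reads are pulled straight from SRA into the aligner, so no FASTQ is ever localized.

        version 1.0

        task magicblast_sra {
          input {
            String sra_accession   # an SRR run ID; no local FASTQ needed
            File blastdb_tar       # pre-built BLAST db (makeblastdb), tarred
            String db_name
            Int threads = 4
          }
          command <<<
            tar xf ~{blastdb_tar}
            # -sra streams reads directly from NCBI into the aligner
            magicblast -sra ~{sra_accession} -db ~{db_name} \
              -num_threads ~{threads} -out hits.sam
          >>>
          runtime { docker: "ncbi/magicblast" }  # assumed image; pin a tag in practice
          output { File hits_sam = "hits.sam" }
        }

    For Eric's local-files question: magicBLAST also accepts local reads (-query, with -infmt fastq), so one task could cover both cases.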