... from Wednesday’s Workshop; Please see email with useful links and, highly recommended, stay in touch via this forum's gitter https://gitter.im/episphere/Fair.
Bhaumik ...
I can see why people handling VCF, MAF etc would find it convenient, but if you’re trying to avoid hosting a query engine, and just store a file in Box ... I understand an argument for the managed service BigQuery instead of having to host a query engine ourselves. Am I missing something?
Yes, it’s just that a lot of the big-data genomics packages work really well with Parquet, e.g. https://github.com/bigdatagenomics/adam, https://glow.readthedocs.io/en/latest/ and https://hail.is/ (not a fan of hail)
Parquet works well with AWS S3 (you can query it directly with S3 Select) and can be imported into AWS Athena (where you index on a column for quick search)
DNANexus VCF Loader also ingests VCF into Parquet https://documentation.dnanexus.com/user/spark/example-applications/vcf-loader
In the past we also built our own genomics data DB using Apache Impala, which uses Parquet, and it works reasonably well. I have no opinion on the technology itself, but there are just a lot of genomics tools that are built around Parquet files
Wow, Dask supports Parquet :O. I am starting to not mind it anymore :D. Note that Dask also supports HDF5, Zarr, and Tensorflow Records.
While Wendy sheds light on it, I think since Dask allows range requests, my guess is that it does support it on the server side. So far, only Zarr seems ideally designed for browser applications.
Note all of these formats and engines at some level descend from BigTable, to take advantage of map-reduce processing …
The connection to Genome analysis is a key enabler of NextGen sequencing.
Key practical question from the distributed Web computing types: can we do well enough by just storing and governing static columnar files in any of these formats in a consumer-facing middle layer like Box or GDrive, or do we need a server-side engine, like BigQuery?
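To make the "static file, no engine" option concrete, here is a minimal sketch of the first step a client would take against a Parquet file it can only reach through byte-range reads. It relies on one fact from the Parquet format: the file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The `read_range` callable and the in-memory `fake_file` are hypothetical stand-ins (a real client would wrap an HTTP `Range` request); whether Box or GDrive honour such requests for a given link is an assumption to check, not something shown here.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def range_header(start, end):
    """HTTP header asking for bytes start..end (inclusive) of a remote file."""
    return {"Range": f"bytes={start}-{end}"}


def footer_byte_range(file_size, read_range):
    """Locate the Parquet footer metadata using one small ranged read.

    read_range(start, end) is a hypothetical callable returning the bytes
    start..end inclusive, e.g. by sending range_header(start, end).
    Returns the inclusive (start, end) offsets of the footer metadata.
    """
    # Last 8 bytes: 4-byte little-endian footer length + b"PAR1".
    tail = read_range(file_size - 8, file_size - 1)
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing trailing PAR1 magic; not a Parquet file?")
    (footer_len,) = struct.unpack("<I", tail[:4])
    start = file_size - 8 - footer_len
    if start < len(PARQUET_MAGIC):
        raise ValueError("footer length is larger than the file")
    return start, file_size - 9


# Synthetic stand-in for a remote file: magic, fake data, fake footer, trailer.
fake_footer = b"F" * 20
fake_file = (
    PARQUET_MAGIC + b"x" * 100 + fake_footer
    + struct.pack("<I", len(fake_footer)) + PARQUET_MAGIC
)


def read_fake(start, end):
    """Serve inclusive byte ranges from the in-memory stand-in."""
    return fake_file[start:end + 1]


footer_start, footer_end = footer_byte_range(len(fake_file), read_fake)
footer_bytes = read_fake(footer_start, footer_end)
```

The footer is what tells a reader where each column chunk lives, so a client needs at least two round trips (trailer, then footer) before it can fetch any data. That is workable when the host serves range requests cheaply, and painful when it does not.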
I agree with Jeya… and I think Parquet may not work that well for range requests against a file sitting in Box. The way to speed up the search is to index on a column: the system divides the file into multiple Parquet files, and the user searches the folder as if it’s one big file. It’s not so convenient to index on a different column to allow fast search
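A rough illustration of that trade-off, using only the Python standard library: CSV files stand in for Parquet files, and the `column=value` folder naming is an assumption borrowed from the common Hive-style layout. The helper names (`write_partitioned`, `query_partition`, `scan_all`) and the toy rows are made up for the example. The point is only that a search on the partition column touches one file, while a search on any other column has to open every file.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(rows, root, column):
    """Split rows into root/<column>=<value>/part-0.csv, one folder per value."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row)
    for value, group in groups.items():
        part_dir = Path(root) / f"{column}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)


def _read(files, keep):
    """Read the given files, returning (matching rows, number of files opened)."""
    hits = []
    for path in files:
        with open(path, newline="") as fh:
            hits.extend(row for row in csv.DictReader(fh) if keep(row))
    return hits, len(files)


def query_partition(root, column, value):
    """Search on the partition column: only the matching folder is opened."""
    files = sorted((Path(root) / f"{column}={value}").glob("*.csv"))
    return _read(files, lambda row: True)


def scan_all(root, column, value):
    """Search on any other column: every file in the folder tree is opened."""
    files = sorted(Path(root).glob("*/*.csv"))
    return _read(files, lambda row: row[column] == value)


# Toy variant-like rows (made up for the example), partitioned by chromosome.
rows = [
    {"chrom": "1", "pos": "100", "gene": "A"},
    {"chrom": "1", "pos": "200", "gene": "B"},
    {"chrom": "2", "pos": "150", "gene": "A"},
    {"chrom": "3", "pos": "300", "gene": "C"},
]
root = tempfile.mkdtemp()
write_partitioned(rows, root, "chrom")

by_chrom, files_for_chrom = query_partition(root, "chrom", "2")
by_gene, files_for_gene = scan_all(root, "gene", "A")
```

Making the `gene` search fast would mean writing a second copy of the data partitioned by `gene`, which is the inconvenience described above.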
So last question for the last 2 mins before live FAIR starts: since we really don’t want to be in the business of creating indexes in static files (we’d be recreating database systems), are we better off relying on bq?