2022-04-22 APR

Welcome FAIRiners ...

... from Wednesday’s Workshop. Please see the email with useful links and, highly recommended, stay in touch via this forum's Gitter: https://gitter.im/episphere/Fair

Melanoma tiling for AI update

Bhaumik ...

Data formats for distributed genomics.

I can see why people handling VCF, MAF, etc. would find Parquet convenient, but if you’re trying to avoid hosting a query engine and just store a file in Box ... I understand the argument for a managed service like BigQuery instead of having to host a query engine ourselves. Am I missing something?

Wendy Wong @shukwong

Yes, it’s just that a lot of the big data genomics packages work really well with Parquet, e.g. https://github.com/bigdatagenomics/adam https://glow.readthedocs.io/en/latest/ and https://hail.is/ (not a fan of Hail).

Parquet works well with AWS S3 (you can run S3 Select on it directly) and can be imported into AWS Athena (where you index on a column for quick search).

The DNAnexus VCF Loader also ingests VCF into Parquet: https://documentation.dnanexus.com/user/spark/example-applications/vcf-loader

In the past we also built our own genomics database using Apache Impala on top of Parquet, and it worked reasonably well. I have no opinion on the technology itself, but there are just a lot of genomics tools built around Parquet files.

Jeya Balaji Balasubramanian @jeyabbalas

Wow, Dask supports Parquet :O. I am starting to not mind it anymore :D. Note that Dask also supports HDF5, Zarr, and TensorFlow Records.

Until Wendy sheds more light on this: since Dask allows range requests, my guess is that Parquet does support them, at least on the server side. So far, only Zarr seems ideally designed for browser applications.
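To make the range-request point concrete, here is a minimal sketch of what a client without any server-side engine would have to do first: read the last 8 bytes of a Parquet file (a 4-byte little-endian footer length followed by the magic bytes PAR1), then request only the footer metadata. The fetch callable and the in-memory blob are hypothetical stand-ins for a real HTTP GET with a Range header; whether Box or GDrive honor such requests is an assumption that would need checking.

```python
import struct
from typing import Callable, Dict, Tuple

PARQUET_MAGIC = b"PAR1"  # a Parquet file starts and ends with these 4 bytes


def range_header(start: int, end: int) -> Dict[str, str]:
    """HTTP Range header for the inclusive byte span [start, end]."""
    return {"Range": f"bytes={start}-{end}"}


def locate_footer(file_size: int, fetch: Callable[[int, int], bytes]) -> Tuple[int, int]:
    """Return the inclusive byte span of the Parquet footer metadata.

    `fetch(start, end)` is a hypothetical callable returning bytes
    start..end inclusive, e.g. a GET sent with range_header(start, end).
    """
    tail = fetch(file_size - 8, file_size - 1)   # footer length + magic
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file: trailing magic bytes missing")
    footer_len = struct.unpack("<I", tail[:4])[0]
    footer_end = file_size - 9                   # last byte before the 8-byte tail
    footer_start = footer_end - footer_len + 1
    return footer_start, footer_end


# Stand-in for a remote file: magic, dummy data, dummy footer, length, magic.
fake_footer = b"column-chunk offsets would live here"
blob = (PARQUET_MAGIC + b"\x00" * 100 + fake_footer
        + struct.pack("<I", len(fake_footer)) + PARQUET_MAGIC)


def fetch_from_blob(start: int, end: int) -> bytes:
    """Simulates a ranged read against the in-memory blob."""
    return blob[start:end + 1]


start, end = locate_footer(len(blob), fetch_from_blob)
print(range_header(start, end))                   # header for the second request
print(fetch_from_blob(start, end) == fake_footer)
```

Only two small reads are needed before any column data is touched, which is why range-request support in the storage layer matters so much for this design.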

Jonas Almeida @jonasalmeida

Note that all of these formats and engines at some level descend from Bigtable, to take advantage of MapReduce processing …
The connection to genome analysis is a key enabler of NextGen sequencing.

Key practical question from the distributed Web computing types: can we do well enough by just storing and governing static columnar files in any of these formats in a consumer-facing middle layer like Box or GDrive, or do we need a server-side engine, like BigQuery?

Wendy Wong @shukwong

I agree with Jeya… and I think Parquet may not work that well with range requests against a file sitting in Box. The way to speed up search is to index on a column: the system divides the file into multiple Parquet files, and the user searches the folder as if it were one big file. It is not so convenient to then index on a different column to allow fast search on it.
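A rough sketch of the trade-off described above, using only the Python standard library. It writes CSV files instead of Parquet to stay dependency-free, so it illustrates the folder-per-value layout rather than Parquet itself; the rows, column names, and helper functions are all made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(rows, root: Path, key: str) -> None:
    """Split rows into one file per value of `key`, in `key=value/` folders."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    for value, group in groups.items():
        part_dir = root / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)


def query(root: Path, key: str, column: str, value: str):
    """Search the folder as one dataset; return (matching rows, files opened)."""
    if column == key:
        # Filter is on the partition column: open only the matching folder.
        files = sorted(root.glob(f"{key}={value}/*.csv"))
    else:
        # Filter is on any other column: every file has to be scanned.
        files = sorted(root.glob(f"{key}=*/*.csv"))
    hits = []
    for path in files:
        with open(path, newline="") as fh:
            hits.extend(r for r in csv.DictReader(fh) if r[column] == value)
    return hits, len(files)


# Made-up variant records, partitioned by chromosome.
variants = [
    {"chrom": "1", "pos": "12345", "gene": "GENE_A"},
    {"chrom": "1", "pos": "67890", "gene": "GENE_B"},
    {"chrom": "2", "pos": "11111", "gene": "GENE_A"},
    {"chrom": "X", "pos": "22222", "gene": "GENE_C"},
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_partitioned(variants, root, key="chrom")
    by_chrom, opened_chrom = query(root, "chrom", "chrom", "1")
    by_gene, opened_gene = query(root, "chrom", "gene", "GENE_A")

print(len(by_chrom), "rows from", opened_chrom, "file(s)")
print(len(by_gene), "rows from", opened_gene, "file(s)")
```

Searching on the partition column touches a single file, while searching on any other column scans all of them; getting fast search on gene would mean rewriting the whole folder with a different partition key.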

Jonas Almeida @jonasalmeida

So, last question for the last 2 minutes before live FAIR starts: since we really don’t want to be in the business of creating indexes in static files (we’d be recreating database systems), are we better off relying on BigQuery?

episphere.github.io/fair