Quick Start Part 3 Workspace > Data

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2018-01-07

Within the workspace, we use a data model to organize data and metadata. The data model includes predefined entity types (see below), relationships, and attributes. In other words, it is a formal description of the types of data entities you are working with and how they relate to each other.

The FireCloud data model supports the following basic entity types:

Participant - a person enrolled in a study.
Sample - a biological sample collected from a participant.
Pair - a pair of biological samples collected from a participant.
Set - a collection of participants, samples, or pairs.

Let's walk through an example. A participant enrolled in a study provides several samples, from which data is generated (e.g. gene panel, WGS, WES, and microbiome sequencing) and stored under the sample base type. Two of these samples (blood and tumor tissue) are used to generate exome sequence data, and together they form a Tumor-Normal pair.

The workspace stores all this information in separate tables, one per entity type (participant, sample, etc.). The data tables for the diagram shown above would look something like this:

You’ll notice that the sample and pair tables tie back to the participant ID, coordinating all three tables' worth of data based on the participant. The sample set table is a collection of sample IDs grouped together for a particular reason. In this example, one set was put together to represent all the samples for this participant. A participant ID isn’t needed here because the sample ID connects this table to the sample table, where the participant ID already exists. Sets can be used to group participants and pairs as well.
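The relationships described above can be sketched in a few lines of Python. This is only an illustration of how the tables refer to each other, not how FireCloud stores them; every ID and attribute name below (participant_1, sample_blood, case_sample, and so on) is a hypothetical placeholder.

```python
# Hypothetical tables, keyed by entity ID. All IDs and attribute
# names are made up for illustration.
participants = {"participant_1": {}}

# Each sample row points back to its participant.
samples = {
    "sample_blood": {"participant": "participant_1"},
    "sample_tumor": {"participant": "participant_1"},
}

# A pair row points to a participant and to two of that participant's samples.
pairs = {
    "pair_1": {
        "participant": "participant_1",
        "case_sample": "sample_tumor",
        "control_sample": "sample_blood",
    }
}

# A sample set lists sample IDs only; it carries no participant ID.
sample_sets = {"all_samples": ["sample_blood", "sample_tumor"]}


def participants_in_sample_set(set_id):
    """Resolve a sample set to participant IDs by going through the sample table."""
    return sorted({samples[sample_id]["participant"] for sample_id in sample_sets[set_id]})


def check_pairs():
    """Return the IDs of pairs whose samples do not belong to the pair's participant."""
    bad = []
    for pair_id, row in pairs.items():
        for key in ("case_sample", "control_sample"):
            if samples[row[key]]["participant"] != row["participant"]:
                bad.append(pair_id)
                break
    return bad
```

Calling participants_in_sample_set("all_samples") follows each sample ID to the sample table to find its participant, which is why the set table itself does not need a participant column.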

The FireCloud data model structure was originally designed to suit TCGA studies, but it can be, and has been, used to organize data for a variety of study types besides cancer analysis.

Data files live in Google buckets

So the data model helps organize your metadata. Now, what is the relationship between the data model and Google buckets? Well, whenever your data model needs to reference data or metadata that exists as a file or set of files, you store those files in a Google bucket and reference their location (gs://bucket-id/path-to-file) in the data model. The next question is therefore: which bucket do you put your data in?
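To make the gs://bucket-id/path-to-file form concrete, here is a minimal Python sketch that splits such a location into its bucket and file path. The helper name split_gs_url is hypothetical and is not part of FireCloud or any Google library; it relies only on the standard library.

```python
from urllib.parse import urlsplit


def split_gs_url(url):
    """Split a gs:// location into (bucket, path). Hypothetical helper."""
    parts = urlsplit(url)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"not a gs:// location: {url!r}")
    # The bucket ID sits where a hostname would be; the rest is the file path.
    return parts.netloc, parts.path.lstrip("/")


bucket, path = split_gs_url("gs://bucket-id/path-to-file")
```

A check like this can be useful before loading metadata, to catch attribute values that were meant to be file references but are not valid bucket locations.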

A workspace is backed by a Google bucket, primarily so that once a computation is done, the output files (see diagram) have a place to live. You can also use this bucket as a place to store input data for analysis -- or you can use links to inputs from other Google buckets.

For example, you don’t have to store your own copy of a genome reference FASTA; you can link to the publicly available version that the Broad Institute hosts. This saves you the storage cost and lets the research community keep a single copy of reference data rather than having multiple copies proliferate.

Note that for technical reasons that have to do with authentication but are too long to go into here, it is generally best to store your data in the workspace bucket itself, and to reference external buckets only if they are fully public. If you don't have that luxury, your proxy-group (found on your profile page) will need to be granted access to any private buckets you use.
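One way to act on this advice is to list which buckets, other than the workspace bucket, your file references point to, so you know where access may need to be checked. The sketch below is a hypothetical helper written for illustration; it does not call FireCloud or Google Cloud, and it cannot tell whether a bucket is public or private.

```python
def external_buckets(file_refs, workspace_bucket):
    """Return the sorted bucket IDs referenced outside the workspace bucket."""
    prefix = "gs://"
    found = set()
    for ref in file_refs:
        if not ref.startswith(prefix):
            continue  # not a bucket location, e.g. a plain metadata value
        bucket = ref[len(prefix):].split("/", 1)[0]
        if bucket and bucket != workspace_bucket:
            found.add(bucket)
    return sorted(found)


# Hypothetical bucket names and paths, for illustration only.
refs = [
    "gs://my-workspace-bucket/sample_blood.bam",
    "gs://some-public-bucket/reference.fasta",
    "gs://collaborator-bucket/sample_tumor.bam",
    "tumor",
]
to_check = external_buckets(refs, "my-workspace-bucket")
```

Each bucket in the result either needs to be fully public or needs to have access granted to your proxy-group.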

Additional Resources

(howto) Import metadata
- Templates for uploading metadata (Step 4) into the data model
(howto) Overwrite and delete data from the data model
(howto) Upload files to your Google bucket - video
(howto) Upload files to your Google bucket - written

Updated on 2018-08-15
