EDD documentation

Frequently Asked Questions

How do I get data into EDD?

The Experiment Data Depot (EDD) imports data in two steps (Fig. 1):

Experiment Description input: this file describes your experiment design so EDD knows how to store all your data, and how it is related to your strains and samples (see below for more information).
Data input: different types of data can be added in several successive steps. These data input steps are independent of each other, facilitating the combination of different types of data (e.g. multiomics data sets).

The EDD overview page has tutorials and protocols for study creation and data import, as well as links to multiple EDD sites.

Fig. 1: Data input process. Data is imported to EDD in two phases. In the first one, you import an experiment description file, which describes the experiment to EDD so it knows how to store your data. Afterwards, you can add as many data types (e.g. transcriptomics, proteomics,...) as desired in each of the data imports.

What is an experiment description?

An experiment description file is an Excel file that describes your experiment (Fig. 2): which strains you are using (part ID from ICE), how they are being cultured (lines and metadata), the procedure for processing samples (protocol), and which samples are being taken (assays). Look at Fig. 3 to see how EDD organizes your experimental data (i.e. the ontology).

The experiment description provides a standardized, single-file description of your experiment. It is useful, for example, when you design your experiment, or when the proteomics or metabolomics team needs to understand your experiment so they can plan how to process your samples.

Input in Excel:

Import result in EDD:

Fig. 2: Examples of experiment description. The upper picture shows the example experiment description file in Excel, with a line name that helps identify the culture, a line description that gives more information on the line, the part ID in the corresponding part repository (public ABF in this case), different types of metadata (shaking speed, ... growth temperature), the number of replicates, and an optional field (in blue): assay information (i.e. a protocol applied to a line at a given time point) for targeted proteomics. The replicate count creates one line per replicate (3 for wild type and 4 for the other strain, see below). The assay information is optional: you may want to use this to tell the proteomics or metabolomics services when you are sampling so they can add the data, or you can add the data later yourself. The lower picture shows how this information is represented in EDD. Notice that the Part IDs have become links to the corresponding registry. Click on the figs for examples.

What is a line?

A "Line" in EDD is a distinct set of experimental conditions (e.g. a single culture, Fig. 2), which represents a different line of enquiry or question being asked (e.g., does this strain under these conditions improve production?). A Line generally corresponds to the contents of a shake flask or well plate, though it could also be, e.g., a tube containing an Arabidopsis seed or an ionic liquid for a given pretreatment. A line is not a sample: several samples can be obtained from a single line at different times (see Fig. 3).

A typical experiment (Fig. 3) would take strains from a repository, culture them in different flasks (lines), apply a protocol at a given time (an assay), and obtain different measurement data. Protocols are kept under protocols.io to enable reproducibility and better communication. You can find the LBNL repository here.

How do I choose good line names?

A good way to name your lines is to combine the strain name, the culture conditions, and whichever other condition is being changed in the experiment. For example, WT-LB-70C would indicate a wild type grown on LB at 70 C (imagine you are trying different growth temperatures), Cineole-EZ-50C would indicate a cineole-producing strain grown on EZ at 50 C, and so on.
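The naming pattern above can be sketched in a few lines of Python. This is only an illustration: the helper name make_line_name and the choice of a hyphen as separator are assumptions made for this example, not something EDD requires.

```python
def make_line_name(strain: str, medium: str, temperature_c: int) -> str:
    """Join strain, medium, and growth temperature into a short line name."""
    parts = [strain.strip(), medium.strip(), f"{temperature_c}C"]
    # An empty part would produce an ambiguous name such as "WT--70C".
    if any(not part for part in parts):
        raise ValueError("strain and medium must not be empty")
    return "-".join(parts)


names = [
    make_line_name("WT", "LB", 70),
    make_line_name("Cineole", "EZ", 50),
]
print(names)
```

Generating names from the same function for every line keeps them consistent, which makes them easier to sort and filter later.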

Fig. 3: EDD data organization (ontology). In this example, we have three different strains (A, B, and C). Strain A is cultured in two different flasks, giving rise to two lines (A1 and A2). Strain B is cultured in a single flask, giving rise to a single line, B1. Strain C is cultured in three flasks, giving rise to three lines: C1, C2 and C3. Line A2 is assayed through HPLC (protocol) at times t = 10 hr (assay A2-HPLC-1) and t = 8 hr (assay A2-HPLC-2). Assay A2-HPLC-1 produces two measurements: 3 mg/L of Acetate and 2 mg/L of Lactate. Assay A2-HPLC-2 produces two measurements: 2 mg/L of Acetate and 1.5 mg/L of Lactate.
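To make the hierarchy in Fig. 3 concrete, here is a minimal sketch that models line A2 with plain Python dataclasses. The class and field names are hypothetical and chosen for this example only; they are not EDD's internal data model or its API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Measurement:
    measurement_type: str
    value: float
    units: str


@dataclass
class Assay:
    name: str
    protocol: str
    time_hr: float
    measurements: List[Measurement] = field(default_factory=list)


@dataclass
class Line:
    name: str
    strain: str
    assays: List[Assay] = field(default_factory=list)


# Line A2 from the Fig. 3 caption: one line, two HPLC assays, two measurements each.
line_a2 = Line(
    name="A2",
    strain="A",
    assays=[
        Assay("A2-HPLC-1", "HPLC", 10.0, [
            Measurement("Acetate", 3.0, "mg/L"),
            Measurement("Lactate", 2.0, "mg/L"),
        ]),
        Assay("A2-HPLC-2", "HPLC", 8.0, [
            Measurement("Acetate", 2.0, "mg/L"),
            Measurement("Lactate", 1.5, "mg/L"),
        ]),
    ],
)

# Walk the hierarchy: line -> assay -> measurement.
for assay in line_a2.assays:
    for m in assay.measurements:
        print(line_a2.name, assay.name, assay.time_hr, m.measurement_type, m.value, m.units)
```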

What are the column options for the experiment description file?

The primary line characteristics that you should have in every experiment description and every EDD service (instance) are:

Line Name: a short name that uniquely identifies the line (REQUIRED).
Line Description: A short human-readable description for the line.
Part ID: the unique ICE part number identifiers for the strains involved.
Replicate Count: the number of experimental replicates for this set of experimental conditions.

Other metadata types (e.g. media, temperatures, culture volume, flask volume, shaking speed, etc.) are also available, but depend on which EDD site you are using. You can find the acceptable metadata types (i.e. column headers) for each EDD site in (to be done). Alternatively, you can ask your EDD administrator. Columns can be in any order.
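Before uploading, it can help to check the rows of an experiment description against the rules above. The following is a minimal sketch that works on rows already loaded as Python dictionaries. The function name check_description_rows, the part ID placeholders, and the requirement that Replicate Count be a positive whole number are assumptions made for this example; EDD performs its own validation on import.

```python
def check_description_rows(rows):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    seen_names = set()
    for number, row in enumerate(rows, start=1):
        # Line Name is the only column marked REQUIRED, and it must be unique.
        name = str(row.get("Line Name") or "").strip()
        if not name:
            problems.append(f"row {number}: Line Name is required")
        elif name in seen_names:
            problems.append(f"row {number}: Line Name {name!r} is not unique")
        else:
            seen_names.add(name)
        # Assumption for this sketch: a replicate count, when given, is >= 1.
        count = row.get("Replicate Count")
        if count is not None and (not isinstance(count, int) or count < 1):
            problems.append(f"row {number}: Replicate Count must be a positive whole number")
    return problems


rows = [
    {"Line Name": "WT-LB-70C", "Line Description": "Wild type on LB at 70 C",
     "Part ID": "PART-ID-PLACEHOLDER-1", "Replicate Count": 3},
    {"Line Name": "Cineole-EZ-50C", "Line Description": "Cineole producer on EZ at 50 C",
     "Part ID": "PART-ID-PLACEHOLDER-2", "Replicate Count": 4},
]
print(check_description_rows(rows))
```

Site-specific metadata columns are deliberately not checked here, since the accepted headers differ between EDD sites.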

What types of data does EDD accept?

EDD sites are typically configured to accept OD, transcriptomics, proteomics, and metabolomics data by default. Beyond that, each EDD site can be configured to store whichever data types are needed by the organization that created the site. You can see which measurement types your EDD site accepts by adding "/load/help/#generic-mtypes" to its internet address. For example, https://public-edd.agilebiofoundry.org/load/help/#generic-mtypes gives you the measurement types accepted for the public ABF EDD site (Fig. 4). These names are what you are supposed to use under the "Measurement Type" header (see Fig. 5 below).
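If you work with several EDD sites, the help address can be built programmatically. This is a small sketch; the function name help_url is hypothetical, and it only assembles the address described above without contacting the site.

```python
def help_url(site_url: str, section: str = "generic-mtypes") -> str:
    """Append the help path and section anchor to an EDD site address."""
    # Strip a trailing slash so the result does not contain "//load".
    return site_url.rstrip("/") + "/load/help/#" + section


print(help_url("https://public-edd.agilebiofoundry.org"))
```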

Transcriptomics, proteomics, and metabolomics data are special in the sense that you need to put the corresponding ID (identifier) for the gene/protein/metabolite under the "Measurement Type" header. We use systematic identifiers to avoid having a multitude of names for the same thing (e.g., "glucose", "gluc", "consumed_gluc", "glc" for glucose). EDD uses the following identifiers for --omics data:

Transcriptomics: GenBank gene IDs (e.g. b0180).
Proteomics: UniProt IDs (e.g. P17115). If the protein is not in UniProt yet, you can use an ICE identifier.
Metabolomics: PubChem IDs (e.g. CID:5793).

You can see their corresponding units below.
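A quick format check can catch free-text names such as "glucose" before import. The sketch below is an illustration under stated assumptions: the "CID:" followed by digits pattern is inferred from the example above, the UniProt pattern is the accession format published in the UniProt help pages, and the function name looks_like_identifier is hypothetical. It checks format only, not whether the identifier exists, and it would reject an ICE identifier used in place of a UniProt ID. Gene identifiers are not checked because their format varies by organism.

```python
import re

PUBCHEM_CID = re.compile(r"CID:\d+")
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

PATTERNS = {"metabolomics": PUBCHEM_CID, "proteomics": UNIPROT_ACCESSION}


def looks_like_identifier(data_type: str, identifier: str) -> bool:
    """Return True if the identifier has the expected format for the data type."""
    # fullmatch requires the whole string to fit the pattern, not just a prefix.
    return PATTERNS[data_type].fullmatch(identifier) is not None


print(looks_like_identifier("metabolomics", "CID:5793"))
print(looks_like_identifier("metabolomics", "glucose"))
print(looks_like_identifier("proteomics", "P17115"))
```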

Fig. 4: Available data types for the public ABF EDD site. Information obtained in June 2021. The available types change as requests are made to the administrators.

What type of files does EDD require for data input?

For each type of data there are two alternatives for input files:

Generic input files are the simplest option, since they are the same for all data types. Generic input files in EDD have five columns. The first column specifies the line (e.g., WT). The second column is the measurement type identifier (e.g. which protein or metabolite). The third column is the time point (e.g., 1 h). The fourth column is a value for the corresponding identifier (e.g., 2.4). The final column is the unit corresponding to this value (e.g. nM).

Figure 5 shows an example of a generic file for metabolite data. The first column is the line name, corresponding to a wild type E. coli. The second column is the measurement type: here, metabolites identified by PubChem identifiers. For example, CID:5793 is glucose, and CID:12988 is isoprenol. The third column is the time point at which the measurement was taken: for example, 0 hours, and 1 hour for the last two rows. The fourth and fifth columns indicate the amount of the metabolite and the units. For example, the second row indicates 3.71 mM of acetate. Examples for transcriptomics and proteomics can be found in Figs. 6 and 7.
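The five-column layout described above can be produced with Python's standard csv module. This is a minimal sketch under several assumptions: the header "Line Name" for the first column, the use of CSV rather than a spreadsheet, time given as a number of hours, and the numeric values are all illustrative and not taken from Fig. 5. Compare the result with the template or figure for your EDD site before uploading.

```python
import csv
import io

# Assumed header names; "Measurement Type", "Time", "Value" and "Units"
# follow the wording used elsewhere in this FAQ.
HEADER = ["Line Name", "Measurement Type", "Time", "Value", "Units"]

# Illustrative rows: line, PubChem identifier, time (h), value, units.
rows = [
    ("WT", "CID:5793", 0, 2.4, "mM"),
    ("WT", "CID:5793", 1, 1.9, "mM"),
    ("WT", "CID:12988", 1, 0.3, "mM"),
]


def write_generic_file(rows, handle):
    """Write the header and one record per measurement to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(HEADER)
    for row in rows:
        if len(row) != len(HEADER):
            raise ValueError(f"expected {len(HEADER)} columns, got {len(row)}")
        writer.writerow(row)


buffer = io.StringIO()
write_generic_file(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with open(path, "w", newline="") as the handle.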

Fig. 5: Metabolite data import file example. First column indicates line, second column indicates metabolite through a PubChem identifier, third column indicates time point, and the final two columns indicate amount of metabolite and units.

Fig. 6: Transcript data import file example. First column indicates line, second column indicates gene through a Genbank gene ID, third column indicates time point, and the final two columns indicate amount of transcript and units.

Fig. 7: Protein data import file example. First column indicates line, second column indicates protein through a UniProt ID, third column indicates time point, and the final two columns indicate amount of protein and units.

Non-generic data files allow you to feed the files straight out of your instrument into EDD, since putting data in the generic format might involve some effort. So far we have conversion scripts for

and these should soon be integrated into EDD so you can upload these files directly.

Figure 8 shows an example of a data file from the Agilent HPLC, which can be easily converted into EDD input through these scripts.

Fig. 8: Agilent HPLC output data. This file can be converted into an EDD generic file (see above) through a set of scripts that will soon be included in EDD so you can input this file directly, instead of the generic one.

What type of units does EDD use?

The measurement types discussed above can use several units. For --omics and OD data you should use:

Transcriptomics: RPKM, RPM or TPM.
Proteomics: proteins/cell or counts.
Metabolomics: g, mM, mmol, mol, g/L, mg/L, mol/L, nM, uM.
OD: no units, n/a.

You can find which units your EDD site accepts by adding "/load/help/#units" to its internet address. For example, https://public-edd.agilebiofoundry.org/load/help/#units gives you the accepted units for the public ABF EDD site (Fig. 9).

Fig. 9: Units available in the public ABF EDD site. Information obtained in June 2021. More units can be made available by asking the administrators.
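The units listed in this section can be kept in a small lookup table to check a data file before import. This is a sketch only: the table copies the list given in this FAQ, the reading of the OD entry as the literal string "n/a" is an assumption, and the function name unit_is_listed is hypothetical. Your EDD site may accept more or fewer units, so treat its "/load/help/#units" page as the authority.

```python
ALLOWED_UNITS = {
    "transcriptomics": {"RPKM", "RPM", "TPM"},
    "proteomics": {"proteins/cell", "counts"},
    "metabolomics": {"g", "mM", "mmol", "mol", "g/L", "mg/L", "mol/L", "nM", "uM"},
    # Assumption: OD carries no units and is entered as "n/a".
    "OD": {"n/a"},
}


def unit_is_listed(data_type: str, unit: str) -> bool:
    """Return True if the unit appears in the list for the given data type."""
    return unit in ALLOWED_UNITS.get(data_type, set())


print(unit_is_listed("metabolomics", "mM"))
print(unit_is_listed("transcriptomics", "mM"))
```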

Why should I use the Experiment Data Depot?

The Experiment Data Depot (EDD) is a standardized repository of experimental data. This is useful for the following reasons:

EDD provides a single point of storage for your experimental data, to be easily referenced. Instead of providing a collection of spreadsheets organized in an ad hoc manner in the supplementary material of your paper, you can give a single URL where your readers can find all the data in a format that is always the same. This will make your papers more likely to be cited, in the same way that storing your strain information in the Inventory of Composable Elements (ICE) makes it easier to access and more likely to be cited.
Easily collate different types of multiomics data. Comparing the results of phenotyping a cell using transcriptomics, proteomics and metabolomics can be complicated. EDD facilitates this task with the use of a standard vocabulary for genes, proteins and metabolites, solving the problem of leveraging multiomics data.
EDD facilitates data analysis. By using a standard data format through EDD, you can leverage previously created Jupyter notebooks to easily do your calibrations and statistics (e.g. calculate error bars).
Advanced machine learning techniques. EDD helps you work with data scientists to use Machine Learning and Artificial Intelligence techniques to effectively guide metabolic engineering. Just give them the link to your study and you will save them the wrangling of spreadsheets that consumes 50-80% of their time.

Why can't I see the data in the link?

Most likely the person who sent you the link did not add you to the study permissions. You can ask them to add you by going to the "Overview" tab, then clicking on "Permissions" and "Configure Permissions", choosing you among the users, and setting the permission to "Read". They can give access to everyone by choosing "Read" in the menu next to "Everyone" (see Fig. 10).

Fig. 10: Changing permissions to make your study viewable. By default, studies are viewable only by you. You must change permissions for other people to access them.

How do I give EDD access to my collaborators?

Please have a look above to see how to change permissions so your studies can be accessed by those you send a link to.

How do I publish EDD datasets?

In the same way that you can provide a single link to describe all the strain information for a given project by using the Inventory of Composable Elements (ICE), you can provide all your experimental data information with a single link. This will make your data usable by other researchers and increase your citations.

In order to provide all your experimental data with a single internet link, you need to import your data into EDD as per the tutorials. Then, all you have to do is include the study link that you can see in your web browser. You can see how this works in this paper (and corresponding link) and this paper (and corresponding link), for example.

Please, do hang on to your inputs, since transferring data from the private version to the public version of EDD is something we are still working on.

EDD advanced use

What is a slug?

A slug is the last part of the internet address corresponding to the EDD study. You will need this slug in order to retrieve the data through the API into, e.g., a Jupyter notebook (Fig. 11). You will obtain the data as a pandas dataframe in your Jupyter notebook, and you can then apply whichever Python code you want. You can see a demonstration of this in this paper and a tutorial with screencasts here.

Fig. 11: Retrieving data from EDD through slugs. Use the last part of the internet address of your study to access the study data through a REST API (/rest/docs in your depot site). You can then use the study data in your Jupyter notebooks.
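Extracting the slug from a study address can be done with Python's standard library. This is a minimal sketch: the address below is a made-up placeholder on example.org, the function name slug_from_study_url is hypothetical, and it assumes the address ends with the slug (copy the address from the study's main page, not from a sub-page). It does not call the REST API; see /rest/docs on your EDD site for that.

```python
from urllib.parse import urlparse


def slug_from_study_url(study_url: str) -> str:
    """Return the last non-empty path segment of a study address."""
    path = urlparse(study_url).path
    # Ignore empty segments caused by leading or trailing slashes.
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        raise ValueError("the address has no path, so it contains no slug")
    return segments[-1]


print(slug_from_study_url("https://edd.example.org/s/my-study-name/"))
```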

What is an assay?

An assay in EDD terminology roughly corresponds to a sample for a given Protocol. More precisely, it is the application of a Protocol on a specific Line. However, you can understand it as a sample for most purposes (see below): for example, if you apply proteomics to study protein expression of Line C1, an E. coli culture (see Fig. 2 in the original paper), the sample you take that will go into your mass spectrometer will correspond to assay C1-PROT-1.

However, if we want to get precise and technical, this is only true for destructive protocols. For destructive protocols, the sample is destroyed after applying the protocol (e.g., proteomics). For this type of protocol, EDD only assigns one time point per assay. There are also non-destructive protocols, e.g., OD measurement by an optode in a Biolector, which provide OD measurements for several time points without destroying the cells being measured. For non-destructive protocols, EDD assays may have several time points associated with them.
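The rule above (one time point per assay for destructive protocols, several allowed otherwise) can be written as a small check. This is a sketch with a hypothetical function name and signature, useful for checking your own data tables before import; it is not how EDD enforces the rule internally.

```python
def check_assay_time_points(destructive: bool, time_points):
    """Return the sorted, de-duplicated time points, or raise if the rule is broken."""
    points = sorted(set(time_points))
    if not points:
        raise ValueError("an assay needs at least one time point")
    # A destructive protocol uses up the sample, so only one time point is possible.
    if destructive and len(points) > 1:
        raise ValueError("a destructive protocol allows only one time point per assay")
    return points


# Proteomics (destructive): a single time point.
print(check_assay_time_points(True, [10.0]))
# OD by optode (non-destructive): several time points on the same assay.
print(check_assay_time_points(False, [0.0, 1.0, 2.0]))
```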

Bulk line creation

In development...

Combinatorial line creation

In development....

Assay creation

In development....

Combinatorial Assay creation

In development....

Worklists

A worklist is a specific type of export, meant to assist in generating data for import into EDD, following the successful definition of an Experiment Description. Worklist support currently only covers two specific uses; expanded support for more flexible worklist creation is coming soon. The "Agilent LC-MS" worklist creates an input template for processing proteomics samples through Agilent instruments (TODO: link to the protocols.io for proteomics). The "Blank Generic Data Template" worklist makes a template file ready for upload to EDD via the "Import Data" button; just fill in the columns for Measurement Type, Time, Value, and Units.

Starting your own EDD service

You can learn how to start your own service of EDD (like https://edd.jbei.org/ or https://edd.agilebiofoundry.org/) by reading the installation instructions.

For more information, please email ese-robotics@lbl.gov
