The grapevine research community generates large omics datasets to address the major challenges facing viticulture and oenology. These datasets are usually processed within a single study that tries to answer one biological question. The value of data from individual experiments is greatly enhanced when they are considered in a wider context through meta-analysis. Meta-analysis of integrated datasets helps to identify the mechanisms underlying the interactions between plants, their environment and plant management techniques. However, interpreting and re-processing data requires quality metadata that provide appropriate context and, more generally, compliance with the FAIR principles.
There are several reasons behind developing specific guidelines for the grapevine research community:
Current guidelines and ontologies supporting data standardization cannot always be applied directly to grapevine because its unique traits differ from those of model organisms. For that reason, the Integrape consortium (COST action 17111, funded by the Horizon 2020 framework of the European Union) has developed specifications for grapevine data. The guidelines for the submission of nucleotide sequences to ENA are one of them.
Grapevine researchers generally have a background in biology, with very diverse levels of skill in computational biology. This is also true for the members of the community who submit data to public archives such as ENA, even though these archives of the INSDC consortium are among the tools most cited by the community for data archival and for data reuse through the associated data portals.
This tutorial aims at (i) facilitating data submission itself and (ii) enhancing the quality of the metadata associated with the submission, bearing in mind that there is currently no clear direct benefit to publishing good metadata, since it is not a prerequisite for publication. That is why we aim to provide a tutorial that makes this process as straightforward and simple as possible, balancing which metadata should be mandatory and which can remain optional.
ENA accepts sequence reads and associated analyses (e.g. genome assemblies, annotations). Each deposited dataset is composed of data about the dataset or study, about the samples, details about the experiment and the sequence data files.
Once the submitting person or institute decides to make a study public, all data submitted to ENA are exchanged with the other International Nucleotide Sequence Database Collaboration (INSDC) partners: NCBI and DDBJ. This means in turn that if new data (read files, experiments) are added to an already public study, these data become public immediately. Aside from documentation pages, ENA provides submission tutorials and FAQs.
BioProject, BioStudies, Study and BioSample
One important concept to understand before submitting data to ENA, or to its counterpart at NCBI, SRA, is the meaning of, and the hierarchy between, BioStudy, Study and BioSample. As data submitted to ENA are shared with NCBI SRA, a Study is referred to as a BioProject at SRA; the BioProject and Study accession numbers remain the same. At the request of the authors, it is possible to create an umbrella study that groups projects that are part of a single collaborative effort but represent distinct studies differing in methodology, sample material, or resulting data type. It is stored in the BioStudies database at EMBL-EBI.
A study is composed of (bio)samples, i.e. physical objects that are composed of biological material and that have, or will have, data associated with them. Here the data are raw reads.
Both studies and samples have a set of associated metadata. Importantly, metadata associated with a study can be extrapolated to its samples: in that case they will be identical between samples. Samples have several types of associated metadata:
Basic details and organism details: generally mandatory data about sample identification.
Collection events: description of the sample collection process.
Plant material description: metadata ideally allowing the precise identification of the organism from which the sample comes.
Variation event in comparative studies: experimental factors or phenotypic traits that vary between samples, justifying the comparative study.
In addition, it is possible to register environment parameters for each sample.
Go to the registration form: https://www.ebi.ac.uk/ena/submit/sra/#registration
Fill in the Centre Name, the Laboratory Name, Address, etc. of the main contact. Other contacts may be added later.
After clicking on « Register », you receive a unique “Webin-*” identifier by email, which you must use with your password to log in.
Connect to the ENA Interactive submission service
Go to the login page: https://www.ebi.ac.uk/ena/submit/sra/#home
The “Study” is the whole scientific project (or paper). The study is described only once and can be associated with many data submissions (DNA-Seq, RNA-Seq…). In the « New Submission » tab, select the « Register study » button and click on « Next ».
Metadata associated with a study
Green: required by ENA to proceed with submission.
Yellow: information that should appear in any submission.
Blue: data depending on the type of experiment and on the depth of information the authors wish to provide.
Some metadata are modelled in ENA databases in specific fields, while others are treated as tags.
Tags: Other attributes can be provided to add a deeper description of the study
“Samples” here means individual nucleic acid samples or tubes, not sequence data. This section is not about submitting the raw reads (fastq files): that is done in the next submission step. Sample submission involves describing how the samples were obtained and identifying as precisely as possible the plant material they are derived from.
This can be done either by building an Excel spreadsheet for submission or by using default checklists.
Building the Excel spreadsheet for submission or using default checklists
This is described in another document.
The necessary metadata are grouped into 4 main categories:
Collection event information: parameters related to the sample acquisition and contextual data
Biological Material, Part and developmental stage of organism: data related to the biological material
The experimental factor and variable tables are used depending on your experimental design and your needs in terms of data integration. If your experiment involves a treatment implying different conditions for subsets of your sample list, then information on the experimental factors is required. If your sampling method first requires phenotyping your plant material (e.g. sick or healthy material), use metadata from the variable table to describe how the phenotype is determined. Be aware that when your experimental design compares plant parts and developmental stages, this information is de facto accessible in the previous table, but it can also be provided in the description of the phenotyping variable, specifically in the method description.
To generate the tabulated file that you need to fill in with your sample data, we propose below a tool that asks you a series of questions in order to generate the appropriate fields.
The tables at the end of the guidelines give more details and help on what is expected in each field.
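Before uploading the tabulated file, it can be worth checking that every sample line has as many fields as the header line, since a missing or extra tab is an easy mistake to make when editing by hand. The sketch below is only an illustration and assumes a tab-separated file whose first line is the header; the function name check_field_counts is hypothetical and is not part of any ENA tool.

```shell
# Report every line of a tab-separated file whose number of fields differs
# from that of the header line (assumed to be line 1).
# The exit status is 1 if at least one such line is found, 0 otherwise.
check_field_counts() {
    awk -F '\t' '
        NR == 1 { expected = NF; next }
        NF != expected {
            printf "line %d: %d fields, expected %d\n", NR, NF, expected
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

It would be called with the path to your file, for example check_field_counts samples.tsv, where samples.tsv is a placeholder name.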
The file should look as follows.
Once you have your checklist, you can upload it directly. You will then be automatically redirected to the next web page for sample submission. Either you have already added your samples in the .csv file, or you can add samples manually here.
If you have 10 samples, write 10 and click on « + ».
Manually fill in the information for all of your RNA samples.
Then click on the « Submit » button at the end of the page.
You will receive an email confirming the submission.
You can see the submitted samples in the « Samples » tab.
First, you have to put all the fastq files and the corresponding md5 files on the ENA FTP space linked to your Webin account. These files are generally provided by the sequencing platform. If the md5 files are not provided, they can be generated under Linux with the md5sum command.
On a Linux server, you can upload the files with lftp using the following commands:
# launch lftp
lftp
# connect to your ftp space
lftp Webin-*:Password@webin.ebi.ac.uk
# upload all the fastq files (mput expands wildcards, put does not)
mput /path/to/*.fastq.gz
# upload all the corresponding md5 files
mput /path/to/*.fastq.gz.md5
Alternatively, you can download a Windows tool to do this. The documentation to install and use it is here:
https://ena-docs.readthedocs.io/en/latest/submit/fileprep/upload.html#using-webin-file-uploader
After uploading your files, you need to create a raw read submission:
Click on « Submit sequence reads and experiments » and then click on « Next ».
Select your project and click on « Next ».
Then, click on the « Skip » button because you have already created your samples.
Select the file format you will be submitting, usually « One Fastq file » or « Two Fastq files » for RNA samples.
Fill in the technical information for each sample. We advise filling in this information in a spreadsheet (e.g. an Excel file) and then uploading it by clicking on the « Upload completed spreadsheet » button, which is easier.
Example of spreadsheet:
You must give the exact names of the files you have put on the FTP, along with their md5 checksums.
Then click on « Submit »; you will receive an email. Your RNA-Seq dataset is now submitted, and you can make your project public.