JGI is known worldwide for the high quality of our genomic and metagenomic data, and we take great pride in providing these data outputs to our users. We take numerous steps to ensure our data quality, including the following:
We start with top-quality samples: Our projects begin with samples that meet the highest quality standards, increasing the odds of continued quality further in the pipeline.
We conduct ongoing quality control: Our team quality-checks samples and lab outputs before sequencing and before data are sent to analysis.
We draw on accumulated knowledge: We have more than twenty years’ experience, and we’re actively recruiting new team members to build our shared knowledge.
We provide deeper metagenome sequences than many other institutions: This added sequencing depth promotes greater flexibility of analysis.
We develop new tools: As research evolves, so too do our tools — we’re always developing new tools and techniques to keep pace with researchers. We also do extensive in-house testing before releasing tools to the public.
The main search on the Data Portal provides access to public data for projects going back to 2002...provided that those files have key identifiers associated with them so the Data Portal search engine can find the files and then group them appropriately.
Available project types include:
Genomes and annotations
Resequencing
Metabolomics
To access private data, users should navigate to the My Data Portal section of the site. See the My Data Portal section below for more information.
When the JGI Data Portal (JDP) refers to data, we are typically talking about files. JDP does not interrogate the contents of files, so we are typically not aware of what information is inside the file.
When JDP refers to metadata, we are talking about the file descriptors. These are tags that are contained in a document about the file, and can be searched via our Elasticsearch index (after the documents have been ingested). These tags include: project identifiers (IDs), project names, organism names, NCBI taxon IDs, etc. We will be expanding to include more sample and library-related metadata.
In short, JDP searches metadata to find data (files).
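As a rough illustration of that distinction, the sketch below models each file as a record with metadata tags and searches only those tags. The `FileRecord` class, its field names, and the `search_metadata` function are hypothetical and exist only for this example; this is not how JDP is implemented, since JDP queries an Elasticsearch index.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """A file plus the metadata tags that describe it (hypothetical fields)."""
    file_name: str
    metadata: dict = field(default_factory=dict)


def search_metadata(records, term):
    """Return records whose metadata values contain the term (case-insensitive).

    File contents are never opened; only the descriptors are examined.
    """
    needle = term.lower()
    return [
        r for r in records
        if any(needle in str(value).lower() for value in r.metadata.values())
    ]


# Made-up records for illustration only.
records = [
    FileRecord("assembly.fasta", {"organism_name": "Example organism A", "project_id": 1001}),
    FileRecord("reads.fastq.gz", {"organism_name": "Example organism B", "project_id": 1002}),
]

hits = search_metadata(records, "organism a")
print([r.file_name for r in hits])
```

The search finds the file through its descriptor, without reading the file itself.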
The Data Portal’s updated search functionality makes it easier than ever for you to find the genomic or metagenomic data you’re looking for. Although the search feature doesn’t prohibit any types of queries, we recommend that you search using any of the following types of queries:
Genome or metagenome name
JGI project name
PI name
Any of the various IDs associated with the data you’re seeking
These types of search queries yield the most accurate results. Whenever possible, we encourage you to use the most complete search term you can (for example, a PI’s full name or the full name of a genome); if you don’t have a full search term, enter as much information as you have available.
In addition to updated filter options, our search includes other new features, including typeahead and cross-kingdom searching.
To use the typeahead feature, enter a query in the search box — a list of recommendations will be generated based on the terms or ID you entered. You can either enter your full search query (term) in the search box and hit Return to initiate a search, or you can choose an option from the typeahead list to initiate the search.
The Data Portal defaults to cross-kingdom search; that is, it searches data from all of the individual kingdom portals (Phytozome, PhycoCosm, MycoCosm, and IMG). If you’d like to search data from within a single kingdom portal, you can indicate this using the dropdown menu that appears at the left edge of the search box. Click the downward-facing arrow and choose a portal name from the dropdown list; then, enter your search term and hit Return. Your search will only return results from within the specified portal. If you’d like to search for something across all portals, select “Everything” from the dropdown menu.
See our Search Tips section for examples of how to create more complex searches, information on how JDP's search works and information on filtering.
The JGI’s Data Portal searches more than 200 metadata fields in our Elasticsearch (ES) index for your search parameter. The most important fields are project names, organism names, NCBI taxonomy, and file names. Search terms that match one of these categories will have their search relevancy score boosted and will be presented in the “Most Relevant Results” category.
If you would like to improve the relevancy of your search, please provide a more detailed search parameter.
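The boosting described above can be pictured with a toy scoring function. This is only a sketch under stated assumptions: the field names, the boost weight, and the `relevance` function are hypothetical, and the actual Elasticsearch scoring used by the Data Portal is more involved.

```python
# Hypothetical names for the "most important" fields listed above.
BOOSTED_FIELDS = {"project_name", "organism_name", "ncbi_taxonomy", "file_name"}
BOOST = 5.0  # hypothetical weight; the real boost values are not documented here


def relevance(document, term):
    """Score a metadata document: boosted fields count more than other fields."""
    needle = term.lower()
    score = 0.0
    for field_name, value in document.items():
        if needle in str(value).lower():
            score += BOOST if field_name in BOOSTED_FIELDS else 1.0
    return score


# Made-up documents for illustration only.
docs = [
    {"file_name": "notes.txt", "comment": "mentions poplar in passing"},
    {"file_name": "poplar_assembly.fasta", "organism_name": "poplar (example)"},
]

ranked = sorted(docs, key=lambda d: relevance(d, "poplar"), reverse=True)
print([d["file_name"] for d in ranked])
```

A match in an organism name or file name outranks a match in a less important field, which is why more specific search terms tend to surface better results.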
If it seems like some results are missing, you can set your filter to “Show All Results”, broaden your search, or reach out to the Data Portal team.
Only public data that is produced and/or processed at JGI is available through the Data Portal's search mechanism. If you need access to private data, please visit the Genome Portal or (starting June 2024) the My Data Portal section of the Data Portal (https://data.jgi.doe.gov/mydata).
When a single file matches your search criteria, the Data Portal will display all files that are part of that file’s group. The groups are presented as panels that you can expand and collapse.
Occasionally, you may conduct a search and see results where your search term does not appear in the result table. This happens because JGI’s Elasticsearch (ES) index contains over 200 metadata fields, and the Data Portal does not display every field it has searched when it reports results in the browser or to a client using the API.
If you expect to see many thousands of results returned from search and you are not seeing them, please be aware that the Data Portal returns a maximum of 10,000 files for any given search. You can try to get more relevant results by providing more detailed or specific search parameters, or you can reach out to the Data Portal team to solve your query needs.
Public data at JGI are data associated with a completed project, for which the embargo period has ended.
Private data at JGI are data with restrictions on their visibility and usage, limited to those with privileged access. Typically, private data is associated with projects that have not been completed or that remain under embargo.
The JGI Data Portal provides access to both unrestricted and restricted data. Your role and your relationship to JGI determine how you use the data responsibly.
Unrestricted data is data that has already been published, or that was made available to the public two or more years ago. Unrestricted data can be used by anyone; that is, you don’t need any special permission to download and analyze it. However, you may need to include a citation when using unrestricted data in a publication.
Restricted data is also available to the public, but there are restrictions on how it can be used. In most cases you will need to contact the PI for permission to use and cite the data in your own publication. Data from Phytozome may include additional restrictions that can be found in a genome’s Data Release Policy File. Additional restrictions on data from MycoCosm are described on individual genome pages.
Whether you're working with restricted or unrestricted data, it's always best to contact a PI before using their data in your publication. You can find the PI's email address in the kebab menu for each dataset (the three vertical dots at the far right of the genome/dataset row) or in the file manifest that is provided with your data.
The Data Portal has robust filters you can use to locate exactly the data you’re looking for. Using these filters offers a number of benefits: It allows you to download only the data that’s relevant to your research, reducing the amount of time you spend organizing data (and the amount of storage space you need). In addition, it expedites the download process — more narrowly defined data sets download more quickly than larger ones.
Using filters also benefits the larger JGI user community. More targeted requests, especially those that include files stored in the tape archive, can be processed more quickly by JGI; this, in turn, means that more external users can have their files delivered more quickly.
Apply filters to a list of genomes, metagenomes, or files to more quickly locate the data you’d like to download.
To apply filters:
If necessary, expand the filter menu by clicking anywhere on the panel labeled Filters. Once the menu is expanded, you’ll see filter groupings for:
Environment
Taxonomy
Dataset
File Property
Use the dropdown menus to view available filter criteria.
Select filter criteria by clicking directly on the menu items you’d like to choose; you can select as many criteria as you like from any of the available dropdown menus.
Selected filters will apply to your results list automatically. Applied filters are indicated by blue number icons within the dropdown menus from which filter options were selected; to see what specific filter options you’ve applied, click into the relevant dropdown menus — your choices will be indicated by checked checkboxes.
To clear filters, either deselect individual items from their dropdown menus or click the Clear All button. Your results list will update automatically each time you clear one or multiple filters.
Currently, the Data Portal allows you to filter data at the dataset level and the file level.
On the Data Portal, you can filter by the following environmental filters.
Ecosystem
Ecosystem category
Ecosystem type
Ecosystem subtype
Specific ecosystem
Learn more about these filters.
The Data Portal currently offers four taxonomy level filters:
Class
Order
Family
Genus
The Data Portal currently offers two dataset-level filters:
Version
Dataset Type (JGI Product Type)
The JGI Data Portal allows you either to search data from across all of its segments or to search within a single segment. Because the data within each portal is slightly different, each segment has slightly different filtering options.
When you’re searching for data across all segments (portals), you’ll see the following file-level filter options:
File type, which includes a full list of file extensions (e.g., FASTA, GFF, GFF3).
File availability, which indicates whether a file is available for immediate download or must be retrieved from the tape archive.
Data type, which is a further way to describe the type of file (e.g., qc data, raw data, primary alleles, secondary alleles).
Data group, which is a broad description of the type of data sought (analysis data, sequencing data, and so on).
Data usage, which is a way to find data that is unrestricted; that is, you are not required to contact a PI before you use it in a publication.
File Name Pattern, which is a way to use a pattern (regular expression) to filter files, so you see only those that match your desired pattern.
Example 1:
You may want to find all of the files that end in assembled.faa
Pattern (regular expression): .*assembled\.faa
Example 2:
You want to see only files that end in .gff
Pattern: .*\.gff
Example 3:
You want to see only .gff and .pdf files
Pattern: .*\.gff|.*\.pdf
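To try these patterns before entering them in the filter, you can test them locally. The sketch below uses Python's `re` module and assumes the pattern must match the whole file name; that matching mode is an assumption about the portal, and the file names are made up for illustration.

```python
import re

# Made-up file names for illustration only.
file_names = [
    "sample.assembled.faa",
    "sample.gff",
    "sample.gff3",
    "report.pdf",
    "reads.fastq.gz",
]

# The three patterns from the examples above.
patterns = {
    "Example 1": r".*assembled\.faa",
    "Example 2": r".*\.gff",
    "Example 3": r".*\.gff|.*\.pdf",
}


def filter_names(pattern, names):
    """Keep names that the pattern matches in full (assumed matching mode)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


results = {label: filter_names(p, file_names) for label, p in patterns.items()}
for label, matched in results.items():
    print(label, matched)
```

Under the whole-name assumption, the pattern from Example 2 keeps sample.gff but not sample.gff3. The backslash before the dot matters: an unescaped dot matches any character, not only a literal period.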
If your download includes archived files, you have requested data that is only available on our tape archive system. These archived files will be ready for download within 24 hours. When your files are ready, you will receive an email from JGI. This email will contain a link to a page with all of your requested files.
Archived files are available for download for 14 days after you request them. We recommend downloading files as soon as they become available.
JGI has approximately 13 petabytes of data and only keeps some of this data available for immediate download. Files that have been requested recently are kept on disk storage, and the rest of the files are stored on physical tapes in our tape archive.
Each user has a daily download limit of 10TB. This limit ensures that everyone using JGI’s portals can access the data they need as efficiently as possible.
The JGI team monitors data requests that exceed the daily limit. If you make a request in excess of the daily download limit, a member of the JGI team may reach out to you to learn more about your data needs. If you need 10TB of data or more, we encourage you to make several smaller downloads over the course of several days to prevent any issues or delays.
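One way to plan several smaller downloads is to group files into per-day batches that each stay under the limit. The sketch below is a simple greedy grouping; the function name and file list are hypothetical, and it assumes 10TB means 10 × 10^12 bytes.

```python
DAILY_LIMIT_BYTES = 10 * 10**12  # 10 TB, assuming decimal terabytes


def plan_daily_batches(file_sizes, limit=DAILY_LIMIT_BYTES):
    """Greedily group (name, size_in_bytes) pairs into batches that stay under the limit.

    A single file that alone meets or exceeds the limit ends up in its own batch.
    """
    batches, current, current_total = [], [], 0
    for name, size in file_sizes:
        # Start a new day if adding this file would reach or exceed the limit.
        if current and current_total + size >= limit:
            batches.append(current)
            current, current_total = [], 0
        current.append(name)
        current_total += size
    if current:
        batches.append(current)
    return batches


TB = 10**12
# Made-up files and sizes for illustration only.
files = [("a.tar", 4 * TB), ("b.tar", 5 * TB), ("c.tar", 3 * TB), ("d.tar", 6 * TB)]
plan = plan_daily_batches(files)
for day, batch in enumerate(plan, start=1):
    print(f"Day {day}: {batch}")
```

Each batch can then be requested on a separate day. A greedy grouping is not guaranteed to use the fewest days, but it keeps files in their original order and is easy to check by hand.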
If you are having issues restoring or downloading data, please consider a few things:
If your download is going to take multiple hours (because it is large or your connection is slow), please consider downloading via Globus (see below).
Globus provides a reliable, restartable download experience that will not be impacted by download interruptions.
If you are requesting restoration of or downloading files individually, you may overwhelm our system if you do not introduce pauses between requests.
Option 1:
Please introduce a pause of 30 seconds between requests.
Option 2:
Please batch your file restoration or download requests so that you submit multiple files per request.
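For scripted requests, the two options can be combined: group files into batches and pause between submissions. The sketch below shows only the control flow. The `submit` callable stands in for whatever request you actually send, and `fake_submit`, the file IDs, and the batch size of 3 are hypothetical placeholders rather than part of any JGI API.

```python
import time

PAUSE_SECONDS = 30  # the pause suggested in Option 1


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items (Option 2)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def submit_with_pauses(batches, submit, pause=PAUSE_SECONDS, sleep=time.sleep):
    """Call `submit` once per batch, sleeping between calls but not after the last."""
    responses = []
    for index, batch in enumerate(batches):
        if index > 0:
            sleep(pause)
        responses.append(submit(batch))
    return responses


def fake_submit(batch):
    """Placeholder for a real request; it only reports how many files were sent."""
    return len(batch)


file_ids = [f"file-{n}" for n in range(7)]  # made-up identifiers
batches = chunked(file_ids, 3)
# pause=0 keeps this demonstration quick; keep the default pause for real requests.
print(submit_with_pauses(batches, fake_submit, pause=0))
```

Seven files sent in batches of three need three requests instead of seven, with a pause between each request.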
The command line download option (or the API download option) provides a working curl command. curl is free and comes pre-installed on Mac, most Linux distributions, and Windows 10. If it’s not already installed on your machine, you can download curl here: https://curl.haxx.se/download.html
To download the files you’ve selected, simply:
click the Copy to clipboard button
paste the command on the command line (Terminal)
press Enter
A zip file containing the selected files will be downloaded to the folder in which the curl command was run. You can also choose to view the curl command in its entirety.
Curl is very powerful and flexible, and supports options like resuming interrupted downloads. To understand or customize the command provided here, check the curl tutorial page (https://curl.haxx.se/docs/manual.html) or the complete curl documentation (https://curl.haxx.se/docs/manpage.html)
Most GUI download managers and API clients can import curl commands. If you have a preferred client, or wish to experiment with visually retaining and organizing requests, you can use curl as a bridge to other applications.
Globus Online and Globus Connect are fantastic tools that have been developed by the University of Chicago to ease the burden of large data transfers for researchers.
Globus is fast, reliable, and convenient.
Fast: Globus Online and Globus Connect use multiple streams to transfer your data faster.
Reliable: Globus Online and Globus Connect can handle network errors gracefully so you don’t have to worry about interruptions.
Convenient: Globus Connect Personal manages your downloads automatically until they complete.
It suspends transfers when your computer goes to sleep, and resumes when it turns back on.
More information:
Server to Server Transfer: By using your institutional endpoint, your downloads won’t take up hard disk space on your personal computer.
A Globus endpoint is a computer or server with Globus Connect Personal or Globus Connect Server installed and configured to receive or initiate data transfers.
Yes, by installing and configuring Globus Connect Personal, you can transfer files directly to your computer.
https://docs.globus.org/how-to/
https://docs.globus.org/how-to/get-started/
Server: My organization has an endpoint that I want to use.
Laptop: I want to set up an account that isn’t associated with my organization.
https://www.globusid.org/create
JGI stores only the Globus username that you provide. JGI does not store any other Globus credentials.
JGI’s Data Portal sends you an email when your files are available for transfer.
The files remain available for transfer for 14 days.
From the results page, once you have selected the files you want to download, and provided you have an active Globus account and a Globus endpoint that can accept transfers:
Click the “Download” button
Click “Globus Download”
Click to acknowledge JGI’s Data Utilization Policy
Enter your Globus username and select “Submit”
Wait (up to a day) for an email from JGI’s Data Portal informing you that your files are ready to be transferred
Click on the link in the email and provide the name of your Globus endpoint
Start your transfer
https://docs.globus.org/how-to/get-started
Globus documentation: https://docs.globus.org/
Globus Online (1 min): https://youtu.be/QwVWJF6nRKI
Globus Video Collection: https://www.youtube.com/user/GlobusOnline/videos
Globus Connect Personal: https://youtu.be/bpnVcAN99WY
What data does Globus store: https://youtu.be/d3cbbUwxMwQ
Globus Online Webcast: https://youtu.be/4hsL6vTc1Yg
Globus 101 (1.5 hrs): https://youtu.be/K17ZZEIvWhg
Privileged Access means access to private data that has been explicitly granted to you; that is, private data that others want you to see.
Through the My Data Portal section of JGI's Data Portal you can gain access to this private data.
You can access My Data Portal by clicking on the My Data Portal link at the top of any JGI Data Portal page, the link in your avatar dropdown, or clicking on this link: https://data.jgi.doe.gov/mydata
My Projects
See a list of your projects - those on which you are PI, as well as those you have been granted access to (Co-PI, collaborator, etc). Filter and sort to find your projects of interest.
Grant Access
As a Proposal or Project Principal Investigator (or Co-PI), grant other users access to data that is not yet public.
Download Data
Contact Project Managers
Coming Soon:
My Requests & Downloads
See your data restoration request and download history
You must have an ORCID ID associated with your JGI account.
You can create an ORCID ID here: https://orcid.org/
You can associate your ORCID ID with your JGI account here: https://contacts.jgi.doe.gov/edit_contact
Principal Investigators (PIs) and Co-PIs can grant access to JGI Users using ORCID IDs.
My Data Portal --> Select Projects --> Grant Access
If you want to request access to a dataset, you need to reach out to a PI or Co-PI of the dataset. There is no mechanism for searching JDP for private datasets to which you do not have access.
Important Note:
JGI will be moving to a new authentication/authorization system. Associating your ORCID ID with your JGI account now will ease the move to the new system, which will require authentication (login) through ORCID.