Frequently Asked Questions

JGI Data Portal FAQ

Get answers to common questions about the Data Portal and its services

JGI Data and Metadata

Searching and Filtering

Search Examples and Tips

Command line downloading

Globus Downloading

Privileged Access: Accessing Private Data

JGI Data and Metadata

What differentiates JGI's data from that offered by other institutions?

Key differentiators for JGI's data

JGI is known worldwide for the high quality of our genomic and metagenomic data, and we take great pride in providing these data outputs to our users. We take numerous steps to ensure our data quality, including the following:

We start with top-quality samples: Our projects begin with samples that meet the highest quality standards, increasing the odds of continued quality further in the pipeline.
We conduct ongoing quality control: Our team quality-checks samples and lab outputs before sequencing and before data are sent to analysis.
We draw on accumulated knowledge: We have more than twenty years’ experience, and we’re actively recruiting new team members to build our shared knowledge.
We provide deeper metagenome sequences than many other institutions: This greater depth promotes more flexible analysis.
We develop new tools: As research evolves, so too do our tools — we’re always developing new tools and techniques to keep pace with researchers. We also do extensive in-house testing before releasing tools to the public.

What types of data (files) are available on the Data Portal?

Available Project Types

The main search on the Data Portal provides access to public data for projects going back to 2002, provided that those files have key identifiers associated with them so the Data Portal search engine can find the files and then group them appropriately.

Available project types include:

Genomes and annotations
Resequencing
Metabolomics

To access private data, users should navigate to the My Data Portal section of the site. See the Privileged Access: Accessing Private Data section below for more information.

What is the difference between data and metadata?

When the JGI Data Portal (JDP) refers to data, we are typically talking about files. JDP does not interrogate the contents of files, so we are typically not aware of what information is inside the file.

When JDP refers to metadata, we are talking about the file descriptors. These are tags that are contained in a document about the file, and can be searched via our Elasticsearch index (after the documents have been ingested). These tags include: project identifiers (IDs), project names, organism names, NCBI taxon IDs, etc. We will be expanding to include more sample and library-related metadata.

In short, JDP searches metadata to find data (files).
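As a rough illustration of that idea, here is a minimal Python sketch that searches a small in-memory list of metadata documents and returns the matching file names. The field names, values, and the `search` helper are hypothetical; the real Data Portal searches an Elasticsearch index, not a Python list.

```python
# Hypothetical metadata documents, one per file. Field names and values are
# illustrative only and are not the Data Portal's real schema.
DOCUMENTS = [
    {
        "file_name": "assembly.fasta",
        "project_id": "1001",
        "organism_name": "Example organism A",
        "ncbi_taxon_id": "2001",
    },
    {
        "file_name": "reads.fastq.gz",
        "project_id": "1002",
        "organism_name": "Example organism B",
        "ncbi_taxon_id": "2002",
    },
]


def search(term, documents):
    """Return file names whose metadata contains the term (case-insensitive)."""
    needle = term.lower()
    return [
        doc["file_name"]
        for doc in documents
        if any(needle in str(value).lower() for value in doc.values())
    ]


by_organism = search("organism b", DOCUMENTS)
by_project = search("1001", DOCUMENTS)
print(by_organism)
print(by_project)
```

Note that the sketch only ever reads the descriptors; it never opens the files themselves, which mirrors the distinction between metadata and data described above.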

Why don't I see raw sequence files for externally submitted datasets?

Some annotations that you can find on the Data Portal have been created from assemblies that were generated at a different facility and submitted to JGI for annotation.

JGI does not have the raw sequence files for these assemblies and annotations. You will need to find those files at NCBI or contact the PI.

Searching and Filtering

How can I search on the JGI Data Portal?

Searching on the Data Portal

The Data Portal’s updated search functionality makes it easier than ever for you to find the genomic or metagenomic data you’re looking for. Although the search feature doesn’t prohibit any types of queries, we recommend that you search using any of the following types of queries:

Genome or metagenome name
JGI project name
PI name
Any of the various IDs associated with the data you’re seeking

These types of search queries yield the most accurate results. Whenever possible, we encourage you to use the most complete search term you can (for example, a PI’s full name or the full name of a genome); if you don’t have a full search term, enter as much information as you have available.

In addition to updated filter options, our search includes other new features, including typeahead and cross-kingdom searching.

To use the typeahead feature, enter a query in the search box — a list of recommendations will be generated based on the terms or ID you entered. You can either enter your full search query (term) in the search box and hit Return to initiate a search, or you can choose an option from the typeahead list to initiate the search.

The Data Portal defaults to cross-kingdom search; that is, it searches data from all of the individual kingdom portals (Phytozome, PhycoCosm, MycoCosm, and IMG). If you’d like to search data from within a single kingdom portal, you can indicate this using the dropdown menu that appears at the left edge of the search box. Click the downward-facing arrow and choose a portal name from the dropdown list; then enter your search term and hit Return. Your search will return results only from the specified portal. If you’d like to search across all portals, select “Everything” from the dropdown menu.

Search Examples and Tips

See our Search Tips section for examples of how to create more complex searches, information on how JDP's search works and information on filtering.

Why am I seeing certain results?

Seeing Certain Results

How the Data Portal Searches

The JGI’s Data Portal searches more than 200 metadata fields in our Elasticsearch (ES) index for your search parameter. The most important fields are project names, organism names, NCBI taxonomy, and file names. Search terms that match one of these fields will have their search relevancy score boosted and will be presented in the “Most Relevant Results” category.

If you would like to improve the relevancy of your search, please provide a more detailed search parameter.

If it seems like some results are missing, you can set your filter to “Show All Results”, broaden your search, or reach out to the Data Portal team.

What the Data Portal Searches

Only public data that is produced and/or processed at JGI is available through the Data Portal's search mechanism. If you need access to private data, please visit the Genome Portal or (starting June 2024) the My Data Portal section of the Data Portal (https://data.jgi.doe.gov/mydata).

How the Data Portal Displays Search Results

When a single file matches your search criteria, the Data Portal will display all files that are part of that file’s group. The groups are presented as panels that you can expand and collapse.

Occasionally, you may conduct a search and see results where your search term does not appear in the result table. JGI’s Elasticsearch (ES) index contains over 200 metadata fields, and the Data Portal does not display every field it has searched when it reports results in the browser or to a client using the API.

If you expect to see many thousands of results returned from search and you are not seeing them, please be aware that the Data Portal returns a maximum of 10,000 files for any given search. You can try to get more relevant results by providing more detailed or specific search parameters, or you can reach out to the Data Portal team to solve your query needs.

What is the difference between public and private data?

The difference between Public and Private Data

Public data at JGI are data associated with a completed project, for which the embargo period has ended.

Private data at JGI are data with restrictions on their visibility and usage, limited to those with privileged access. Typically, private data is associated with projects that have not been completed or that remain under embargo.

What is the difference between restricted and unrestricted data?

The difference between Restricted and Unrestricted data

The JGI Data Portal provides access to both unrestricted and restricted data. Your role and your relationship to JGI determine how you use the data responsibly.

Unrestricted data is data that has already been published, or that was made available to the public two or more years ago. Unrestricted data can be used by anyone; that is, you don’t need any special permission to download and analyze it. However, you may need to include a citation when using unrestricted data in a publication.

Restricted data is also available to the public, but there are restrictions on how it can be used. In most cases you will need to contact the PI for permission to use and cite the data in your own publication. Data from Phytozome may include additional restrictions that can be found in a genome’s Data Release Policy File. Additional restrictions on data from MycoCosm are described on individual genome pages.

Whether you're working with restricted or unrestricted data, it's always best to contact a PI before using their data in your publication. You can find the PI's email address in the kebab menu for each dataset (the three vertical dots at the far right of the genome/dataset row) or in the file manifest that is provided with your data.

Why is it important to filter my search results?

The benefits of filtering data

The Data Portal has robust filters you can use to locate exactly the data you’re looking for. Using these filters offers a number of benefits: It allows you to download only the data that’s relevant to your research, reducing the amount of time you spend organizing data (and the amount of storage space you need). In addition, it expedites the download process — more narrowly defined data sets download more quickly than larger ones.

Using filters also benefits the larger JGI user community. More targeted requests, especially those that include archived files, can be processed more quickly by JGI; this, in turn, means that more external users can have their files delivered more quickly.

How can I filter data prior to downloading it?

Using filters

Apply filters to a list of genomes, metagenomes, or files to more quickly locate the data you’d like to download.

To apply filters:

If necessary, expand the filter menu by clicking anywhere on the panel labeled Filters. Once the menu is expanded, you’ll see filter groupings for:
  - Environment
  - Taxonomy
  - Dataset
  - File Property
Use the dropdown menus to view available filter criteria.
Select filter criteria by clicking directly on the menu items you’d like to choose; you can select as many criteria as you like from any of the available dropdown menus.

Selected filters will apply to your results list automatically. Applied filters are indicated by blue number icons within the dropdown menus from which filter options were selected; to see what specific filter options you’ve applied, click into the relevant dropdown menus — your choices will be indicated by checked checkboxes.

To clear filters, either deselect individual items from their dropdown menus or click the Clear All button. Your results list will update automatically each time you clear one or multiple filters.

Currently, the Data Portal allows you to filter data at the dataset level and the file level.

Environment filters

On the Data Portal, you can filter by the following environmental filters.

Ecosystem
Ecosystem category
Ecosystem type
Ecosystem subtype
Specific ecosystem

Learn more about these filters.

Taxonomy filters

The Data Portal currently offers four taxonomy level filters:

Class
Order
Family
Genus

Dataset filters

The Data Portal currently offers two dataset-level filters:

Version
Dataset Type (JGI Product Type)

File-level filters by segment

The JGI Data Portal allows you either to search data from across all of its segments or to search within a specific segment. Because the data within each portal is slightly different, each segment has slightly different filtering options.

Everything (all segments)

When you’re searching for data across all segments (portals), you’ll see the following file-level filter options:

File type, which includes a full list of file extensions (e.g., FASTA, GFF, GFF3).
File availability, which indicates whether a file is available for immediate download or must be retrieved from the tape archive.
Data type, which is a further way to describe the type of file (e.g., qc data, raw data, primary alleles, secondary alleles).
Data group, which is a broad description of the type of data sought (analysis data, sequencing data, and so on).
Data usage, which is a way to find data that is unrestricted (i.e., you are not required to contact a PI before you use it in a publication).
File Name Pattern, which is a way to use a pattern (regular expression) to filter files, so you see only those that match your desired pattern.

Example 1:
1. You may want to find all of the files that end in assembled.faa
2. Pattern (regular expression): .*assembled\.faa
Example 2:
1. You want to see only files that end in .gff
2. Pattern: .*\.gff
Example 3:
1. You want to see only .gff and .pdf files
2. Pattern: .*\.gff|.*\.pdf
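If you would like to try a pattern before using it in the filter, you can test it locally. The sketch below runs the three patterns above against a few made-up file names in Python. It assumes the pattern must match the whole file name (which is why the examples begin with `.*`); the file names and the `filter_names` helper are hypothetical, and the Data Portal's own matching may differ in details.

```python
import re

# Patterns copied from the three examples above.
PATTERNS = {
    "example_1": r".*assembled\.faa",
    "example_2": r".*\.gff",
    "example_3": r".*\.gff|.*\.pdf",
}

# Hypothetical file names, for illustration only.
FILE_NAMES = [
    "sample.assembled.faa",
    "sample.genes.gff",
    "sample.genes.gff3",
    "report.pdf",
]


def filter_names(pattern, names):
    """Return the names that the pattern matches in full (assumed behavior)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


matches = {label: filter_names(p, FILE_NAMES) for label, p in PATTERNS.items()}
for label, names in matches.items():
    print(label, names)
```

Under the whole-name assumption, `.*\.gff` does not select `sample.genes.gff3`; a pattern such as `.*\.gff3?` would cover both extensions.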

Downloading

What are archived files? How do they affect downloads?

Archived files

If your download includes archived files, you have requested data that is only available on our tape archive system. These archived files will be ready for download within 24 hours. When your files are ready, you will receive an email from JGI. This email will contain a link to a page with all of your requested files.

Archived files are available for download for 14 days after you request them. We recommend downloading files as soon as they become available.
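If you track several requests, a small calculation can show when each one expires. The sketch below assumes the 14 days are counted in whole calendar days from the request date; this FAQ does not say exactly how the window is measured, and the function name is our own.

```python
from datetime import date, timedelta

AVAILABILITY_DAYS = 14  # availability window stated above


def last_available_day(request_day):
    """Return the date 14 calendar days after the request date."""
    return request_day + timedelta(days=AVAILABILITY_DAYS)


deadline = last_available_day(date(2024, 6, 3))
print(deadline.isoformat())  # 2024-06-17
```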

JGI has approximately 13 petabytes of data and only keeps some of this data available for immediate download. Files that have been requested recently are kept on disk storage, and the rest of the files are stored on physical tapes in our tape archive.

How much data can I download in a given day?

Download Limits

Each user has a daily download limit of 10TB. This limit ensures that everyone using JGI’s portals can access the data they need as efficiently as possible.

The JGI team monitors data requests that exceed the daily limit. If you make a request in excess of the daily download limit, a member of the JGI team may reach out to you to learn more about your data needs. If you need 10TB of data or more, we encourage you to make several smaller downloads over the course of several days to prevent any issues or delays.
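One way to plan several smaller downloads is to group files into daily batches that each stay under the limit. The following is a minimal sketch, not a JGI tool: it assumes 10 TB means 10 × 10^12 bytes, that you already know each file's size, and that a simple greedy grouping in list order is acceptable. The file names and sizes are made up.

```python
TB = 10**12  # assumption: decimal terabytes
DAILY_LIMIT_BYTES = 10 * TB


def plan_daily_batches(files, limit=DAILY_LIMIT_BYTES):
    """Greedily group (name, size) pairs so each batch totals less than the limit."""
    batches, current, used = [], [], 0
    for name, size in files:
        if size >= limit:
            raise ValueError(f"{name} alone reaches the daily limit")
        if used + size >= limit:
            # Adding this file would reach the limit, so start a new day.
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches


# Hypothetical request: four files totalling 13 TB.
requested = [("a", 4 * TB), ("b", 4 * TB), ("c", 4 * TB), ("d", 1 * TB)]
plan = plan_daily_batches(requested)
for day, names in enumerate(plan, start=1):
    print(f"day {day}: {names}")
```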

I'm having trouble downloading large numbers of files

If you are having issues restoring or downloading data, please consider a few things:

Are you searching and initiating downloads through your web browser?

1. If your download will take multiple hours, either because it is large or because your download speed is slow, please consider downloading via Globus (see below).
  - Globus provides a reliable, restartable download experience that will not be impacted by download interruptions.

Are you working with individual files through the API?

If you are requesting restoration of or downloading files individually, you may overwhelm our system if you do not introduce pauses between requests. Please choose one of the following options:
1. Option 1: Introduce a pause of 30 seconds between requests.
2. Option 2: Batch your file restoration or download requests so that you submit multiple files per request.
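Both options can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Data Portal's client: `send_request` stands in for whatever function you use to call the API, the file IDs are made up, and pausing between batches as well as between single-file requests is our own cautious choice. The demonstration passes in recording functions so that it runs instantly instead of really sleeping.

```python
import time

PAUSE_SECONDS = 30  # pause recommended above between individual requests


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def submit_all(file_ids, send_request, batch_size=1,
               pause=PAUSE_SECONDS, sleep=time.sleep):
    """Send file IDs in batches, pausing between consecutive requests.

    `send_request` is a hypothetical callable that takes a list of file IDs.
    Returns the number of requests made.
    """
    batches = chunked(file_ids, batch_size)
    for index, batch in enumerate(batches):
        if index > 0:
            sleep(pause)  # wait before every request after the first
        send_request(batch)
    return len(batches)


# Demonstration with stand-ins that only record what would happen.
sent, pauses = [], []
request_count = submit_all(
    ["f1", "f2", "f3", "f4", "f5"],
    send_request=sent.append,
    batch_size=2,
    sleep=pauses.append,
)
print(request_count, sent, pauses)
```

With `batch_size=1` this behaves like Option 1; a larger `batch_size` corresponds to Option 2 and reduces the number of requests.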

If you are batching multiple files per request through the API

If your download will take multiple hours, either because it is large or because your download speed is slow, please consider downloading via Globus (see below).
- Globus provides a reliable, restartable download experience that will not be impacted by download interruptions.

Command line downloading

What is the command line download option? How do I use it?

Command line Downloads

The command line download option (or the API download option) provides a working curl command. curl is free and comes pre-installed on Mac, most Linux distributions, and Windows 10. If it’s not already installed on your machine, you can download curl here: https://curl.haxx.se/download.html

To download the files you’ve selected, simply:

click the Copy to clipboard button
paste the command on the command line (Terminal)
press Enter

A zip file containing the selected files will be downloaded to the folder in which the curl command was run. You can also choose to view the curl command in its entirety.

Curl is very powerful and flexible, and supports options like resuming interrupted downloads. To understand or customize the command provided here, check the curl tutorial page (https://curl.haxx.se/docs/manual.html) or the complete curl documentation (https://curl.haxx.se/docs/manpage.html)
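As one example of customizing a command, curl's `-C -` option tells curl to work out where a partial download stopped and continue from there. The sketch below adds that option to a command string using Python. The command shown is a placeholder, not the one the Data Portal generates, and whether a resumed transfer succeeds depends on the server supporting it, which this FAQ does not state.

```python
import shlex


def with_resume(curl_command):
    """Return the command as an argument list with curl's `-C -` option added."""
    args = shlex.split(curl_command)
    if "-C" not in args and "--continue-at" not in args:
        # Insert right after the program name; `-` lets curl find the offset.
        args[1:1] = ["-C", "-"]
    return args


# Placeholder command and URL, for illustration only.
placeholder = "curl -o download.zip 'https://example.org/files.zip'"
resumable = with_resume(placeholder)
print(shlex.join(resumable))
```

The returned list can be passed to `subprocess.run`, or joined back into a string and pasted into your terminal.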

Most GUI download managers and API clients can import curl commands. If you have a preferred client, or wish to experiment with visually retaining and organizing requests, you can use curl as a bridge to other applications.

How can I learn how to use the API? Where can I find API documents?

API Documentation

You can find JDP API documentation in Swagger here. Swagger allows you to play with the API in a user-friendly environment.

You can find an API Tutorial here.

Globus Downloading

What is Globus and why should I use it?

What is Globus for?

Globus Online and Globus Connect are fantastic tools that have been developed by the University of Chicago to ease the burden of large data transfers for researchers.

Why should I use Globus?

Globus is fast, reliable, and convenient.

Fast: Globus Online and Globus Connect use multiple streams to transfer your data faster.
Reliable: Globus Online and Globus Connect can handle network errors gracefully so you don’t have to worry about interruptions.
Convenient: Globus Connect Personal manages your downloads automatically until they complete.
It suspends transfers when your computer goes to sleep, and resumes when it turns back on.
More information:
- https://www.globus.org/globus-connect-personal
- https://www.globus.org/data-transfer
Server to Server Transfer: By using your institutional endpoint, your downloads won’t take up hard disk space on your personal computer.

What is an endpoint?

A Globus endpoint is a computer or server with Globus Connect Personal or Globus Connect Server installed and configured to receive or initiate data transfers.

https://docs.globus.org/faq/globus-connect-endpoints/#what_is_an_endpoint

Can I receive data from Globus on my own computer?

Yes, by installing and configuring Globus Connect Personal, you can transfer files directly to your computer.

How do I set up a Globus endpoint on my computer?

https://docs.globus.org/how-to/

Globus sounds pretty cool. How do I create an account?

https://docs.globus.org/how-to/get-started/

Server: My organization has an endpoint that I want to use.

https://app.globus.org/

Laptop: I want to set up an account that isn’t associated with my organization.

https://www.globusid.org/create

Does JGI store any of my Globus information?

JGI stores only the Globus username that you provide. JGI does not store any other Globus credentials.

How will I know when my data is ready to be downloaded via Globus?

JGI’s Data Portal sends you an email when your files are available for transfer.

The files remain available for transfer for 14 days.

How do I use Globus to download data from JGI’s Data Portal?

From the results page,
- when you have selected the files you want to download,
- and if you have an active Globus account,
- and have a Globus endpoint that can accept transfers,
Click the “Download” button
Click “Globus Download”
Click to acknowledge JGI’s Data Utilization Policy
Enter your Globus username and select “Submit”
Wait (up to a day) for an email from JGI’s Data Portal informing you that your files are ready to be transferred
Click on the link in the email and provide the name of your Globus endpoint
Start your transfer

https://docs.globus.org/how-to/get-started

Where can I learn more about Globus?

Globus documentation: https://docs.globus.org/
Globus Online (1 min): https://youtu.be/QwVWJF6nRKI
Globus Video Collection: https://www.youtube.com/user/GlobusOnline/videos
Globus Connect Personal: https://youtu.be/bpnVcAN99WY
What data does Globus store: https://youtu.be/d3cbbUwxMwQ
Globus Online Webcast: https://youtu.be/4hsL6vTc1Yg
Globus 101 (1.5 hrs): https://youtu.be/K17ZZEIvWhg

Privileged Access: Accessing Private Data

What is Privileged Access? How do I see my private data?

What is Privileged Access?

Privileged Access lets you view data to which you have been granted access, i.e., private data that others want you to see.

Through the My Data Portal section of JGI's Data Portal you can gain access to this private data.

How can I see my private data?

You can access My Data Portal by clicking on the My Data Portal link at the top of any JGI Data Portal page, the link in your avatar dropdown, or clicking on this link: https://data.jgi.doe.gov/mydata

Sections of My Data Portal

My Projects
- See a list of your projects: those on which you are PI, as well as those you have been granted access to (Co-PI, collaborator, etc.). Filter and sort to find your projects of interest.
- Grant Access
  - As a Proposal or Project Principal Investigator (or Co-PI), grant other users access to data that is not yet public
- Download Data
- Contact Project Managers
Coming Soon:
- My Requests & Downloads
  - See your data restoration request and download history

How do I gain/grant access to private data?

Requirements

You must have an ORCID ID associated with your JGI account.
- You can create an ORCID ID here: https://orcid.org/
- You can associate your ORCID ID with your JGI account here: https://contacts.jgi.doe.gov/edit_contact

Gaining/Granting Access to Data

Principal Investigators (PIs) and Co-PIs can grant access to JGI Users using ORCID IDs.
- My Data Portal --> Select Projects --> Grant Access
If you want to request access to a dataset, you need to reach out to a PI or Co-PI of the dataset. There is no mechanism for searching JDP for private datasets to which you do not have access.

Important Note:

JGI will be moving to a new authentication/authorization system. Associating your ORCID ID with your JGI account now will ease the move to the new system, which will require authentication (login) through ORCID.
