Please let us know if you have further questions, if we can improve this section, or if you have requests/suggestions by contacting us at jdp@lbl.gov.
We offer two mailing lists to keep you up to date on new features and documentation.
Browser Interface Mailing List
Announcements regarding enhancements that will be reflected in (or documentation related to) the browser.
API Mailing List
For those who write code, announcements regarding enhancements that will be reflected in (or documentation related to) the API.
How to join our mailing lists.
With the JGI Data Portal (JDP) Application Programming Interface (API), you can automate your downloads or batch them programmatically however you want.
The purpose of this tutorial is to provide a starting point for those who are new to the API. In this tutorial, you will find an overview of several use cases, as well as specific examples. We hope this saves you time and answers most of your initial questions.
If you want to dive right into the API, our API Documentation provides an interactive environment for constructing and testing API calls.
However, before jumping directly into the above interactive documentation, please take a few moments to peruse the rest of this tutorial to learn the basics of how to search, filter, and download data with our API.
We have designed our search endpoint to cover two primary use cases:
You want to download many organisms and don't want to interact directly with pages in the browser interface.
You want to run automated searches.
The Download API has been designed primarily to cover the case where you want to download data to a remote server.
Option 1: You can use our interactive API Documentation page to construct your download API call.
Option 2: You can use the browser interface to construct your list of files to download (search --> add to cart --> view cart), and select "Command Line Download" while in the cart. Copy this command and use this to download your files to a remote host.
Our search engine is backed by Elasticsearch, and we index a large number of metadata fields for our files. However, since our data is over 13 petabytes in size, the files themselves are archived for long-term storage and need to be queued for retrieval before they can be downloaded. Retrieval from our archive typically takes less than an hour (but can take up to one night), and depends on the number of concurrent requests (from all users).
The JDP API makes JGI's public data available for download.
Currently, data must belong to one of the following scientific programs: microbial, metagenome, fungal, algal or plant.
Follow the links to find public data from the following scientific programs: secondary metabolites, metabolomics, synthetic biology.
In order to restore files or download files, you will need to provide your authorized session token. You can get this by logging in to JDP in your browser and clicking on "Copy My Session Token" in the avatar dropdown menu of your web browser.
Before specifying any file filters or organism filters, you must first conduct an initial search (e.g., for an organism) and then examine the results to determine the available filter values.
This is because our filter values are dynamic and based on what you search for. Although our filter categories (parameters) are always the same, the available values in a filter category will vary based on the metadata in the files that match the initial search term.
Therefore, an initial search should be performed with no filters specified. The server response should then be examined to determine which filter values are available. Finally, the search can be repeated with file filters and/or organism filters applied.
For a more detailed procedure, please refer to the High-Level Workflow (Using API) section directly below.
Construct your initial search query, e.g., search for an organism (e coli)
Option 1: Browser Interface
Submit your initial query in the browser interface
Filter your data to focus on the files you care about
Copy the Search API call by pressing "API Search Query Link"
Option 2: Interactive API Environment
Select the endpoint you want to use:
search (for general search)
img_file_list (search for IMG (microbial & metagenome) files only)
mycocosm_file_list (search for Mycocosm (fungal) files only)
phytozome_file_list (search for Phytozome (plant) files only)
Construct and submit your initial query using the Interactive Environment
Review the available filters (keys) and values in the "facets" section of the JSON output
Update your filters in the interactive environment as necessary.
Parse the JSON output to retrieve your file IDs.
See the Parse the JSON Payload section (directly below) for useful information.
(If necessary) Request that your files be restored using the request_archived_files endpoint.
Check the status of your restoration request(s) using the request_archived_files/requests endpoint.
Download your data using the collection of file IDs using the download_files endpoint.
We strongly recommend that you use the Interactive Environment to construct and test your API calls. This is the best way to ensure that your queries will work.
You can view the set of available filter categories by visiting our API Documentation, where you can explore the API interactively.
The JSON payload that you will receive is used by the JDP front-end as well as by API users. As a result, the payload includes front-end information that is unlikely to be useful for API users. This section provides an orientation on important sections of the JSON payload and on sections that may be less useful (or could possibly cause confusion).
Example: https://files.jgi.doe.gov/search/?q=e+coli+BW25113+JBEI-FM002
Example with highlighted fields of interest: search payload
The JSON payload provides pages of data. By default:
the first page of search results is returned
10 datasets are returned on that first page
You will need to either iterate through all of the pages to see all of the data that has been returned, or update the datasets_per_page parameter to view all of the data on one page.
In your search query:
x controls the number of datasets per page
p controls the page number that is returned
Example: https://files.jgi.doe.gov/search/?q=e+coli&x=40&p=2
This will return the 2nd page with 40 datasets per page, i.e., items 41 through 80.
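The paging parameters can be sketched in a few lines of Python. This is a minimal illustration, not an official client: the helper names build_search_url and item_range are hypothetical, and the sketch assumes page numbers start at 1, which is consistent with the example above (p=2 with x=40 covers items 41 through 80).

```python
from urllib.parse import urlencode

SEARCH_URL = "https://files.jgi.doe.gov/search/"


def build_search_url(query, per_page=10, page=1):
    """Build a search URL with x (datasets per page) and p (page number)."""
    params = {"q": query, "x": per_page, "p": page}
    # urlencode() turns the space in "e coli" into "+", as in the example URL.
    return SEARCH_URL + "?" + urlencode(params)


def item_range(per_page, page):
    """Return the 1-based (first, last) item positions covered by a page."""
    first = (page - 1) * per_page + 1
    last = page * per_page
    return first, last


url = build_search_url("e coli", per_page=40, page=2)
first, last = item_range(per_page=40, page=2)
print(url)
print(f"items {first} through {last}")
```

To walk through all results, one would call build_search_url with increasing values of page and stop once a page comes back without datasets.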
The organisms array element of the JSON payload represents the datasets and files that have been returned.
The organisms[x].agg and organisms[x].agg_id keys indicate how the dataset is grouped and by what value.
The organisms[x].id key is a concatenation of these 2 values.
The organisms[x].top_hit section provides a summary of the files in that dataset/organism. This section is used to populate the dataset row on the browser interface. For most API users, this section can be ignored.
Values for keys such as proposal, proposal PI, GOLD, NCBI taxon, and FD Project Name should be consistent across all files in the dataset/organism.
The organisms[x].top_hit.file_name and organisms[x].top_hit._id in this section can be ignored.
The organisms[x].files[y] item is where you can find the list of files for a particular organism.
The organisms[x].files[y]._id item is where you can find the ID that will be submitted to the data restoration or data download endpoints.
The organisms[x].files[y].metadata section is where you can find metadata for each file.
organisms[x].files[y].file_status indicates whether the file is on tape (PURGED) or disk (RESTORED).
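The keys described above can be walked with ordinary dictionary and list access. The sketch below is illustrative only: the function names are hypothetical, and sample_payload is a hand-written stand-in that contains just the keys named in this section (real payloads carry many more, and the file_status values shown here are made up for the example).

```python
def collect_file_ids(payload):
    """Map each organisms[x].id to the list of its organisms[x].files[y]._id."""
    ids_by_organism = {}
    for organism in payload.get("organisms", []):
        file_ids = [f["_id"] for f in organism.get("files", [])]
        if file_ids:
            ids_by_organism[organism["id"]] = file_ids
    return ids_by_organism


def purged_file_ids(payload):
    """List the IDs of files whose file_status is PURGED (still on tape)."""
    return [
        f["_id"]
        for organism in payload.get("organisms", [])
        for f in organism.get("files", [])
        if f.get("file_status") == "PURGED"
    ]


# Hand-written stand-in for a decoded search response (assumption, not real data).
sample_payload = {
    "organisms": [
        {
            "id": "IMG_SP-1060116",
            "files": [
                {"_id": "5503e95a0d878525404e38d7", "file_status": "RESTORED"},
                {"_id": "5f8b782e47675a20c850eda3", "file_status": "PURGED"},
            ],
        },
        {
            "id": "IMG_SP-1060115",
            "files": [
                {"_id": "55044bb00d878525404e3a3f", "file_status": "PURGED"},
            ],
        },
    ]
}

print(collect_file_ids(sample_payload))
print(purged_file_ids(sample_payload))
```

The first helper produces the organism-to-file-IDs mapping, and the second isolates the files that would need a restoration request before download.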
Facets are keys in the JSON payload that are used as filters.
Facets can be found near the end of the JSON payload.
The values shown are the values by which you can filter your initial query.
In order to restore files or download files via the API, you will need to provide your session token. You can get this by clicking on Copy My Session Token in the avatar dropdown menu of the browser application (after you login).
The request_archived_files POST endpoint will request that your files of interest be restored to disk (from tape).
https://files.jgi.doe.gov/request_archived_files/
The files in your JSON payload will have a file_status with one of the following values:
RESTORED
This means that the file is currently available for download.
Skip to the Download Files via API section below
PURGED
This means that the file needs to be restored from our archive (tape system) to disk before it can be downloaded.
NOTE: You can play it safe and always send a request to restore files before downloading your files. JGI will not restore files if they are already available for download.
Arguments
"ids":
current version: this is a dictionary of dictionaries keyed by organism "id", each holding "file_ids", a "top_hit" and (when necessary) a "mycocosm_portal_id"; the file IDs are collected from organisms[x].files[y]._id
"send_mail"
true/false
do you want to be notified by email when the files are ready to be downloaded, i.e., after they have been restored to disk?
"api_version"
required...must be set to 2.
Character Limit
You cannot submit more than 4094 characters to our back-end endpoint.
If you need to submit a payload greater than 4094 characters, submit a file to the "-d" argument in the curl command (examples)
Return Values
The endpoint will return request_status_url, which is a URL that you can visit to check the status of your restoration request.
{
  "ids": {
    "Mycocosm_AP-1184792": {
      "file_ids": ["51d4c073067c014cd6ea7469", "51d4c1cf067c014cd6ea859e"],
      "top_hit": "59cad2a27ded5e2f1869132c",
      "mycocosm_portal_id": "Aspni7"
    }
  },
  "send_mail": true,
  "api_version": "2"
}
{
  "ids": {
    "Mycocosm_AP-1184792": {
      "file_ids": ["51d4c073067c014cd6ea7469", "51d4c1cf067c014cd6ea859e"],
      "top_hit": "59cad2a27ded5e2f1869132c",
      "mycocosm_portal_id": "Aspni7"
    },
    "IMG_AP-1146261": {
      "file_ids": ["595b9afd7ded5e5270eef127"],
      "top_hit": "595a83767ded5e5270eeebc8"
    },
    "Phytozome-167": {
      "file_ids": ["6643ff0653447aa389b9c859", "6643ff0753447aa389b9c865"],
      "top_hit": "67b5004267ef7b237e865486"
    }
  },
  "send_mail": true,
  "api_version": "2"
}
This version is old, but still valid.
{
  "ids": [
    "5503e95a0d878525404e38d7"
  ],
  "send_mail": true,
  "api_version": "2"
}
This version is old, but still valid.
{
  "ids": [
    "5503e95a0d878525404e38d7", "55044bb00d878525404e3a3f"
  ],
  "send_mail": true,
  "api_version": "2"
}
curl -X POST "https://files.jgi.doe.gov/request_archived_files/" -H "accept: application/json" -H "Authorization: {paste copied session token here}" -H "Content-Type: application/json" -d "{ \"ids\": [ \"5503e95a0d878525404e38d7\" ], \"send_mail\": true, \"api_version\": \"2\"}"
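For those who prefer Python to curl, the same restoration request can be sketched with the standard library. This is an assumption-laden illustration rather than an official client: build_restore_request is a hypothetical helper, the token string is a placeholder, and the body uses the older list-of-IDs format shown in the curl command above. The request is only built here, not sent.

```python
import json
import urllib.request

RESTORE_URL = "https://files.jgi.doe.gov/request_archived_files/"


def build_restore_request(file_ids, session_token, send_mail=True):
    """Build (but do not send) a POST request asking for files to be restored."""
    body = {"ids": list(file_ids), "send_mail": send_mail, "api_version": "2"}
    return urllib.request.Request(
        RESTORE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Authorization": session_token,  # paste your copied session token
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_restore_request(
    ["5503e95a0d878525404e38d7"], "SESSION_TOKEN_PLACEHOLDER"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually submit the request (requires network access and a valid token):
#     with urllib.request.urlopen(request) as response:
#         reply = json.load(response)
# According to the Return Values above, the reply should contain request_status_url.
```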
You can check the status of your files by visiting (in your browser or via curl) the link that was returned to you when you submitted your file restoration request.
https://files.jgi.doe.gov/request_archived_files/requests/#######
Example:
Non-Globus File Restoration Request: https://files.jgi.doe.gov/request_archived_files/requests/473580
File Restoration Status for Globus Download: https://files.jgi.doe.gov/request_archived_files/requests/488421
This payload will return information about the restoration request:
status:
NEW
The request has been collected.
STAGING
For Globus downloads...the data is being moved to the Globus download endpoint.
PENDING
The file restoration request has been made. Some files may be ready, while others have not been transferred to disk.
READY
All files are available on disk and are available to be downloaded.
EXPIRED
Some or all of the files have been purged from disk. You will need to submit another file restoration request.
expiration_date
The date that files will be purged from disk (unless one or more files are requested again).
file_ids
The IDs of the files in your request.
globus_download_url
The URL that can be used to start downloading files through Globus.
This field will be provided regardless of the status of the request.
If you set send_mail = true, you will be notified via email when your files have all been restored.
You will be given a link to download your files through the browser.
This link will direct you to a page that provides a bit more information than the request_status_url that was returned when the restoration request was made.
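If you would rather poll the status URL from a script than wait for the email, the status values listed above suggest a simple loop: keep checking while the request is NEW, STAGING or PENDING, stop on READY, and give up on EXPIRED. The sketch below is one possible shape for such a loop. The function name wait_until_ready, the polling interval and the poll limit are all assumptions, and the status fetch is passed in as a callable so the example runs without network access.

```python
import time


def wait_until_ready(fetch_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll a restoration request until its status is READY.

    fetch_status is a zero-argument callable returning the decoded status
    payload (a dict with a "status" key), e.g. a function that GETs the
    request_status_url and parses the JSON reply.
    """
    for attempt in range(max_polls):
        payload = fetch_status()
        status = payload.get("status")
        if status == "READY":
            return payload
        if status == "EXPIRED":
            raise RuntimeError("Request expired; submit a new restoration request.")
        # NEW, STAGING and PENDING all mean "not ready yet": wait and retry.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError("Files were not ready after the allowed number of polls.")


# Offline demonstration with canned replies instead of real HTTP calls.
canned_replies = iter(
    [
        {"status": "NEW"},
        {"status": "PENDING"},
        {"status": "READY", "file_ids": ["5503e95a0d878525404e38d7"]},
    ]
)
result = wait_until_ready(lambda: next(canned_replies), sleep=lambda seconds: None)
print(result["status"])
```

Given that retrieval typically takes less than an hour but can take up to a night, a polling interval of a minute or more is probably a reasonable starting point.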
NOTE: The download endpoint exists on a different host than the search and restore endpoints.
Download host: files-download.jgi.doe.gov
After files have been restored (from tape to disk), the download_files POST endpoint will download files in the list you provide.
https://files-download.jgi.doe.gov/download_files/
Arguments:
"ids"
This is a dictionary that maps each organism ID to a list of 24-character file IDs collected from organisms[x].files[y]._id
Example format:
"ids" : {
data_portal_organism_id_1 : [24_char_file_id_1-1, 24_char_file_id_1-2, ... , 24_char_file_id_1-n],
data_portal_organism_id_2 : [24_char_file_id_2-1, 24_char_file_id_2-2, ... , 24_char_file_id_2-n]
}
Example format (with fields from search endpoint payload):
"ids" : {
organisms[0].id : [organisms[0].files[0]._id, organisms[0].files[1]._id],
organisms[1].id : [organisms[1].files[0]._id, organisms[1].files[1]._id]
}
"api_version"
required...must be set to 2.
Character Limit:
You cannot submit more than 4094 characters to our back-end endpoint.
If you need to submit a payload greater than 4094 characters, submit a file to the "-d" argument in the curl command (examples)
Return Values:
Output zip file
{
  "ids": {
    "Mycocosm_AP-1184792": {
      "file_ids": ["51d4c073067c014cd6ea7469", "51d4c1cf067c014cd6ea859e"],
      "top_hit": "59cad2a27ded5e2f1869132c",
      "mycocosm_portal_id": "Aspni7"
    }
  },
  "send_mail": false,
  "api_version": "2"
}
{
  "ids": {
    "Mycocosm_AP-1184792": {
      "file_ids": ["51d4c073067c014cd6ea7469", "51d4c1cf067c014cd6ea859e"],
      "top_hit": "59cad2a27ded5e2f1869132c",
      "mycocosm_portal_id": "Aspni7"
    },
    "IMG_AP-1146261": {
      "file_ids": ["595b9afd7ded5e5270eef127"],
      "top_hit": "595a83767ded5e5270eeebc8"
    },
    "Phytozome-167": {
      "file_ids": ["6643ff0653447aa389b9c859", "6643ff0753447aa389b9c865"],
      "top_hit": "67b5004267ef7b237e865486"
    }
  },
  "send_mail": false,
  "api_version": "2"
}
This version is old, but still valid.
{
  "ids": {
    "IMG_SP-1060116": ["5503e95a0d878525404e38d7"]
  },
  "api_version": "2"
}
This version is old, but still valid.
{
  "ids": {
    "IMG_SP-1060116": ["5503e95a0d878525404e38d7", "5f8b782e47675a20c850eda3"]
  },
  "api_version": "2"
}
This version is old, but still valid.
{
  "ids": {
    "IMG_SP-1060116": ["5503e95a0d878525404e38d7", "5f8b782e47675a20c850eda3"],
    "IMG_SP-1060115": ["55044bb00d878525404e3a3f"]
  },
  "api_version": "2"
}
curl -X POST "https://files-download.jgi.doe.gov/download_files/" -H "accept: application/json" -H "Authorization: {paste copied session token here}" -H "Content-Type: application/json" -d "{ \"ids\": { \"IMG_SP-1060116\":[ \"5503e95a0d878525404e38d7\", \"5f8b782e47675a20c850eda3\"], \"IMG_SP-1060115\":[\"55044bb00d878525404e3a3f\"]}, \"api_version\": \"2\"}" --output {enter your zip filename here}.zip
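A Python counterpart to the curl command above might look like the following. It is a sketch under stated assumptions: build_download_request and save_zip are hypothetical helper names, the token is a placeholder, and the body uses the organism-ID-to-file-ID-list format from the curl example. Only the request construction is executed here; save_zip needs network access, a valid token, and files that have already been restored.

```python
import json
import shutil
import urllib.request

DOWNLOAD_URL = "https://files-download.jgi.doe.gov/download_files/"


def build_download_request(ids_by_organism, session_token):
    """Build (but do not send) a POST request for the download_files endpoint."""
    body = {"ids": ids_by_organism, "api_version": "2"}
    return urllib.request.Request(
        DOWNLOAD_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Authorization": session_token,  # paste your copied session token
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_zip(request, zip_path):
    """Send the request and stream the returned zip file to disk."""
    with urllib.request.urlopen(request) as response, open(zip_path, "wb") as out:
        shutil.copyfileobj(response, out)


download_request = build_download_request(
    {
        "IMG_SP-1060116": ["5503e95a0d878525404e38d7", "5f8b782e47675a20c850eda3"],
        "IMG_SP-1060115": ["55044bb00d878525404e3a3f"],
    },
    "SESSION_TOKEN_PLACEHOLDER",
)
print(download_request.get_method(), download_request.full_url)

# To download for real: save_zip(download_request, "my_files.zip")
```

Note that the request goes to the download host (files-download.jgi.doe.gov), not to the host used for search and restoration.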
Scenario: You want to get an API call to find Fasta files for e coli:
Step 1: Enter e coli into the search bar on the JDP homepage
Step 2: Select the Fasta file type
Step 3: Press the API icon to copy the search query
Step 4: Paste the API call into your browser or terminal to view the JSON payload.
You should see the following: https://files.jgi.doe.gov/search/?q=e+coli&ff=%7B%22file_type%22:[%22fasta%22]%7D
Step 5: Parse the JSON payload to collect the file IDs you want to download.
Step 6: Construct your file restoration and download commands (we recommend using the Interactive Environment) and run the command in your terminal.
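The URL from Step 4 can also be assembled in code rather than copied from the browser. The sketch below assumes, based only on that example URL, that the ff parameter carries a URL-encoded JSON object mapping a filter category to a list of values; build_filtered_search_url is a hypothetical helper name.

```python
import json
from urllib.parse import quote, quote_plus

SEARCH_URL = "https://files.jgi.doe.gov/search/"


def build_filtered_search_url(query, file_filters):
    """Build a search URL with a file filter (ff) given as a dict of lists."""
    # Compact JSON (no spaces) keeps the encoded value short.
    ff_json = json.dumps(file_filters, separators=(",", ":"))
    # Leave ":", "[", "]" and "," unescaped to mirror the copied browser URL;
    # braces and double quotes are percent-encoded.
    return (
        SEARCH_URL
        + "?q=" + quote_plus(query)
        + "&ff=" + quote(ff_json, safe=":[],")
    )


filtered_url = build_filtered_search_url("e coli", {"file_type": ["fasta"]})
print(filtered_url)
```

The available filter values depend on the initial search, so the dictionary passed in should use values taken from the "facets" section of an unfiltered search response.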
As of July 14, 2023, you can use the API to search for values within certain specific fields.
Use the f parameter (for "fields") to indicate a specific field you would like to search in conjunction with q (for "query").
Allowed values for f are:
srr
biosample
project_id
this is really more of a jgi_entity_id search parameter
this will search pre-2012 JGI project IDs (legacy project IDs), current generation JGI project IDs (Final Deliverable, Sequencing Project, Analysis Project, and Proposals)
library
this will allow you to search by JGI library names
img_taxon_oid
this will allow you to search by IMG Taxon OIDs
Example: https://files.jgi.doe.gov/search/?q=BWZZN&f=library&a=false&h=false&d=asc&p=1&x=10&api_version=2
Experiment with these parameters in the Interactive Environment.
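A small helper can guard against typos in the f parameter before a request is sent. This is an illustrative sketch: build_field_search_url is a hypothetical name, the allowed values are copied from the list above, and the other parameters in the example URL (a, h, d, p, x, api_version) are deliberately left out because this section does not describe them.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://files.jgi.doe.gov/search/"

# Allowed values for f, as listed in this section.
ALLOWED_FIELDS = {"srr", "biosample", "project_id", "library", "img_taxon_oid"}


def build_field_search_url(query, field):
    """Build a search URL that restricts the query (q) to one field (f)."""
    if field not in ALLOWED_FIELDS:
        allowed = ", ".join(sorted(ALLOWED_FIELDS))
        raise ValueError(f"Unsupported field {field!r}; expected one of: {allowed}")
    return SEARCH_URL + "?" + urlencode({"q": query, "f": field})


field_url = build_field_search_url("BWZZN", "library")
print(field_url)
```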