MICRO 101 @ SJSU - Working with SRA single-end data

Working with SRA Single-end Data

NCBI SRA (Sequence Read Archive) is a public repository for sequencing data. The following example shows how to obtain single-end sequencing data from SRA and perform metagenomic analysis using Galaxy.

Obtain sequencing data from SRA into Galaxy

Here is an SRA record containing a collection of sequencing data: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP133149&o=acc_s%3Aa

1. How to Download Sequence Data

Check the boxes for the data files you want to download, then click "Accession List" under the "Selected" section to download the list of SRA accession IDs. Save the text (.txt) file to your computer and open it with a text editor. You should see a list of the SRR accession IDs you selected.
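Before pasting the list into Galaxy, it can help to check that the file contains only run accessions. The following is a minimal sketch, not part of the Galaxy workflow: the function name parse_accession_list is hypothetical, and the second accession in the example is a made-up placeholder.

```python
import re

# Run accessions start with SRR (NCBI), ERR (ENA), or DRR (DDBJ), followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def parse_accession_list(text):
    """Return the unique run accessions in the text, in their original order."""
    accessions = []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        if not RUN_ACCESSION.match(candidate):
            raise ValueError(f"Not a run accession: {candidate!r}")
        if candidate not in accessions:
            accessions.append(candidate)
    return accessions


# SRR6752086 is used later in this tutorial; SRR0000001 is a placeholder.
example = "SRR6752086\nSRR0000001\n\nSRR6752086\n"
print(parse_accession_list(example))
```

Duplicates and blank lines are dropped, and anything that does not look like a run accession raises an error, so a stray header or comment is caught before upload.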

2. How to Upload the List of SRA Accession IDs to Your Galaxy Project

Click the arrow button in the upper left-hand corner of your Galaxy project page.

Click the "Paste/Fetch Data" button. Copy the SRA accession IDs from the text file into the text box, then click "Start". In this example, only 4 SRA accession IDs are used. A text file named "Pasted Entry", containing the list of accession IDs, will appear in the right panel.

3. Extract the FASTQ sequencing data from SRA.

Under Tools, type SRA in the search tools text box.
Select Faster Download and Extract Reads in FASTQ format from NCBI SRA from the list.
Under select input type, select List of SRA accession, one per line.
Under sra accession list, select Pasted Entry, which contains the list of SRA accession IDs obtained in Step 2 above.
Leave the rest of the options as default, and click Execute.

On the right side panel, you should see four groups of output. If the sequencing data are from a single-end sequencing experiment, your FASTQ files will be under Single-end data (fasterq-dump). If the sequencing was performed in a paired-end fashion, you will find the FASTQ data in Pair-end data (fasterq-dump). Because our data come from a single-end sequencing experiment, we will continue our analysis using the FASTQ files in Single-end data (fasterq-dump).
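Galaxy sorts the output into these groups for you. For intuition only, here is a sketch of the same distinction for files produced by running fasterq-dump outside Galaxy, which this tutorial does not do. It assumes the tool's default file naming, where single-end reads land in ACCESSION.fastq and paired-end reads in ACCESSION_1.fastq and ACCESSION_2.fastq. The function name classify_layout and the placeholder accession SRR0000001 are hypothetical.

```python
import re
from collections import defaultdict

# Matches ACCESSION.fastq, ACCESSION_1.fastq, and ACCESSION_2.fastq.
FASTQ_NAME = re.compile(r"(?P<acc>[A-Za-z]+\d+)(?:_(?P<mate>[12]))?\.fastq")


def classify_layout(filenames):
    """Label each accession as 'single' or 'paired' based on its FASTQ file names."""
    mates = defaultdict(set)
    for name in filenames:
        match = FASTQ_NAME.fullmatch(name)
        if match is None:
            continue  # ignore files that do not follow the assumed naming
        mates[match.group("acc")].add(match.group("mate"))
    # An accession is paired only if both the _1 and _2 files are present.
    return {
        acc: "paired" if {"1", "2"} <= seen else "single"
        for acc, seen in mates.items()
    }


files = ["SRR6752086.fastq", "SRR0000001_1.fastq", "SRR0000001_2.fastq"]
print(classify_layout(files))
```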

Data Quality Control

Create FASTA files from single-end reads
Data Cleaning
Optimize files for computation

1. Create FASTA files from single-end reads

In the main tutorial section, the forward and reverse reads from a paired-end experiment are combined into a single sequence using Make.contigs. Single-end reads do not need that step, but the FASTQ files must still be converted to FASTA format.

In the Tools search box on the left panel, type Fastq.info, then click on Fastq.info under Mothur.

Select the folder icon under fastq - Fastq Sequence file.
From the dropdown, select Single-end data (fasterq-dump).
Click Execute.

The resulting FASTA files are now stored under Fastq.info on collection ... : fasta. If you click on it, you will see the SRA accession IDs, each of which contains a FASTA file. Click on the eye icon to examine the FASTA file.

You should see that each sequence starts with a FASTA header ">SRR...." followed by the nucleotide sequence.
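To make the conversion concrete, here is a minimal sketch of what turning FASTQ into FASTA involves: each four-line FASTQ record (header, sequence, "+" line, quality string) becomes a two-line FASTA record, and the quality string is dropped. This is an illustration only and not how Fastq.info is implemented; the function name fastq_to_fasta and the example read are made up.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records into two-line FASTA records."""
    lines = [line for line in fastq_text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must have four lines per record")
    fasta = []
    for i in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[i:i + 4]
        record_number = i // 4 + 1
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record {record_number}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in record {record_number}")
        fasta.append(">" + header[1:])  # '@' becomes '>'
        fasta.append(sequence)          # the quality line is discarded
    return "".join(line + "\n" for line in fasta)


# A made-up read, used only to show the change in format.
example = "@SRR6752086.1 example\nACGTNACG\n+\nIIIIIIII\n"
print(fastq_to_fasta(example), end="")
```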

2. Clean Data

2.1 Get summary statistics of the reads

Enter Summary.seqs in the Tools search box on the left panel.
Click on Summary.seqs under Mothur.
Under fasta - Dataset, click on the folder icon. From the dropdown, select Fastq.info on collection... : fasta
Select Yes under Output logfile?
Click Execute

The Summary.seqs results appear on the right panel. Click on Summary.seqs on collection ... : logfile to see the SRA accession IDs, then click on one of them and click the eye icon to see the result for that sample. The details are displayed in the center panel. Scroll to the bottom of the page to see the statistics of the sequencing reads.
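The statistics that matter for the next step are read length, the number of ambiguous bases, and the longest homopolymer run. The sketch below computes a simplified version of these for a list of sequences. It is not Mothur's implementation: the function names are hypothetical, and treating any character other than A, C, G, or T as ambiguous is an assumption.

```python
from itertools import groupby
from statistics import median


def longest_homopolymer(sequence):
    """Length of the longest run of one repeated base."""
    return max((len(list(run)) for _, run in groupby(sequence)), default=0)


def summarize_reads(sequences):
    """Return simple length, ambiguity, and homopolymer statistics."""
    if not sequences:
        raise ValueError("No sequences to summarize")
    upper = [seq.upper() for seq in sequences]
    lengths = [len(seq) for seq in upper]
    # Assumption: any base other than A, C, G, or T counts as ambiguous.
    ambigs = [sum(base not in "ACGT" for base in seq) for seq in upper]
    return {
        "num_seqs": len(upper),
        "min_length": min(lengths),
        "median_length": median(lengths),
        "max_length": max(lengths),
        "max_ambigs": max(ambigs),
        "max_homopolymer": max(longest_homopolymer(seq) for seq in upper),
    }


# Made-up reads for illustration.
print(summarize_reads(["ACGTACGT", "AAAANCGT", "ACG"]))
```

Statistics like these are what guide the maxlength and maxambig settings chosen in the filtering step.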

2.2 Filter reads based on length and ambiguous bases

Type Screen.seqs under the Tool search box.
Select Screen.seqs under Mothur.
Under fasta - Fasta to screen, click the folder icon. From the dropdown, select Fastq.info on collection ... : fasta
Under maxlength - Remove sequences longer than ..., enter the maximum sequence length you want to keep. In this example, we will enter 469, based on the output from Summary.seqs.
Under maxambig - Remove sequences with ambiguous bases ..., enter 0.
Leave the rest of the settings as default.
Click Execute

The Screen.seqs results will appear on the right panel.
Click on Screen.seqs across collection to see the SRA accession IDs, then click on one of them to see the results for that sample.
The SRR6752086 FASTA file now has 13,916 sequences. The FASTA result from Fastq.info on collection ... : fasta in Step 1 had 15,140 reads, so Screen.seqs has removed 1,224 unwanted sequences.
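The two settings used above amount to a simple rule: keep a read only if it is no longer than maxlength and has no more than maxambig ambiguous bases. A minimal sketch of that rule follows. It is not Mothur's code; the function name screen_reads is hypothetical, and counting any base other than A, C, G, or T as ambiguous is an assumption.

```python
def screen_reads(records, maxlength=469, maxambig=0):
    """Keep (name, sequence) pairs that pass the length and ambiguity limits."""
    kept = []
    for name, sequence in records:
        # Assumption: any base other than A, C, G, or T counts as ambiguous.
        ambiguous = sum(base not in "ACGT" for base in sequence.upper())
        if len(sequence) <= maxlength and ambiguous <= maxambig:
            kept.append((name, sequence))
    return kept


# Made-up reads; a small maxlength is used so the effect is visible.
reads = [("read1", "ACGT"), ("read2", "ACGTACGT"), ("read3", "ACNT")]
print(screen_reads(reads, maxlength=6, maxambig=0))
```

Here read2 is dropped for being too long and read3 for containing an N, leaving only read1.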

3. Optimize files for computation

Microbiome samples typically contain large numbers of the same organism, so we expect to find many identical sequences in our data. To speed up computation, we first determine the unique reads and then record how many times each of these reads was observed in the original dataset. We do this using the Unique.seqs tool.

Remove duplicate sequences

Type Unique.seqs under the Tool search box on the left panel.
Select Unique.seqs under Mothur.
Under fasta - Sequences to filter, click the folder icon. From the dropdown, select Screen.seqs across collection ...
Under output format, select Name file from the dropdown.
Click Execute.

The results from Unique.seqs will appear on the right panel. Select Unique.seqs on collection ... : fasta to see the SRA accession IDs, then click on one of them to see the details. In this example, the duplicate reads of SRR6752086 are removed, and only 9,866 sequences will be used in the downstream metagenomic analysis.
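The idea behind this step can be sketched in a few lines: group reads by their exact sequence, keep one representative per group, and record which reads each representative stands for. The sketch below writes that record in the two-column layout of a Mothur name file (representative name, a tab, then a comma-separated list of all read names in the group). It is an illustration, not Mothur's implementation, and the function name unique_reads is hypothetical.

```python
def unique_reads(records):
    """Collapse identical sequences; return (fasta_text, name_file_text)."""
    groups = {}  # sequence -> names of the reads that have it, in input order
    for name, sequence in records:
        groups.setdefault(sequence, []).append(name)

    fasta_lines = []
    name_lines = []
    for sequence, names in groups.items():
        representative = names[0]  # the first read seen represents the group
        fasta_lines.append(f">{representative}\n{sequence}")
        name_lines.append(f"{representative}\t{','.join(names)}")

    fasta_text = "".join(line + "\n" for line in fasta_lines)
    name_text = "".join(line + "\n" for line in name_lines)
    return fasta_text, name_text


# Made-up reads: read1 and read2 are identical.
reads = [("read1", "ACGT"), ("read2", "ACGT"), ("read3", "TTTT")]
fasta_text, name_text = unique_reads(reads)
print(fasta_text, end="")
print(name_text, end="")
```

Downstream tools then work on the smaller FASTA file, and the name file preserves how many original reads each unique sequence represents.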

Because the analysis results for each sample are stored in individual folders labeled with the SRA accession IDs, select the folder icon in the downstream analysis steps to find the data for analysis. The data will be analyzed in "batch mode", where each sample is analyzed simultaneously as a separate job.
