Text in red indicates a warning that you must not ignore.
Text in bold and green indicates something that requires modification for individual users, i.e. do not simply copy and paste without properly adapting it first.
Text in bold and orange indicates something that requires modification for At vs. Soy work.
User-maintained software space can be requested through the HPC. For our course, the software is INSTRUCTOR maintained and is located at the path: /usr/local/usrapps/bitcpt. If you would like to use additional software beyond fastqc, star, or salmon, you will need to ask your instructor to install it, as YOU do not have permission from the HPC to install user-maintained software.
To use user-maintained software like fastqc, star, or salmon, you will need to give the location of the software and its executable file, like this:
/usr/local/usrapps/bitcpt/fastqc/bin/fastqc
To check that fastqc is working, you can print the fastqc help screen, which also lists useful options, right before running it.
fastqc -h
fastqc sequence_file_name.fq.gz
so long as you are in the working directory where the data files are stored
fastqc /path/directory/subdirectory/sequence_file_name.fq.gz
giving the full path will always work no matter your current working directory
We won't be using either of these methods because we are all sharing our data, and all of our output files would overwrite one another in the RawData directories. However, if you did run one data file at a time, this is an example of what your output would look like for one data file (FastQC names its reports after the input file):
Ex: At-Leaf_L02_1.fq.gz produces At-Leaf_L02_1_fastqc.html and At-Leaf_L02_1_fastqc.zip
To run fastqc on many files and specify a new location for their output (which is what we will do), you will modify and use a script.
Use this link to download a copy of this text file to your Desktop: At.fastqc.sh
Let's take a look:
#!/bin/tcsh
#BSUB -J fastqc_At_GroupName #job name
#BSUB -n 20 #number of cores
#BSUB -W 2:0 #time for job to complete
#BSUB -o fastqc.%J.out #output file
#BSUB -e fastqc.%J.err #error file
# For running fastqc on all my Arabidopsis samples
# Run in working directory /share/bitcpt/S23/UnityID/At
# Must run this in working directory with subdirectory named /fastqc
# -t specifies number of threads
/usr/local/usrapps/bitcpt/fastqc/bin/fastqc /share/bitcpt/S23/RawData/Arabidopsis_thaliana/* -t 20 -outdir ./fastqc
Line 1: #!/bin/tcsh
This is the script header with:
the shebang (#!)
/bin/tcsh, the shell that will run the script on the LSF
Line 2: #BSUB -J fastqc_At_GroupName #job name
#BSUB is for LSF scripts
-J indicates the job name, as indicated by the hash note #job name
Line 3: #BSUB -n 20 #number of cores
#BSUB is for LSF scripts
-n 20 requests 20 cores (job slots), as indicated by the hash note #number of cores
Line 4: #BSUB -W 2:0 #time for job to complete
#BSUB is for LSF scripts
-W indicates wall clock time of 2 hours, as indicated by the hash note #time for job to complete
Line 5: #BSUB -o fastqc.%J.out #output file
#BSUB is for LSF scripts
-o indicates where the output files should be written, as indicated by the hash note #output file
Line 6: #BSUB -e fastqc.%J.err #error file
#BSUB is for LSF scripts
-e indicates where the error file should be written, as indicated by the hash note #error file
Line 7:
blank for aesthetic privilege
Line 8:# For running fastqc on all my Arabidopsis samples
Line 9: # Run in working directory /share/bitcpt/S23/UnityID/At
Line 10: # Must run this in working directory with subdirectory named /fastqc
Lines 8-10 are hash (#) notes for our future selves indicating key elements of the script and working directory for it to be run successfully.
Line 11:
blank for aesthetic privilege
Line 12: # -t specifies number of threads
hash (#) note for our future self about the -t in the next line
Line 13: /usr/local/usrapps/bitcpt/fastqc/bin/fastqc /share/bitcpt/S23/RawData/Arabidopsis_thaliana/* -t 20 -outdir ./fastqc
/usr/local/usrapps/bitcpt/fastqc/bin/fastqc where the command is installed
/share/bitcpt/S23/RawData/Arabidopsis_thaliana/* = input file path; * = wildcard character matching ALL files in the directory
-t 20 = option to use 20 threads
-outdir = option indicating where the output files will be generated; it requires a directory to be specified
./fastqc = specified path for the directory where the output files will be generated
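FastQC does not create the output directory for you, so the fastqc subdirectory must exist before you submit the job. A minimal sketch, run from your At working directory:

```shell
# The script writes its reports into ./fastqc, so create that
# subdirectory first (-p makes this safe to re-run)
mkdir -p fastqc
# Confirm it is there
ls -d fastqc
```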
Here's what the script looks like. You can copy and paste this directly into a script file (but do not run the command itself on the head node!)
#!/bin/tcsh
#BSUB -J fastqc_At_GroupName #job name
#BSUB -n 20 #number of cores
#BSUB -W 2:0 #time for job to complete
#BSUB -o fastqc.%J.out #output file
#BSUB -e fastqc.%J.err #error file
# For running fastqc on all my Arabidopsis samples
# Run in working directory /share/bitcpt/S23/UnityID/At
# Must run this in working directory with subdirectory named /fastqc
# -t specifies number of threads
/usr/local/usrapps/bitcpt/fastqc/bin/fastqc /share/bitcpt/S23/RawData/Arabidopsis_thaliana/* -t 20 -outdir ./fastqc
Transfer the fastqc script from your local machine to Henry2
Transfer the At.fastqc.sh text file to your At project working directory /share/bitcpt/S23/UnityID/At
Method 1: Use Globus Personal Connect to transfer the file from your Desktop to your At project working directory
Method 2: Use the scp command from your local machine to transfer a file from your local machine to the HPC. You will have to enter your login password and complete 2-step verification for the transfer to succeed, which is tedious.
scp ./Desktop/At.fastqc.sh UnityID@login.hpc.ncsu.edu:/share/bitcpt/S23/UnityID/At
Example:
scp ./Desktop/At.fastqc.sh casjogre@login.hpc.ncsu.edu:/share/bitcpt/S23/casjogre/At
You can also use the scp command from your local machine to transfer a file from the HPC to your local machine. The dot is your working directory on your local machine.
scp casjogre@login.hpc.ncsu.edu:/share/bitcpt/S23/casjogre/Ha/fastqc/*.html .
Check your At working directory for successful file transfer
As a reminder you should be located in your /share/bitcpt/S23/UnityID/At directory
tree
.
├── AlignedToTranscriptome
├── At.fastqc.sh
├── fastqc
├── salmon_align_quant
├── starindices
├── starOutputfiles
└── transcriptome
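If your tree is missing any of these subdirectories, you can create them in one go. A sketch, assuming the layout above is the one your earlier setup instructions specified (run from /share/bitcpt/S23/UnityID/At):

```shell
# Recreate the expected At working-directory layout; -p skips
# directories that already exist
mkdir -p AlignedToTranscriptome fastqc salmon_align_quant \
         starindices starOutputfiles transcriptome
# List the result
ls
```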
Run your FastQC script
To run this script and perform FastQC on your raw data, you need to submit a job to the LSF using the command bsub, indicating what to run:
bsub <At.fastqc.sh
You should get a return that looks similar to this:
Job <######> is submitted to default queue <short>.
Both the job #'s and the queue can vary every time you run a job on the LSF.
Now the script is being processed by the HPC. (FYI: you could log out of the terminal, your computer could crash, and this job would keep on running.)
Check out the jobs that you have running by using the command bjobs
bjobs
You should get a return that looks similar to this:
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
615645 casjogr RUN short login02 20*bc3i4 *GroupName Mar 29 16:18
If you list what is in your working directory, you should see two new files as the job runs (named per the -o and -e lines of the script):
fastqc.<######>.err
fastqc.<######>.out
You can view these files before the job is complete. Let's start with fastqc.<######>.err
more fastqc.<######>.err
Hitting the space bar will allow you to continue to see more of the file.
You should see which raw data files have started to run fastqc and progress on the run at 5% intervals:
Started analysis of At-Leaf1_L02_1.fq.gz
Approx 5% complete for At-Leaf1_L02_2.fq.gz
Approx 10% complete for At-Leaf1_L02_2.fq.gz
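When a file finishes, FastQC also prints an "Analysis complete for <file>" line, so you can count how many inputs have finished by searching the .err file. A minimal sketch, using a printf stand-in for a real log (the example.err name and contents are illustrative only):

```shell
# Stand-in for a real fastqc .err log (illustrative content only)
printf 'Started analysis of At-Leaf1_L02_1.fq.gz\nApprox 5%% complete for At-Leaf1_L02_1.fq.gz\nAnalysis complete for At-Leaf1_L02_1.fq.gz\n' > example.err

# Count how many input files have finished
grep -c 'Analysis complete' example.err
# prints 1
```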
Download your FastQC reports for analysis
Use Globus to download the HTML versions of the reports for easy viewing
Use Globus to download the zip files which contain 4 files and 2 folders:
fastqc_data.txt
fastqc_report.html
fastqc.fo
summary.txt
Icons (folder)
Images (folder)
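The summary.txt inside each zip is tab-separated (STATUS, MODULE, FILENAME), which makes it easy to pull out failed modules without opening every HTML report. A sketch with an illustrative two-line summary.txt standing in for a real one (an actual file has one row per FastQC module):

```shell
# Illustrative summary.txt (a real one comes out of the downloaded zip)
printf 'PASS\tBasic Statistics\tAt-Leaf1_L02_1.fq.gz\nFAIL\tPer base sequence content\tAt-Leaf1_L02_1.fq.gz\n' > summary.txt

# List only the modules flagged FAIL
grep '^FAIL' summary.txt
```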
Now it's your turn to apply what you have learned.
Set up your working directory with the proper subdirectories
Adapt a new script and run it for your raw sequence data for Soybean!
Everyone learns this at different paces:
Be helpful to one another and help explain things if you are learning quickly.
Ask lots of questions if you are not understanding the steps we are taking.
Remember that we can all have a growth mindset and that these skills will sharpen as we continue to practice them!
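As a starting point for the Soybean adaptation above, only the orange/green pieces change: the job name, the working-directory note, and the raw-data path. The Glycine_max directory name and the Soy working directory below are assumptions; check the actual names under /share/bitcpt/S23/RawData and your own project space before using them.

```
#!/bin/tcsh
#BSUB -J fastqc_Soy_GroupName #job name
#BSUB -n 20 #number of cores
#BSUB -W 2:0 #time for job to complete
#BSUB -o fastqc.%J.out #output file
#BSUB -e fastqc.%J.err #error file
# For running fastqc on all my Soybean samples
# Run in working directory /share/bitcpt/S23/UnityID/Soy
# Must run this in working directory with subdirectory named /fastqc
# -t specifies number of threads
/usr/local/usrapps/bitcpt/fastqc/bin/fastqc /share/bitcpt/S23/RawData/Glycine_max/* -t 20 -outdir ./fastqc
```

As with the At script, make the fastqc subdirectory before you bsub this file.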
Since everyone on your research team will be using the same dataset for your Team analysis and your individual analysis, this week does not have any follow up actions for your Final Portfolio assignment! (woo!)
1) Make sure the directory