Text in red indicates a warning that you must not ignore.
Text in bold and green indicates something that requires modification for individual users, i.e. do not simply copy and paste without properly adapting it first.
Text in bold and orange indicates something that requires modification for At vs. Soy work.
User-maintained software space can be requested through the HPC. For our course, the software is INSTRUCTOR maintained and is located at the path: /usr/local/usrapps/bitcpt. If you would like to use additional software beyond fastqc, star, or salmon, you will need to ask your instructor to install it, as YOU do not have permission from the HPC to install user-maintained software.
To use user-maintained software like fastqc, star, or salmon, you will need to give the location of the software and its executable file, like this:
/usr/local/usrapps/bitcpt/fastqc/bin/fastqc
To check that fastqc is working, you can print the fastqc help screen, which also lists useful options, right before running it.
fastqc -h
fastqc sequence_file_name.fq.gz
so long as you are in the working directory where the data files are stored
fastqc /path/directory/subdirectory/sequence_file_name.fq.gz
giving the full path will always work no matter your current working directory
We won't be using either of these methods because we are all sharing our data, and all of our output files would overwrite one another in the RawData directories. However, if you did run one data file at a time, this is an example of what your output would look like for one data file (FastQC names its reports after the input file):
Ex: At-Leaf_L02_1.fq.gz produces At-Leaf_L02_1_fastqc.html and At-Leaf_L02_1_fastqc.zip
To run fastqc on many files and specify a new location for their output (which is what we will do), you will modify and use a script.
Use this link to download a copy of this text file to your Desktop: At.fastqc.sh
Let's take a look:
#!/bin/tcsh
#BSUB -J fastqc_At_GroupName #job name
#BSUB -n 20 #number of cores
#BSUB -W 2:0 #time for job to complete
#BSUB -o fastqc.%J.out #output file
#BSUB -e fastqc.%J.err #error file
# For running fastqc on all my Arabidopsis samples
# Run in working directory /share/bitcpt/S23/UnityID/At
# Must run this in working directory with subdirectory named /fastqc
# -t specifies number of threads
/usr/local/usrapps/bitcpt/fastqc/bin/fastqc /share/bitcpt/S23/RawData/Arabidopsis_thaliana/* -t 20 -outdir ./fastqc
Line 1: #!/bin/tcsh
This is the script header with:
the shebang (#!)
/bin/tcsh, the shell that will run the script on the LSF
Line 2: #BSUB -J fastqc_At_GroupName #job name
#BSUB is for LSF scripts
-J indicates the job name, as indicated by the hash note #job name
Line 3: #BSUB -n 20 #number of cores
#BSUB is for LSF scripts
-n 20 requests 20 cores (job slots), as indicated by the hash note #number of cores
Line 4: #BSUB -W 2:0 #time for job to complete
#BSUB is for LSF scripts
-W indicates wall clock time of 2 hours, as indicated by the hash note #time for job to complete
Line 5: #BSUB -o fastqc.%J.out #output file
#BSUB is for LSF scripts
-o indicates where the output files should be written, as indicated by the hash note #output file
Line 6: #BSUB -e fastqc.%J.err #error file
#BSUB is for LSF scripts
-e indicates where the error file should be written, as indicated by the hash note #error file
Line 7:
blank for aesthetic privilege
Line 8:# For running fastqc on all my Arabidopsis samples
Line 9: # Run in working directory /share/bitcpt/S23/UnityID/At
Line 10: # Must run this in working directory with subdirectory named /fastqc
Lines 8-10 are hash (#) notes for our future selves indicating key elements of the script and working directory for it to be run successfully.
Line 11:
blank for aesthetic privilege
Line 12: # -t specifies number of threads
hash (#) note for our future self about the -t in the next line
Line 13: /usr/local/usrapps/bitcpt/fastqc/bin/fastqc /share/bitcpt/S23/RawData/Arabidopsis_thaliana/* -t 20 -outdir ./fastqc
/usr/local/usrapps/bitcpt/fastqc/bin/fastqc where the command is installed
/share/bitcpt/S23/RawData/Arabidopsis_thaliana/* = input file path; * = wildcard character matching ALL files in the directory
-t 20 = option to use 20 threads
-outdir = option indicating where the output files will be generated; it requires a directory to be specified
./fastqc = specified path for the directory where the output files will be generated
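FastQC does not create the output directory for you, so the fastqc subdirectory must exist before you submit the job. A minimal sketch, run from your At working directory:

```shell
# The script writes its reports into ./fastqc, so create that
# subdirectory first (-p makes this safe to re-run)
mkdir -p fastqc
# Confirm it is there
ls -d fastqc
```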
Here's what the script looks like. You can copy and paste this directly into a script file (but do not run the command itself on the head node!)
#!/bin/tcsh
#BSUB -J fastqc_At_GroupName #job name
#BSUB -n 20 #number of cores
#BSUB -W 2:0 #time for job to complete
#BSUB -o fastqc.%J.out #output file
#BSUB -e fastqc.%J.err #error file
# For running fastqc on all my Arabidopsis samples
# Run in working directory /share/bitcpt/S23/UnityID/At
# Must run this in working directory with subdirectory named /fastqc
# -t specifies number of threads
/usr/local/usrapps/bitcpt/fastqc/bin/fastqc /share/bitcpt/S23/RawData/Arabidopsis_thaliana/* -t 20 -outdir ./fastqc
Transfer the fastqc script from your local machine to Henry2
Transfer the At.fastqc.sh text file to your At project working directory /share/bitcpt/S23/UnityID/At
Method 1: Use Globus Personal Connect to transfer the file from your Desktop to your At project working directory
Method 2: Use the scp command from your local machine to transfer a file from your local machine to the HPC. You will have to enter your login password and complete 2-step verification for the transfer to succeed, which is tedious.
scp ./Desktop/At.fastqc.sh UnityID@login.hpc.ncsu.edu:/share/bitcpt/S23/UnityID/At
Example:
scp ./Desktop/At.fastqc.sh casjogre@login.hpc.ncsu.edu:/share/bitcpt/S23/casjogre/At
You can also use the scp command from your local machine to transfer a file from the HPC to your local machine. The dot is your working directory on your local machine.
scp casjogre@login.hpc.ncsu.edu:/share/bitcpt/S23/casjogre/Ha/fastqc/*.html .
Check your At working directory for successful file transfer
As a reminder you should be located in your /share/bitcpt/S23/UnityID/At directory
tree
.
├── AlignedToTranscriptome
├── At.fastqc.sh
├── fastqc
├── salmon_align_quant
├── starindices
├── starOutputfiles
└── transcriptome
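If your tree is missing any of these subdirectories, you can create them in one go. A sketch, assuming the layout above is the one your earlier setup instructions specified (run from /share/bitcpt/S23/UnityID/At):

```shell
# Recreate the expected At working-directory layout; -p skips
# directories that already exist
mkdir -p AlignedToTranscriptome fastqc salmon_align_quant \
         starindices starOutputfiles transcriptome
# List the result
ls
```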
Run your FastQC script
To run this script and perform FastQC on your raw data, you need to submit a job to the LSF using the command bsub, indicating what to run:
bsub <At.fastqc.sh
You should get a return that looks similar to this:
Job <######> is submitted to default queue <short>.
Both the job #'s and the queue can vary every time you run a job on the LSF.
Now the script is being processed by the HPC. (FYI: you could log out of the terminal, your computer could crash, and this job would keep on running.)
Check out the jobs that you have running by using the command bjobs
bjobs
You should get a return that looks similar to this:
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
615645 casjogr RUN short login02 20*bc3i4 *GroupName Mar 29 16:18
If you list what is in your working directory, you should see two new files as the job runs (named per the -o and -e lines of the script):
fastqc.<######>.err
fastqc.<######>.out
You can view these files before the job is complete. Let's start with fastqc.<######>.err
more fastqc.<######>.err
Hitting the space bar will allow you to continue to see more of the file.
You should see which raw data files have started to run fastqc and progress on the run at 5% intervals:
Started analysis of At-Leaf1_L02_1.fq.gz
Approx 5% complete for At-Leaf1_L02_2.fq.gz
Approx 10% complete for At-Leaf1_L02_2.fq.gz
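When a file finishes, FastQC also prints an "Analysis complete for <file>" line, so you can count how many inputs have finished by searching the .err file. A minimal sketch, using a printf stand-in for a real log (the example.err name and contents are illustrative only):

```shell
# Stand-in for a real fastqc .err log (illustrative content only)
printf 'Started analysis of At-Leaf1_L02_1.fq.gz\nApprox 5%% complete for At-Leaf1_L02_1.fq.gz\nAnalysis complete for At-Leaf1_L02_1.fq.gz\n' > example.err

# Count how many input files have finished
grep -c 'Analysis complete' example.err
# prints 1
```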
Download your FastQC reports for analysis
Use Globus to download the HTML versions of the reports for easy viewing
Use Globus to download the zip files which contain 4 files and 2 folders:
fastqc_data.txt
fastqc_report.html
fastqc.fo
summary.txt
Icons (folder)
Images (folder)
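The summary.txt inside each zip is tab-separated (STATUS, MODULE, FILENAME), which makes it easy to pull out failed modules without opening every HTML report. A sketch with an illustrative two-line summary.txt standing in for a real one (an actual file has one row per FastQC module):

```shell
# Illustrative summary.txt (a real one comes out of the downloaded zip)
printf 'PASS\tBasic Statistics\tAt-Leaf1_L02_1.fq.gz\nFAIL\tPer base sequence content\tAt-Leaf1_L02_1.fq.gz\n' > summary.txt

# List only the modules flagged FAIL
grep '^FAIL' summary.txt
```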
Now it's your turn to apply what you have learned.
Set up your working directory with the proper subdirectories
Adapt a new script and run it for your raw sequence data for Soybean!
Everyone learns this at different paces:
Be helpful to one another and help explain things if you are learning quickly.
Ask lots of questions if you are not understanding the steps we are taking.
Remember that we can all have a growth mindset and that these skills will sharpen as we continue to practice them!
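As a starting point for the Soybean adaptation above, only the orange/green pieces change: the job name, the working-directory note, and the raw-data path. The Glycine_max directory name and the Soy working directory below are assumptions; check the actual names under /share/bitcpt/S23/RawData and your own project space before using them.

```
#!/bin/tcsh
#BSUB -J fastqc_Soy_GroupName #job name
#BSUB -n 20 #number of cores
#BSUB -W 2:0 #time for job to complete
#BSUB -o fastqc.%J.out #output file
#BSUB -e fastqc.%J.err #error file
# For running fastqc on all my Soybean samples
# Run in working directory /share/bitcpt/S23/UnityID/Soy
# Must run this in working directory with subdirectory named /fastqc
# -t specifies number of threads
/usr/local/usrapps/bitcpt/fastqc/bin/fastqc /share/bitcpt/S23/RawData/Glycine_max/* -t 20 -outdir ./fastqc
```

As with the At script, make the fastqc subdirectory before you bsub this file.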
Since everyone on your research team will be using the same dataset for your Team analysis and your individual analysis, this week does not have any follow up actions for your Final Portfolio assignment! (woo!)
1) Make sure the directory