Post date: Jul 06, 2018 12:28:28 AM
0. No PhiX filtering since Aaron already did this.
0.5 Create ids.txt file from barcode_*.csv files
cat barcode_18-21.csv barcode_22-25.csv > ids.txt
Edit the header rows out of ids.txt (this can be scripted; see the example after this step).
awk '{print $NF}' FS=, ids.txt > id.txt
(Note: this writes id.txt, but step 3 below passes ids.txt to splitFastq.pl; make sure the name used there matches the ID list actually produced.)
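If the header rows share a recognizable first field, the hand edit above can be scripted; 'Sample' here is a placeholder for whatever the real headers actually start with:
grep -v '^Sample' ids.txt > tmp && mv tmp ids.txt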
1. Split the fastq files. This is done to speed up parsing via parallelization. The number of lines per split file must be divisible by 4, since each read occupies 4 lines.
Path: /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/fastq/raw
split -l 200000000 Cal_018-021.fastq Rego1 &
split -l 200000000 Cal_022-025.fastq Rego2 &
This produces two series of files, Rego1a* and Rego2a*, with 200 million lines in each file (except the last file of each series, which holds whatever remains).
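Because 200,000,000 is divisible by 4, every chunk should start on a read header. A quick sanity check (each printed character should be '@'):
for f in Rego1a* Rego2a*; do head -n 1 "$f" | cut -c1; done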
2. Parse Rego1 and Rego2 files.
sbatch SubParse1.sh
sbatch SubParse2.sh
SubParse1.sh runs RunParseFork1.pl on the Rego1a* files, and SubParse2.sh runs RunParseFork2.pl on the Rego2a* files.
The barcode input files are barcode_18-21.csv for RunParseFork1.pl and barcode_22-25.csv for RunParseFork2.pl.
Each RunParseFork script also runs parse_barcodes768.pl.
Produces parsed_Rego[1/2]a* files. These are moved to: /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/parsed
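For reference, a minimal sketch of what a submit script like SubParse1.sh might contain; the SLURM directives, resource requests, and script path here are assumptions, not the actual file:
#!/bin/bash
#SBATCH --job-name=parse1
#SBATCH --nodes=1
#SBATCH --time=48:00:00
cd /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/fastq/raw
perl ../../scripts/RunParseFork1.pl Rego1a*   # argument handling assumed; barcode_18-21.csv presumably read by the script itself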
3. Generate individual fastq files
Path: /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/parsed
In an interactive job, run:
perl ../scripts/splitFastq.pl ids.txt parsed_*
Produces [Individual ID].fastq files.
I removed the files L14A-2-4-1.fastq, L1-3-7-2.fastq, L2-10-9-12.fastq, and L2-2-5-9.fastq, since these IDs were duplicated on the written spreadsheet.
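In command form, that removal was:
rm L14A-2-4-1.fastq L1-3-7-2.fastq L2-10-9-12.fastq L2-2-5-9.fastq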
3.5 Check read counts to make sure they're all roughly equal
wc -l L* > reads1.txt
awk '!($2="")' reads1.txt
awk '!($2="")' reads1.txt > reads.txt
(The awk blanks the filename field, leaving only line counts; the first run is a preview, the second saves to reads.txt.)
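Note that wc -l counts lines, not reads (reads = lines/4). A variant that keeps file names attached and lists the lowest-count individuals first:
grep -v ' total$' reads1.txt | awk '{print $1/4, $2}' | sort -n | head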
I removed the following individuals for having too few reads (see reads_sum_zoom.pdf):
L14A:
12-2-10
12-2-21
12-2-28
3-4-21
3-5-3
4-7-14
L1:
5-2-5
6-6-10
6-7-9
L2:
12-1-17
2-2-4
3-9-1
After this, I have files for 748 individuals.
L14A - 251
L1 - 241
L2 - 256
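A quick way to verify those counts from the shell (assuming the file-name prefixes match the line labels, as in the removals above):
ls L14A-*.fastq | wc -l
ls L1-*.fastq | wc -l
ls L2-*.fastq | wc -l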
4. Clean polyG tails (current step)
perl ../scripts/RunRemovePolyG.pl L*.fastq
Requires RemovePolyG.pl
Produces clean_L*.fastq files for every individual
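RemovePolyG.pl isn't reproduced here, but the general idea is to trim the trailing run of G's that two-channel Illumina chemistry calls for no-signal cycles. A minimal sketch of that kind of trim (not the actual script; the 10-base threshold and stdin/stdout interface are assumptions):
#!/usr/bin/perl
# Trim a trailing poly-G run (>= $min_run bases) from each read and the
# matching tail of its quality string. FASTQ in on stdin, out on stdout.
use strict;
use warnings;

my $min_run = 10;   # hypothetical cutoff; the real script's threshold is unknown
while (my $hdr = <STDIN>) {
    my $seq  = <STDIN>;
    my $plus = <STDIN>;
    my $qual = <STDIN>;
    chomp(my $s = $seq);
    chomp(my $q = $qual);
    if ($s =~ /G{$min_run,}$/) {
        my $cut = $-[0];              # offset where the poly-G run begins
        $s = substr($s, 0, $cut);
        $q = substr($q, 0, $cut);
    }
    print $hdr, "$s\n", $plus, "$q\n";
}
Usage would be along the lines of: perl polyg_sketch.pl < L1-1-1.fastq > clean_L1-1-1.fastq (the script name is hypothetical).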