Post date: Jul 06, 2018 12:28:28 AM
0. No PhiX filtering since Aaron already did this.
0.5 Create ids.txt file from barcode_*.csv files
cat barcode_18-21.csv barcode_22-25.csv > ids.txt
Edit the header rows out of ids.txt (this can be scripted; see the example after this step).
awk '{print $NF}' FS=, ids.txt > id.txt
(Note: this writes id.txt, but step 3 below passes ids.txt to splitFastq.pl; make sure the name used there matches the ID list actually produced.)
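If the header rows share a recognizable first field, the hand edit above can be scripted; 'Sample' here is a placeholder for whatever the real headers actually start with:
grep -v '^Sample' ids.txt > tmp && mv tmp ids.txt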
1. Split the fastq files. This is done to speed up parsing via parallelization. The number of lines per split file must be divisible by 4, since each read occupies 4 lines.
Path: /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/fastq/raw
split -l 200000000 Cal_018-021.fastq Rego1 &
split -l 200000000 Cal_022-025.fastq Rego2 &
This produces two series of files, Rego1a* and Rego2a*, with 200 million lines in each file (except the last file of each series, which holds whatever remains).
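Because 200,000,000 is divisible by 4, every chunk should start on a read header. A quick sanity check (each printed character should be '@'):
for f in Rego1a* Rego2a*; do head -n 1 "$f" | cut -c1; done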
2. Parse Rego1 and Rego2 files.
sbatch SubParse1.sh
sbatch SubParse2.sh
SubParse1.sh runs RunParseFork1.pl on the Rego1a* files, and SubParse2.sh runs RunParseFork2.pl on the Rego2a* files.
The barcode input files are barcode_18-21.csv for RunParseFork1.pl and barcode_22-25.csv for RunParseFork2.pl.
Each RunParseFork script also runs parse_barcodes768.pl.
Produces parsed_Rego[1/2]a* files. These are moved to: /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/parsed
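For reference, a minimal sketch of what a submit script like SubParse1.sh might contain; the SLURM directives, resource requests, and script path here are assumptions, not the actual file:
#!/bin/bash
#SBATCH --job-name=parse1
#SBATCH --nodes=1
#SBATCH --time=48:00:00
cd /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/fastq/raw
perl ../../scripts/RunParseFork1.pl Rego1a*   # argument handling assumed; barcode_18-21.csv presumably read by the script itself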
3. Generate individual fastq files
Path: /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/parsed
In an interactive job, run:
perl ../scripts/splitFastq.pl ids.txt parsed_*
Produces [Individual ID].fastq files.
I removed the files L14A-2-4-1.fastq, L1-3-7-2.fastq, L2-10-9-12.fastq, and L2-2-5-9.fastq, since these IDs were duplicated on the written spreadsheet.
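In command form, that removal was:
rm L14A-2-4-1.fastq L1-3-7-2.fastq L2-10-9-12.fastq L2-2-5-9.fastq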
3.5 Check read counts to make sure they're all roughly equal
wc -l L* > reads1.txt
awk '!($2="")' reads1.txt
awk '!($2="")' reads1.txt > reads.txt
(The awk blanks the filename field, leaving only line counts; the first run is a preview, the second saves to reads.txt.)
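Note that wc -l counts lines, not reads (reads = lines/4). A variant that keeps file names attached and lists the lowest-count individuals first:
grep -v ' total$' reads1.txt | awk '{print $1/4, $2}' | sort -n | head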
I removed the following individuals for having too few reads (see reads_sum_zoom.pdf):
L14A:
12-2-10
12-2-21
12-2-28
3-4-21
3-5-3
4-7-14
L1:
5-2-5
6-6-10
6-7-9
L2:
12-1-17
2-2-4
3-9-1
After this, I have files for 748 individuals.
L14A - 251
L1 - 241
L2 - 256
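A quick way to verify those counts from the shell (assuming the file-name prefixes match the line labels, as in the removals above):
ls L14A-*.fastq | wc -l
ls L1-*.fastq | wc -l
ls L2-*.fastq | wc -l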
4. Clean polyG tails (current step)
perl ../scripts/RunRemovePolyG.pl L*.fastq
Requires RemovePolyG.pl
Produces clean_L*.fastq files for every individual
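RemovePolyG.pl isn't reproduced here, but the general idea is to trim the trailing run of G's that two-channel Illumina chemistry calls for no-signal cycles. A minimal sketch of that kind of trim (not the actual script; the 10-base threshold and stdin/stdout interface are assumptions):
#!/usr/bin/perl
# Trim a trailing poly-G run (>= $min_run bases) from each read and the
# matching tail of its quality string. FASTQ in on stdin, out on stdout.
use strict;
use warnings;

my $min_run = 10;   # hypothetical cutoff; the real script's threshold is unknown
while (my $hdr = <STDIN>) {
    my $seq  = <STDIN>;
    my $plus = <STDIN>;
    my $qual = <STDIN>;
    chomp(my $s = $seq);
    chomp(my $q = $qual);
    if ($s =~ /G{$min_run,}$/) {
        my $cut = $-[0];              # offset where the poly-G run begins
        $s = substr($s, 0, $cut);
        $q = substr($q, 0, $cut);
    }
    print $hdr, "$s\n", $plus, "$q\n";
}
Usage would be along the lines of: perl polyg_sketch.pl < L1-1-1.fastq > clean_L1-1-1.fastq (the script name is hypothetical).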