Sequence data for CHC 2018 project, filtering and parsing

Post date: Jun 10, 2018 4:28:44 PM

We generated DNA sequence and CHC data for 192 caterpillars to map CHCs. The data are described here. The DNA sequence data are from the NextSeq on campus and are at /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_chcs/. The full data set is Undetermined_S0_R1_001.fastq, and a version where Aaron tried to removed PhiX is Gompert_Lyc46-47.fastq.

The cleaned version looks good overall on fastqc. We have 554,304,319 151 bp sequences. Quality scores stay above 30 on average out to 150 bp (though they decline some post 100). There is a notable increase in the %G towards the end of the reads. This reflects the new 2-dye chemistry (basically missing data converts to a high quality G). Here is a nice description. I will write a new script to deal with this.

Here is what I have done.

0. No PhiX removal, I am using what Aaron did.

1. Split the fastq files and parse the barcodes

split -l 200000000 Gompert_Lyc46-47.fastq Gomp24 &

Than I ran SubPars.sh (in the Scripts subdirectory).

#!/bin/sh

#SBATCH --time=144:00:00

#SBATCH --nodes=1

#SBATCH --ntasks=8

#SBATCH --account=gompert-kp

#SBATCH --partition=gompert-kp

#SBATCH --job-name=parseseq

#SBATCH --mail-type=FAIL

#SBATCH --mail-user=zach.gompert@usu.edu

module load perl

cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_chcs

perl RunParseFork.pl Gomp24*

This runs the fork script, which runs parse_barcodes768.pl.

#!/usr/bin/perl

# manage barcode parsing

use Parallel::ForkManager;

my $max = 20;

my $pm = Parallel::ForkManager->new($max);

my $bc = "barcodes_lyc46-47.csv";

FILES:

foreach $fq (@ARGV){

$pm->start and next FILES; ## fork

print "Parsing $fq with $bc";

system "perl parse_barcodes768.pl $bc $fq\n";

$pm->finish;

}

$pm->wait_all_children;

This generated the parsed_Gomp24a* files in the Parsed subdirectory.

2. Next I generated individual fastq files using the splitFastq.pl script.

perl ../Scripts/splitFastq.pl ids.txt parsed_Gomp24a*

3. Clean-up polyG issue. Note that having the Gs at the end (often with As before them) was causing illumina adaptors to be missed by my parse barcodes script. Thus, I made a script to remove Gs and adaptors followed by As and Gs. With an interactive job I ran.

perl ../Scripts/RunRemovePolyG.pl [A-Z][A-Z][A-Z]\-*fastq

which runs

#!/usr/bin/perl

# manage polyG and adaptor removal

use Parallel::ForkManager;

my $max = 20;

my $pm = Parallel::ForkManager->new($max);

FILES:

foreach $fq (@ARGV){

$pm->start and next FILES; ## fork

print "Cleaning $fq";

system "perl ../Scripts/RemovePolyG.pl $fq\n";

$pm->finish;

}

$pm->wait_all_children;

and these runs the main script RemovePolyG.pl on each fastq making clean_*fastq files.

#!/usr/bin/perl

# Script removes polyG tails that could reflect missing signal from 2-dye chemistry as is used with NextSeq

my $in = shift(@ARGV);

my $out = "clean_$in";

open(IN, $in) or die "failed to read the infile: $in\n";

open(OUT, "> $out") or die "failed to write the outfile: $out\n";

while(<IN>){

## deal with sequence

$seq = <IN>;

chomp($seq);

$fail = 0;

$L = length($seq);

if($seq =~ s/(TTAC[ACGNT]{30}CTTGA{1,}.*$)//){

$ln = length($1);

$cut = 1;

if ($seq == $L){

$fail = 1;

}

elsif($seq =~ s/(G{5,})$//){ ## removed poly G

$ln = length($1);

$cut = 1;

if ($seq == $L){

$fail = 1;

}

else{

$cut = 0;

}

if($fail == 0){

print OUT "$_"."$seq\n";

$hdr2 = <IN>;

$qual = <IN>;

chomp($qual);

if($cut==0){

print OUT "$hdr2"."$qual\n";

}

else{

$Lkept = $L -$ln;

$qualKept = substr($qual,0,$Lkept);

print OUT "$hdr2"."$qualKept\n";

}

else{

<IN>;

}

close(IN);

close(OUT);

Page updated

Google Sites

Report abuse