Post date: Jun 10, 2018 4:28:44 PM
We generated DNA sequence and CHC data for 192 caterpillars to map CHCs. The data are described here. The DNA sequence data are from the NextSeq on campus and are at /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_chcs/. The full data set is Undetermined_S0_R1_001.fastq, and a version where Aaron tried to removed PhiX is Gompert_Lyc46-47.fastq.
The cleaned version looks good overall on fastqc. We have 554,304,319 151 bp sequences. Quality scores stay above 30 on average out to 150 bp (though they decline some post 100). There is a notable increase in the %G towards the end of the reads. This reflects the new 2-dye chemistry (basically missing data converts to a high quality G). Here is a nice description. I will write a new script to deal with this.
Here is what I have done.
0. No PhiX removal, I am using what Aaron did.
1. Split the fastq files and parse the barcodes
split -l 200000000 Gompert_Lyc46-47.fastq Gomp24 &
Than I ran SubPars.sh (in the Scripts subdirectory).
#!/bin/sh
#SBATCH --time=144:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=parseseq
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
module load perl
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_chcs
perl RunParseFork.pl Gomp24*
This runs the fork script, which runs parse_barcodes768.pl.
#!/usr/bin/perl
#
# manage barcode parsing
#
use Parallel::ForkManager;
my $max = 20;
my $pm = Parallel::ForkManager->new($max);
my $bc = "barcodes_lyc46-47.csv";
FILES:
foreach $fq (@ARGV){
$pm->start and next FILES; ## fork
print "Parsing $fq with $bc";
system "perl parse_barcodes768.pl $bc $fq\n";
$pm->finish;
}
$pm->wait_all_children;
This generated the parsed_Gomp24a* files in the Parsed subdirectory.
2. Next I generated individual fastq files using the splitFastq.pl script.
perl ../Scripts/splitFastq.pl ids.txt parsed_Gomp24a*
3. Clean-up polyG issue. Note that having the Gs at the end (often with As before them) was causing illumina adaptors to be missed by my parse barcodes script. Thus, I made a script to remove Gs and adaptors followed by As and Gs. With an interactive job I ran.
perl ../Scripts/RunRemovePolyG.pl [A-Z][A-Z][A-Z]\-*fastq
which runs
#!/usr/bin/perl
#
# manage polyG and adaptor removal
#
use Parallel::ForkManager;
my $max = 20;
my $pm = Parallel::ForkManager->new($max);
FILES:
foreach $fq (@ARGV){
$pm->start and next FILES; ## fork
print "Cleaning $fq";
system "perl ../Scripts/RemovePolyG.pl $fq\n";
$pm->finish;
}
$pm->wait_all_children;
and these runs the main script RemovePolyG.pl on each fastq making clean_*fastq files.
#!/usr/bin/perl
#
# Script removes polyG tails that could reflect missing signal from 2-dye chemistry as is used with NextSeq
#
my $in = shift(@ARGV);
my $out = "clean_$in";
open(IN, $in) or die "failed to read the infile: $in\n";
open(OUT, "> $out") or die "failed to write the outfile: $out\n";
while(<IN>){
## deal with sequence
$seq = <IN>;
chomp($seq);
$fail = 0;
$L = length($seq);
if($seq =~ s/(TTAC[ACGNT]{30}CTTGA{1,}.*$)//){
$ln = length($1);
$cut = 1;
if ($seq == $L){
$fail = 1;
}
}
elsif($seq =~ s/(G{5,})$//){ ## removed poly G
$ln = length($1);
$cut = 1;
if ($seq == $L){
$fail = 1;
}
}
else{
$cut = 0;
}
if($fail == 0){
print OUT "$_"."$seq\n";
$hdr2 = <IN>;
$qual = <IN>;
chomp($qual);
if($cut==0){
print OUT "$hdr2"."$qual\n";
}
else{
$Lkept = $L -$ln;
$qualKept = substr($qual,0,$Lkept);
print OUT "$hdr2"."$qualKept\n";
}
}
else{
<IN>;
<IN>;
}
}
close(IN);
close(OUT);