Post date: Sep 01, 2017 7:55:20 PM
We split the original fastq files by individual. The parsed and split fastq files are in king:/uufs/chpc.utah.edu/common/home/u6000989/data/grasses/parsed/. This includes 617 (PSSP=diploid) and 549 (POSE) files/individuals for the two species.
We are using the cd-hit clustering algorithm to generate a reference genome; this replaces our earlier vsearch approach. Here is what we did. This is for PSSP; Jacque will repeat this for POSE.
1. Extract unique sequences from each fastq file.
As an interactive job on gompert-kp, we ran the following from the parsed sub-directory:
perl ../scripts/getUnique.pl PSSP_*
The full script is below. Note that it uses Parallel::ForkManager instead of a generic fork; this is much cleaner and lets one set the maximum number of child processes.
#!/usr/bin/perl
#
# prints the set of unique sequences from a fastq file; converts to fasta with the header line set to the number of copies of that sequence
#
use warnings;
use strict;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(50); ## 50 concurrent processes

FILES:
foreach my $fastq (@ARGV){ ## loop through set of fastq files
    $pm->start and next FILES; ## fork
    open(IN, $fastq) or die "failed to read $fastq\n";
    my $seq;
    my %useq; ## hash of unique sequences
    while(<IN>){
        chomp;
        if(m/^\@P/){ ## on header line
            $seq = <IN>;
            chomp($seq);
            if(defined $useq{$seq}){
                $useq{$seq}++;
            }
            else{
                $useq{$seq} = 1;
            }
            <IN>;
            <IN>; ## burn away + and qualities
        }
    }
    close(IN);
    my $fasta = $fastq;
    $fasta =~ s/q$/a/;
    open(OUT, "> uni_$fasta") or die "couldn't write output for $fasta\n";
    foreach my $s (sort keys %useq){
        ## random 48-character hex string as a unique sequence id
        my @chars = ('0'..'9', 'A'..'F');
        my $len = 48;
        my $string;
        while($len--){ $string .= $chars[rand @chars] }
        print OUT ">$string $useq{$s}\n";
        print OUT "$s\n";
    }
    close(OUT);
    $pm->finish; ## exit the child process
}
$pm->wait_all_children;
2. Concatenate the 617 uni*fasta files generated from (1). Then run a second script to extract the unique reads across all individuals.
cat uni_PSSP_* > cat_uni_PSSP.fasta
perl ../scripts/getCatUni.pl cat_uni_PSSP.fasta
Here is the full script:
#!/usr/bin/perl
#
# prints the set of unique sequences from the concatenated fasta file, with counts
#
use warnings;
use strict;

my $in = shift(@ARGV);
open(IN, $in) or die "failed to read $in\n";
my %useq; ## hash of unique sequences
while(<IN>){
    chomp;
    unless(m/^>/ or m/NNN/){ ## skip header lines and those with too many Ns
        $useq{$_}++;
    }
}
close(IN);
open(OUT, "> uni_$in") or die "couldn't write the outfile\n";
foreach my $s (sort keys %useq){
    ## random 48-character hex string as a unique sequence id
    my @chars = ('0'..'9', 'A'..'F');
    my $len = 48;
    my $string;
    while($len--){ $string .= $chars[rand @chars] }
    print OUT ">$string $useq{$s}\n";
    print OUT "$s\n";
}
close(OUT);
3. We used cd-hit-est (version 4.7) for clustering of sequences. To make this more tractable, we only worked with sequences present in at least two individuals. These are in sub2_uni_cat_uni_PSSP.fasta. We ran the clustering algorithm with a minimum match percentage of 90, 92 and 94%. Here is the slurm batch submission script (cluster.sh):
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=cdhit
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
cd /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering
cd-hit-est -i sub2_uni_cat_uni_PSSP.fasta -o con90_sub2_uni_cat_uni_PSSP.fasta -M 0 -T 0 -c 0.90 -mask N -n 8
cd-hit-est -i sub2_uni_cat_uni_PSSP.fasta -o con92_sub2_uni_cat_uni_PSSP.fasta -M 0 -T 0 -c 0.92 -mask N -n 8
cd-hit-est -i sub2_uni_cat_uni_PSSP.fasta -o con94_sub2_uni_cat_uni_PSSP.fasta -M 0 -T 0 -c 0.94 -mask N -n 8
This generated 1,762,580, 2,104,710, and 2,480,658 clusters at 90, 92, and 94%, respectively. We decided to go with 90%.
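The sub2_uni_cat_uni_PSSP.fasta input above was made by keeping only sequences observed in at least two individuals, using the count written into each header by getCatUni.pl (each individual contributes a given unique sequence at most once to the concatenated file). The actual filtering script is not shown; here is a minimal hypothetical sketch of the logic, where the threshold and header format (">ID count") follow the scripts above:

```python
# Hypothetical sketch (not the actual script used): keep sequences present in
# at least min_count individuals, based on the ">ID count" fasta headers
# written by getCatUni.pl. Assumes one sequence line per header.
def filter_min_count(fasta_lines, min_count=2):
    """Yield (header, sequence) pairs whose header count is >= min_count."""
    it = iter(fasta_lines)
    for header in it:
        seq = next(it)
        if int(header.split()[1]) >= min_count:  # header format: ">ID count"
            yield header, seq

# e.g. stream uni_cat_uni_PSSP.fasta through this and write the survivors
# to sub2_uni_cat_uni_PSSP.fasta
```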
4. We again used cd-hit-est to cluster the clusters against each other. Here we are trying to find paralogs that we want to remove from the reference. This was done at 80% (cluster2.sh):
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=cdhit
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
cd /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering
cd-hit-est -i con90_sub2_uni_cat_uni_PSSP.fasta -o con90at80_sub2_uni_cat_uni_PSSP.fasta -M 0 -T 0 -c 0.80 -mask N -n 4
5. We next generated the set of ids (first bit of fasta header) for the sequences/clusters that clustered at 80%.
Get number of sequences per cluster for clustering @ 90 and for clustering of clusters @ 80.
perl cntClusterDepth.pl con90_sub2_uni_cat_uni_PSSP.fasta.clstr > clusterDepth90.txt
perl cntClusterDepth.pl con90at80_sub2_uni_cat_uni_PSSP.fasta.clstr > clusterDepth90at80.txt
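cntClusterDepth.pl is not reproduced here; its job is simply to count the member sequences in each cluster of a cd-hit .clstr file. A hedged Python sketch of the same logic (the .clstr excerpt in the comment is illustrative, not taken from our data):

```python
# Sketch of the cluster-depth count (cntClusterDepth.pl itself is not shown).
# A cd-hit .clstr file alternates ">Cluster N" headers with one line per
# member sequence, e.g.:
#   >Cluster 0
#   0   92nt, >F00A1B2... *
#   1   92nt, >C3D4E5... at +/98.91%
def cluster_depths(clstr_lines):
    """Return {cluster_number: member_count} from .clstr-format lines."""
    depths = {}
    cluster = None
    for line in clstr_lines:
        if line.startswith(">Cluster"):
            cluster = int(line.split()[1])
            depths[cluster] = 0
        else:
            depths[cluster] += 1
    return depths
```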
Get ids of clusters that clustered at 80 and print the list (915,945 sequences) to a file:
perl getClusters2drop.pl > bad_clusteredat80.txt
Here is the getClusters2drop.pl script:
#!/usr/bin/perl
#
# first reads a depth file, gets cluster numbers for clusters with > 1 reference seq, and prints their ids
#
$depthFile = "clusterDepth90at80.txt";
open(IN, $depthFile) or die "failed to read $depthFile\n";
while(<IN>){
    chomp;
    m/(\d+)\s+(\d+)/;
    $cl = $1;
    $cnt = $2;
    if($cnt > 1){ ## bad cluster
        $drop{$cl} = 1;
    }
}
close(IN);

$clusterFile = "con90at80_sub2_uni_cat_uni_PSSP.fasta.clstr";
open(IN, $clusterFile) or die "failed to read $clusterFile\n";
while(<IN>){
    chomp;
    if(m/^>Cluster\s+(\d+)/){
        $cl = $1;
        if(defined $drop{$cl}){
            $bad = 1;
        }
        else{
            $bad = 0;
        }
    }
    elsif($bad == 1){
        m/>([A-Z0-9]+)/ or die "failed match $_\n";
        print "$1\n";
    }
}
close(IN);
6. We also searched for clusters from the original set (@90%) that looked suspicious and that we also wanted to drop. We flagged clusters meeting any of several criteria: low mean sequence similarity, high variation in sequence similarity, or a cluster size of one (singletons).
Grab data on %match for every sequence in the original (i.e., @90%) clusters:
perl getPrctMatch.pl > percentMatch.txt
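getPrctMatch.pl is not shown either; it walks the .clstr file and, for every non-reference member of each cluster, writes the cluster number and the percent match reported by cd-hit (the reference sequence of each cluster ends in '*' and carries no percentage). A hypothetical Python sketch of that parse, with percentages kept as cd-hit reports them:

```python
import re

# Sketch of the getPrctMatch.pl logic (the original script is not shown):
# emit (cluster_number, percent_match) for every non-reference member line
# of a cd-hit-est .clstr file; member lines end in e.g. "at +/98.91%".
def percent_matches(clstr_lines):
    """Yield (cluster_number, percent_match) for each non-reference member."""
    cluster = None
    for line in clstr_lines:
        if line.startswith(">Cluster"):
            cluster = int(line.split()[1])
        else:
            m = re.search(r"at [+-]/(\d+\.\d+)%", line)
            if m:  # reference lines (ending in '*') have no percentage
                yield cluster, float(m.group(1))
```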
Run the R script summarizePrctMatch.R to get the mean and SD of % match by cluster, and identify clusters with a low mean or high SD that seem problematic:
## read in percent data, there are N rows, one per non-ref sequence
## 2 columns per row, cluster no. and %match
N<-27654508
dat<-matrix(scan("percentMatch.txt",n=N*2,sep=" "),nrow=N,ncol=2,byrow=TRUE)
## summarize, mean and sd % match per cluster
mns<-tapply(X=dat[,2],INDEX=dat[,1],mean)
sds<-tapply(X=dat[,2],INDEX=dat[,1],sd)
cnt<-tapply(X=is.na(dat[,2])==F,INDEX=dat[,1],sum)
plot(mns,sds)
sum(mns > 0.92 & sds < 5,na.rm=TRUE)
##[1] 912530
## 912,530 with mean > 92% and sd < 5%
xx<-which((mns < 0.92 | sds > 5) & cnt > 3)
## write cluster numbers for the bad clusters that fail these criteria
write.table(names(mns[xx]),file="bad_mnSd.txt",quote=F,row.names=F,col.names=F)
Get ids for these sets of 'bad' clusters.
perl getClusters2dropOriginal.pl
This flagged 3,723 bad mean/SD clusters and 582,229 bad singleton clusters for removal.
Here is the script:
#!/usr/bin/perl
#
# works with original set of clusters (@90), drops those that are in the bad_mnSd.txt file and those with only one member, does so by reporting ids
#
$mnsdFile = "bad_mnSd.txt";
open(IN, $mnsdFile) or die "failed to read $mnsdFile\n";
while(<IN>){
    chomp;
    $drop{$_} = 1;
}
close(IN);

$badMn = 0;
$badSingleton = 0;
open(OUT, "> bad_original.txt") or die "failed to write\n";
$clusterFile = "con90_sub2_uni_cat_uni_PSSP.fasta.clstr";
open(IN, $clusterFile) or die "failed to read $clusterFile\n";
while(<IN>){
    chomp;
    if(m/^>Cluster\s+(\d+)/){
        $cl = $1;
        if(defined $drop{$cl}){
            $bad = 1;
            $badMn++;
        }
        else{
            $bad = 0;
        }
    }
    elsif($bad == 1){
        m/>([A-Z0-9]+)/ or die "failed match $_\n";
        print OUT "$1\n";
    }
    elsif(m/>([A-Z0-9]+)/){
        $seq = $1;
        $cnts{$cl}++;
        if(m/\*$/){ ## reference sequence for this cluster
            $ref{$cl} = $seq;
        }
    }
    else{
        die "why am i here $_\n";
    }
}
close(IN);
## report the reference sequence of every singleton cluster as bad
foreach $cl (keys %cnts){
    if($cnts{$cl} == 1){
        $badSingleton++;
        print OUT "$ref{$cl}\n";
    }
}
close(OUT);
print "flagged $badMn bad mean clusters and $badSingleton bad singleton clusters for removal\n";
7. Make the fasta reference, dropping all of the bad clusters.
Combine the lists of bad clusters (1,518,765 ids, though not all unique):
cat bad_clusteredat80.txt bad_original.txt > bad_cat.txt
Drop the bad sequences:
perl makeReference.pl
Based on this we found 1,252,805 bad clusters and 509,775 good clusters. The good clusters are in PSSP_reference_con90.fasta; this is the reference. Here is the makeReference.pl script:
#!/usr/bin/perl
#
# create a reference from the existing fasta/cluster set but dropping bad sequences
#
open(IN, "bad_cat.txt") or die "failed to read bad_cat.txt\n";
while(<IN>){
    chomp;
    $bad{$_} = 1;
}
close(IN);

open(IN, "con90_sub2_uni_cat_uni_PSSP.fasta") or die "failed to read\n";
open(OUT, "> PSSP_reference_con90.fasta") or die "failed to write\n";
$cntGood = 0;
$cntBad = 0;
while(<IN>){
    chomp;
    if(m/^>/){
        ## ids in bad_cat.txt come from the .clstr files, where cd-hit
        ## truncates the header, so match on the first 19 characters
        $id = substr($_, 1, 19);
        if(!defined $bad{$id}){
            print OUT "$_\n";
            $seq = <IN>;
            print OUT "$seq";
            $cntGood++;
        }
        else{
            <IN>; ## skip the sequence line
            $cntBad++;
            $bad{$id} = 0; ## mark this bad id as seen
        }
    }
}
close(IN);
close(OUT);
print "found $cntBad bad clusters and $cntGood good clusters\n";
We are now ready for the actual alignment, though we might want to filter against fungal endophytes and bacteria later.