Preparing sequence data for and performing de novo clustering/assembly
Split the original fastq files by individual. The parsed and split fastq files are in CPCH on the kingspeak cluster.
The pathway is: /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/parsed/
There are 549 POSE files/individuals. POSE is a polyploid, specifically a ducadecaploid. I used the cd-hit clustering algorithm (Fu et al. 2012) to generate a reference genome.
Extract unique sequnces from each fastq file. As an interactive job on gompert-kp node, I ran the following from the parsed sub-directory. Interactive job: srun --time=5:00:00 --ntastsk=5 --nodes=1 --account=gompert-kp --partition=gomert-kp --pty /bin/bash -l
I ran this code to run the perl script (see GrassGenVar for more info about the perl scripts) created by Zach, to extract unique sequences from each POSE fastq file: perl ../scripts/getUni.pl POSE_*
Concatenate POSE files: cat uni_POSE_* > cat_uni_POSE.fastq
Extract the unique reads across all individuals: perl ../scripts/getCatUni.pl cat_uni_POSE.fasta
I ran several batch scripts for clustering. Since POSE is a polyploid, I ran cd-hit to cluster sequences from 80-94% by two's. All clustered files are found in my home directory under the sub-directory POSEClustering. The pathway is: /uufs/chpc.utah.edu/common/home/u1065430/POSEClustering/. We used POSEcluster.sh as an example batch script to run a batch job on the cluster:
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=cdhit
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jacquelinepena412@gmail.com
# cd /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering
./../cd-hit/cd-hit-est -i sub2_uni_cat_uni_POSE.fasta -o con80_sub2_uni_cat_uni_POSE.fasta -M 0 -T 0 -c 0.80 -mask N -n 8
For each cluster, I ran this unix command grep -c "^>Clus" con80_sub2_uni_cat_uni_POSE.fasta.clstr to get the number of clusters.The number of clusters found for each clustering are: at 80% there are 239421 clusters, at 82% 253685, at 84% 272332, at 86% 293142, at 88% 314056, at 90% 350525, at 92% 406567, and at 94% 467254. These can all be found in the text file, ClusterNumbers.txt in POSEClustering sub-directory.
For each cluster, I ran this unix command grep "^>" con86_sub2_uni_cat_uni_POSE.fasta | cut -f 2 -d " " to get the number of sequences in each cluster. Each clustering has its own text file (ex: 80ClusterReads.txt), which is found in the POSEClustering sub-directory.
For each cluster, I grabbed data on the percent match for every sequence in the original clusters. The perl script can be found in getCltrPrctMatch.pl. I ran this perl command: perl getCltrPrctMatch.pl > 80CltrPrctMatch.txt.
I again used cd-hit-est to cluster the clusters against each other. Here I are trying to find paralogs that we want to remove from the reference. This was done at at 80% for con90_sub2_uni_cat_uni_POSE.fasta, 80% for con84_sub2_uni_cat_uni_POSE.fasta, and 68% for con84_sub2_uni_cat_uni_POSE.fasta. I used ran this batch script:
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=cdhit
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jacquelinepena412@gmail.com
# cd /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering
./../cd-hit/cd-hit-est -i con90_sub2_uni_cat_uni_POSE.fasta -o con90at80_sub2_uni_cat_uni_POSE.fasta -M 0 -T 0 -c 0.80 -mask N -n 4