Claident
ver. 0.9.2022
ref. GitHub@Claident
Installation logs: Linux_Ubuntu_z800
Follow the procedure in the Claident tutorial
Data:
Illumina sequence
Tag = already removed
Primer:
[Fprimer_EUBAC.fas]
>EUBAC
NNNNCCTACGGGNGGCWGCAG
[Rprimer_EUBAC.fas]
>EUBACR
NNNNGACTACHVGGGTATCTAATCC
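The primers above contain IUPAC degenerate bases (N, W, H, V). As a rough sanity check before running Claident, a degenerate primer can be expanded into a regular expression and matched against raw reads. This is a minimal sketch: `iupac_to_regex` is a hypothetical helper, not a Claident command, and the exact matching it allows is stricter than the mismatch-tolerant matching done by clsplitseq.

```shell
#!/bin/bash
# Hypothetical helper: expand IUPAC degenerate codes into a grep -E pattern.
# The replacement classes contain only A, C, G, T, so the order of the
# substitutions does not matter.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g' \
    -e 's/W/[AT]/g'  -e 's/K/[GT]/g'  -e 's/M/[AC]/g' \
    -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' -e 's/H/[ACT]/g' \
    -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Example (commented out; needs a real input file):
#   pat=$(iupac_to_regex CCTACGGGNGGCWGCAG)
#   gzip -dc ./data/H19M03D10Bac_S285_L001_R1_001.fastq.gz \
#     | awk 'NR % 4 == 2' | grep -cE "^.{4}$pat"
```

The `^.{4}` prefix in the example stands in for the leading NNNN; the resulting count is only a lower bound on what clsplitseq accepts.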
I. OTU/ASV preparation
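1. Primer removal
clsplitseq: quality-control the reads by primer match, remove the primer sequences, and rename the files for Claident
ref. MetabarcodingTextbook: Sequence analysis
By default, mismatches in the primer region are tolerated up to a fraction of 0.14 (14%) for the F-primer and 0.15 (15%) for the R-primer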
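Note: the trailing # annotations are for reading only; delete them before running, because the backslash must be the last character on each line.
clsplitseq \
--runname=H19M03D10 \ # sample name (required); shared by F and R
--tagname=TAG \ # tags are already removed, but a name is needed to build the file names (required)
--forwardprimerfile=Fprimer_EUBAC.fas \ # F-primer sequence (including NNNN) (e.g. 10.3MB; 46,643 reads)
--reverseprimerfile=Rprimer_EUBAC.fas \ # R-primer sequence (including NNNN) (e.g. 12.1MB; 46,643 reads)
--truncateN=ENABLE \ # required when NNNN was added to the 5' end of the primer
--numthreads=12 \
--append \ # write into the same folder
./data/H19M03D10Bac_S285_L001_R1_001.fastq.gz \ # input file R1 (forward strand)
./data/H19M03D10Bac_S285_L001_R2_001.fastq.gz \ # input file R2 (reverse strand)
./1_rmprimer # output folder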
Four files are created in 1_rmprimer:
- [runname]__[tagname]__[primername].forward.fastq.gz (6.2MB; 14,664 reads)
- [runname]__[tagname]__[primername].reverse.fastq.gz (7.9MB; 14,664 reads)
- [runname]__[tagname]__undetermined.forward.fastq.gz (137.5kB; 1,225 reads)
- [runname]__[tagname]__undetermined.reverse.fastq.gz (175.0kB)
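Read counts like the ones listed above can be reproduced by counting lines, since one FASTQ record is four lines. A minimal sketch; `count_reads` and `report_read_counts` are hypothetical helper names, and the default folder is the output folder used in this step.

```shell
#!/bin/bash
# Hypothetical helper: number of reads in a gzipped FASTQ (4 lines per record).
count_reads() {
  gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Print "file<TAB>reads" for every *.fastq.gz in a folder (default: ./1_rmprimer).
report_read_counts() {
  local dir=${1:-./1_rmprimer} f
  for f in "$dir"/*.fastq.gz; do
    [ -e "$f" ] || continue   # skip when the glob matched nothing
    printf '%s\t%s\n' "$(basename "$f")" "$(count_reads "$f")"
  done
}
```

Running `report_read_counts ./1_rmprimer` after clsplitseq should show matching counts for the forward and reverse files of each pair.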
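2. Concatenating forward and reverse sequences
clconcatpairv: merges the F and R reads, with settings for the length, quality, and mismatches of the overlapping region
By default, overlap length = 10 bp, ...
clconcatpairv \
--mode=OVL \ # overlap mode
--numthreads=12 \
./1_rmprimer \ # folder containing the input files (F/R pairs)
./2_combined # output folder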
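3. Filtering by minimum sequence length
clfilterseqv:
--minqual: quality threshold at the 3' end -> available, but probably meaningless after concatenation??
--maxnee=2.0: maximum number of expected errors (how many erroneous bases to allow?) <- needs further study (the number of removed reads followed the order 1, 2, 0)
Settings:
--minlen: minimum sequence length
--maxnNs=0 (default) <-- needed to remove erroneous sequences before the next step
clfilterseqv \
--minlen=100 \
--maxnee=1 \ # not fully understood yet
--maxnNs=0 \
--numthreads=12 \
./2_combined \
./3_filtered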
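On the open question about --maxnee: the usual definition of the expected number of errors in a read is the sum over all bases of 10^(-Q/10), where Q is the Phred quality of each base. The sketch below computes that value for one Phred+33 quality string. It assumes clfilterseqv follows this standard definition (it is VSEARCH-based); `expected_errors` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: expected number of errors for one Phred+33 quality string.
# EE = sum over bases of 10^(-Q/10), with Q = ASCII code - 33.
expected_errors() {
  printf '%s\n' "$1" | awk '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    {
      ee = 0
      for (i = 1; i <= length($0); i++) {
        q = ord[substr($0, i, 1)] - 33
        ee += 10 ^ (-q / 10)
      }
      printf "%.4f\n", ee
    }'
}

expected_errors 'IIIIIIIIII'   # ten Q40 bases -> 0.0010
expected_errors '5555555555'   # ten Q20 bases -> 0.1000
```

Under this definition a read passes --maxnee=1 when its summed error probability is at most 1, so a long read of uniformly mediocre quality can fail even without any single very bad base.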
4. Denoising with DADA2 (in R)
cldenoiseseqd:
cldenoiseseqd \
--numthreads=12 \
./3_filtered \ # folder containing the input files (concatenated fastq)
./4_denoised # output folder
-----
Contents of the output folder:
- denoised.fasta
- denoised.otu.gz
- denoised.tsv
- plotErrors.pdf
- runDADA2.R
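Before moving on to chimera removal, it can help to confirm that the files listed above were actually written and are not empty. A minimal sketch; `check_outputs` is a hypothetical helper, and the file names are the ones from the list above.

```shell
#!/bin/bash
# Hypothetical helper: return non-zero if any expected file is missing or empty.
check_outputs() {
  local dir=$1; shift
  local missing=0 f
  for f in "$@"; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (commented out; run it after cldenoiseseqd has finished):
#   check_outputs ./4_denoised denoised.fasta denoised.otu.gz denoised.tsv \
#     || echo "denoising output incomplete"
```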
5. Chimera removal with UCHIME3
clremovechimev:
clremovechimev \
--mode=BOTH \
--uchimedenovo=3 \
--referencedb=silva138.1SSUref \ # for bacterial 16S (DB=/usr/local/share/claident/uchimedb)
--numthreads=12 \
./4_denoised \ # folder containing the input files (denoised.fasta; denoised.otu.gz; denoised.tsv)
./5_rmchimera # output folder
6. OTU construction
clclassseqv:
clclassseqv \
--minident=0.99 \
--strand=plus \
--numthreads=12 \
./5_rmchimera/nonchimeras.fasta \ # input file (chimera-free sequences written by clremovechimev)
./6_OTU99 # output folder
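To see how much the 99% clustering collapses the ASVs, the number of FASTA headers before and after can be compared. A minimal sketch; `count_fasta` and `summarize_clustering` are hypothetical helpers, and the paths in the example assume the folder names used in these notes.

```shell
#!/bin/bash
# Hypothetical helper: number of sequences (header lines) in a FASTA file.
count_fasta() {
  # grep -c prints 0 but exits with status 1 when nothing matches
  grep -c '^>' "$1" || true
}

# Print "N sequences -> M clusters" for an input FASTA and a clustered FASTA.
summarize_clustering() {
  local before after
  before=$(count_fasta "$1")
  after=$(count_fasta "$2")
  printf '%s sequences -> %s clusters\n' "$before" "$after"
}

# Example (commented out; needs the real files):
#   summarize_clustering ./5_rmchimera/nonchimeras.fasta ./6_OTU99/clustered.fasta
```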
II. Taxonomy Identification
Create a folder for the results
% mkdir ./10_ClusteredSeqAnalysis
Database for taxonomy & blast in claident
7. QCauto method
clmakecachedb:
Memory = 16.4GB. Searching DB=overall_class takes a long time, so the sequences to be searched are selected from the DB in advance (10000 reads)
clmakecachedb \
--blastdb=overall_class \
--numthreads=12 \
./6_OTU99nee0silva/clustered.fasta \ # input file (clustered.fasta: representative sequences of the OTUs/ASVs)
./10_ClusteredSeqAnalysis/overall_class_OTU99nee0sil # output folder
clidentseq \
--blastdb=./10_ClusteredSeqAnalysis/overall_class_OTU99nee0sil \ # output folder of clmakecachedb
--numthreads=12 \
./6_OTU99nee0silva/clustered.fasta \ # input file = the same .fasta file given to clmakecachedb
./10_ClusteredSeqAnalysis/nhqc_overall_class_OTU99nee0sil.txt # output file
classigntax \
--taxdb=overall_class \ # same name as used for clmakecachedb
--minnsupporter=1 \
./10_ClusteredSeqAnalysis/nhqc_overall_class_OTU99nee0sil.txt \ # output file of clidentseq
./10_ClusteredSeqAnalysis/nhqc_tax_overallclass_OTU99nee0sil.tsv # output file
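The three commands above share several long paths, and a typo in any one of them breaks the chain. One way to reduce that risk is to build every path from the same variables. The sketch below does this in dry-run mode, printing the commands instead of executing them. `assign_taxonomy`, `run`, `DRY_RUN` and the output naming scheme are assumptions for illustration; only the Claident options already used in these notes appear.

```shell
#!/bin/bash
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper: assign_taxonomy FASTA OUTDIR LABEL [DB]
assign_taxonomy() {
  local fasta=$1 outdir=$2 label=$3 db=${4:-overall_class}
  local cache="$outdir/${db}_${label}"          # clmakecachedb output folder
  local hits="$outdir/nhqc_${db}_${label}.txt"  # clidentseq output file
  local tax="$outdir/nhqc_tax_${db}_${label}.tsv"
  run clmakecachedb --blastdb="$db" --numthreads=12 "$fasta" "$cache"
  run clidentseq --blastdb="$cache" --numthreads=12 "$fasta" "$hits"
  run classigntax --taxdb="$db" --minnsupporter=1 "$hits" "$tax"
}

assign_taxonomy ./6_OTU99/clustered.fasta ./10_ClusteredSeqAnalysis OTU99
```

Because the cache folder and the hits file are each defined once, clidentseq and classigntax cannot drift away from what the previous step wrote.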
clcom
#!/bin/bash
###################################################################
# Claident ver.0.9.2022
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# Pipeline for Illumina-fastq-sequences
# 0. Preparation: remove primers
#
# 1. Combine Forward and Reverse seq [clconcatpairv]
#
###################################################################
# Obtain options
if [ $# -ne 4 ]; then
echo "--------------------------------------------------------"
echo "Usage: "
echo "$0 [DIR] [NUM2] [PrimerID] [TAG]"
echo " [DIR]=Directory where RAW.FASTQ.GZ files are stored"
# echo " [NUM1]=Similarities for OTU-clustering ex) 0.92-0.95-0.97"
echo " [NUM2]=Length for sequence-cut-off ex) 300"
echo " [PrimerID]=Primer file name ex) EUBAC,EAV4,PCYA,18SV4V5, 18S515..."
echo " [TAG]=TAG-id (dummy) ex)WB... "
echo "Working directory should be a directory where [DIR] is"
echo "--------------------------------------------------------"
exit 1
fi
WD=$(pwd)
DIRraw=$1 #SIM=$2
LEN=$2
PRIMER=$3
TAGID=$4
#CVALS=($(echo $SIM | tr '-' ' ')) #echo ${CVALS[@]}
echo "-----------------------------------------------------------"
echo "Parameters: "
echo " WD: $WD"
echo " Directory where data are: $DIRraw"
echo " OTU-Similarity: 0.99 (fixed; see clclassseqv --minident)"
echo " Cut-off-length: $LEN"
echo "-----------------------------------------------------------"
cd "$WD/$DIRraw" || exit 1
mkdir -p ./1_rmprimers
if [ -e "$WD/$DIRraw/2_combined" ]; then
echo "2_combined already exists; skipping primer removal through chimera removal !!"
:
else
echo "Processing ... clsplitseq, clconcatpairv, clfilterseqv, cldenoiseseqd (DADA2) and clremovechimev !!"
##### clsplitseq: Remove primer-seqs
# obtain Primer files
Fprimer=$(ls -1 ~/local/share/primers/ | grep "$PRIMER" | grep "Fprimer")
Rprimer=$(ls -1 ~/local/share/primers/ | grep "$PRIMER" | grep "Rprimer")
cp ~/local/share/primers/$Fprimer $WD/$DIRraw/
cp ~/local/share/primers/$Rprimer $WD/$DIRraw/
# Obtain RunID
FILES=$(ls -1 | grep "\.fastq") # FILES=$(ls -1 | grep "\.fastq\.gz")
RunIDs0=()
for FASTQ in $FILES; do
RunID=${FASTQ%%L001_*} # strip everything from "L001_" onward
RunIDs0+=("$RunID")
done
RunIDs=$(printf '%s\n' "${RunIDs0[@]}" | sort -u) # one RunID per R1/R2 pair
echo "-----------------------------------------------------------"
echo "RunIDs: "
echo "$RunIDs"
echo "-----------------------------------------------------------"
# clsplitseq: Remove primers for each RunID
for ELM1 in $RunIDs; do
FFILE=$(ls | grep "$ELM1" | grep "_R1_")
RFILE=$(ls | grep "$ELM1" | grep "_R2_")
echo "-----------------------------------------------------------"
echo "F-FILE: $FFILE"
echo "R-FILE: $RFILE"
echo "-----------------------------------------------------------"
#--indexname=$TAGID \
clsplitseq \
--runname="$ELM1" \
--tagname="$TAGID" \
--forwardprimerfile="$Fprimer" \
--reverseprimerfile="$Rprimer" \
--truncateN=ENABLE \
--append \
--numthreads=12 \
"$FFILE" \
"$RFILE" \
./1_rmprimers
done
# clconcatpairv: Make combined seq
clconcatpairv \
--mode=OVL \
--numthreads=12 \
--minovllen=10 \
./1_rmprimers \
./2_combined
# clfilterseqv: Remove low-quality seqs
#CMBFQ=$(ls -1 ./2_combined/ | grep ".fastq.gz" | grep -v "filt" | grep -v "undetermined")
#for ELM2 in $CMBFQ; do
#COMBed=$(echo './2_combined/'"$ELM2")
#FILTed=$(echo './2_combined/'${ELM2%.fastq*}'.filt.fastq.gz')
clfilterseqv \
--maxnee=1 \
--minlen=$LEN \
--numthreads=12 \
./2_combined \
./3_filtered
# cldenoiseseqd: Denoising using DADA2
cldenoiseseqd \
--numthreads=12 \
./3_filtered \
./4_denoised
# clremovechimev: Remove chimera using UCHIME3 with silva138.1SSUref
clremovechimev \
--mode=BOTH \
--uchimedenovo=3 \
--referencedb=silva138.1SSUref \
--numthreads=12 \
./4_denoised \
./5_rmchimera
fi
echo "------------------------------"
echo "Finish: Denoising by DADA2 and Chimera remove !!"
echo "------------------------------"
##### 2nd Part #####
# clclassseqv: Clustering
clclassseqv \
--minident=0.99 \
--strand=plus \
--numthreads=12 \
./5_rmchimera/nonchimeras.fasta \
./6_OTU99
echo "------------------------------"
echo "Finish: Clustering !!"
echo "------------------------------"
###### Part3 ######
# Identification1: SINA
# SINA
SINADBF=$(ls -1 ~/local/db/ | grep "SILVA" | grep "NR99" | grep ".arb$")
SINADB_PATH="$HOME/local/db/$SINADBF"
echo "#----------------------------------"
echo "SINA_DB file: $SINADBF"
echo "SINA_DB path: $SINADB_PATH"
echo "#----------------------------------"
#----- SINA for DADA2 ASV
cd "$WD/$DIRraw/5_rmchimera" || exit 1
INFILE4SINA='nonchimeras.fasta'
MATRIX='nonchimeras.tsv'
OFILESINA1='asv_alsina.fas'
OFILESINA2="${OFILESINA1%.*}.csv"
sina -i "$INFILE4SINA" -o "$OFILESINA1" --meta-fmt csv --threads 12 --db "$SINADB_PATH" --fs-kmer-len 10 --search --search-min-sim 0.65 --lca-fields tax_slv
#-- Add SINAid to data-matrix
perl ~/local/bin/sinaid2clres.perl $OFILESINA2 $MATRIX
SINAWIDF=$(ls -1 | grep "_wrnmx_trps.csv" | tr -d '\n')
#-- Making ARB data
perl ~/local/bin/algnfas2gb4arbinport.perl $OFILESINA1 $SINAWIDF
#-- Making a SUMMARY
perl ~/local/bin/length_fasta.perl > length_fas_histgram.csv
#----- SINA for OTU99
cd "$WD/$DIRraw/6_OTU99" || exit 1
INFILE4SINA='clustered.fasta'
MATRIX='clustered.tsv'
OFILESINA1='otu99_alsina.fas'
OFILESINA2="${OFILESINA1%.*}.csv"
sina -i "$INFILE4SINA" -o "$OFILESINA1" --meta-fmt csv --threads 12 --db "$SINADB_PATH" --fs-kmer-len 10 --search --search-min-sim 0.65 --lca-fields tax_slv
#-- Add SINAid to data-matrix
perl ~/local/bin/sinaid2clres.perl $OFILESINA2 $MATRIX
SINAWIDF=$(ls -1 | grep "_wrnmx_trps.csv" | tr -d '\n')
#-- Making ARB data
perl ~/local/bin/algnfas2gb4arbinport.perl $OFILESINA1 $SINAWIDF
#-- Making a SUMMARY
perl ~/local/bin/length_fasta.perl > length_fas_histgram.csv
echo "------------------------------"
echo "Finish: SINA Search !!"
echo " Alignment.fasta : $OFILESINA1"
echo " Summary of SINA : $OFILESINA2"
echo " Matrix(W-IDs) : $SINAWIDF"
echo " Alignment4ARB.gb: "
echo "------------------------------"
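The script derives one RunID per R1/R2 pair from the Illumina file names. That step can be isolated and tested without any sequence data. A minimal sketch; `run_ids_from_names` is a hypothetical helper. Unlike the script, it also strips the underscore before "L001", so the ID for the sample file used in these notes comes out as H19M03D10Bac_S285.

```shell
#!/bin/bash
# Hypothetical helper: print the unique RunIDs for a list of Illumina file names.
# Only names containing _L001_R1_ or _L001_R2_ are considered.
run_ids_from_names() {
  local name
  for name in "$@"; do
    case $name in
      *_L001_R[12]_*) printf '%s\n' "${name%%_L001_*}" ;;
    esac
  done | sort -u
}

run_ids_from_names \
  H19M03D10Bac_S285_L001_R1_001.fastq.gz \
  H19M03D10Bac_S285_L001_R2_001.fastq.gz \
  Fprimer_EUBAC.fas
```

Using `sort -u` rather than `uniq` alone means the result does not depend on the input already being sorted.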