Claident

ver. 0.9.2022


Follow the steps in the Claident tutorial.

ref ClaidentTutrial.pdf

Data:

Illumina sequence

Tag = already removed

Primer:

[Fprimer_EUBAC.fas]

>EUBAC

NNNNCCTACGGGNGGCWGCAG

[Rprimer_EUBAC.fas]

>EUBACR

NNNNGACTACHVGGGTATCTAATCC
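
The primers above contain IUPAC degenerate codes (N, W, H, V). As a rough sanity check before running clsplitseq, a primer can be expanded into a regular expression and matched against the start of the raw reads. This is only a sketch: `iupac_to_regex` is a hypothetical helper, it tolerates no mismatches, and it does not reproduce how clsplitseq itself matches primers.

```shell
#!/bin/bash
# Expand IUPAC degenerate bases into bracket expressions usable with grep -E.
# Hypothetical helper, not part of Claident.
iupac_to_regex() {
    echo "$1" | sed \
        -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g' \
        -e 's/W/[AT]/g'  -e 's/K/[GT]/g'  -e 's/M/[AC]/g' \
        -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' -e 's/H/[ACT]/g' \
        -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Example (commented out because it needs the raw data):
# count R1 reads that start with the forward primer.
# gzip -dc ./data/H19M03D10Bac_S285_L001_R1_001.fastq.gz \
#     | awk 'NR % 4 == 2' \
#     | grep -cE "^$(iupac_to_regex NNNNCCTACGGGNGGCWGCAG)"
```

Comparing that count with the number of reads clsplitseq keeps gives a quick feel for how many reads are lost to primer mismatches.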

I. OTU/ASV preparation

clsplitseq: quality-checks the reads by primer match, removes the primer sequences, and renames the files for Claident

ref. MetabarcodingTextbook: Sequence analysis

clsplitseq \

--runname=H19M03D10 \ # sample name (required); shared by F and R

--tagname=TAG \ # tags are already removed, but a tag name is still needed to build the file names (required)

--forwardprimerfile=Fprimer_EUBAC.fas \ # forward primer sequence (including NNNN) (e.g. 10.3 MB; 46,643 reads)

--reverseprimerfile=Rprimer_EUBAC.fas \ # reverse primer sequence (including NNNN) (e.g. 12.1 MB; 46,643 reads)

--truncateN=ENABLE \ # required when NNNN was added to the 5' end of the primers

--numthreads=12 \

--append \ # write into the same folder

./data/H19M03D10Bac_S285_L001_R1_001.fastq.gz \ # input file R1 (forward strand)

./data/H19M03D10Bac_S285_L001_R2_001.fastq.gz \ # input file R2 (reverse strand)

./1_rmprimer # output folder


Four files are created in 1_rmprimer:

  • [runname]__[tagname]__[primername].forward.fastq.gz (6.2MB; 14,664 reads)
  • [runname]__[tagname]__[primername].reverse.fastq.gz (7.9MB; 14,664 reads)
  • [runname]__[tagname]__undetermined.forward.fastq.gz (137.5kB; 1,225 reads)
  • [runname]__[tagname]__undetermined.reverse.fastq.gz (175.0kB)
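
The read counts listed above can be re-checked directly from the output files. A minimal sketch, assuming standard four-line FASTQ records; `count_reads` is a hypothetical helper and not a Claident command.

```shell
#!/bin/bash
# Count reads in a gzip-compressed FASTQ file (4 lines per read).
# Hypothetical helper, not part of Claident.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Example (commented out because it needs the clsplitseq output):
# for f in ./1_rmprimer/*.fastq.gz; do
#     printf '%s\t%s\n' "$f" "$(count_reads "$f")"
# done
```

The forward and reverse files of one sample should report the same number of reads.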

2. Merging forward and reverse reads

clconcatpairv: merges the F and R reads, with settings for the length, quality, and mismatches of the overlapping region

clconcatpairv \

--mode=OVL \ # Overlap mode

--numthreads=12 \

./1_rmprimer \ # folder containing the input files (F/R pairs)

./2_combined # output folder
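
In overlap mode the reverse read has to be reverse-complemented before its 3' end can be aligned against the 3' end of the forward read. The sketch below shows only that orientation step; it does not merge reads, and `revcomp` is a hypothetical helper that leaves N and other non-ACGT characters unchanged.

```shell
#!/bin/bash
# Reverse-complement a DNA sequence (A<->T, C<->G; other characters are kept).
# Hypothetical helper, not part of Claident.
revcomp() {
    echo "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp ACGGT    # prints ACCGT
```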

3. Filtering by minimum sequence length

clfilterseqv: 

Settings:

clfilterseqv \

--minlen=100 \

--maxnee=1 \ # not sure about this one; presumably the maximum number of expected errors allowed per read

--maxNs=0 \

--numthreads=12 \

./2_combined \

./3_filtered
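
If, as the option name suggests, --maxnee limits the number of expected errors per read, the quantity being thresholded is the sum of the per-base error probabilities 10^(-Q/10) taken from the Phred quality string. The sketch below computes that sum for one quality string. It assumes Phred+33 encoding, `expected_errors` is a hypothetical helper, and it is not taken from the Claident source.

```shell
#!/bin/bash
# Expected number of errors of one read from its Phred+33 quality string:
# the sum over all bases of 10^(-Q/10).
# Hypothetical helper, not part of Claident.
expected_errors() {
    printf '%s\n' "$1" | awk '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        {
            ee = 0
            for (i = 1; i <= length($0); i++) {
                q = ord[substr($0, i, 1)] - 33        # Phred score of base i
                ee += exp(-q * log(10) / 10)          # error probability 10^(-q/10)
            }
            printf "%.4f\n", ee
        }'
}

expected_errors 'IIII'    # four bases at Q40
expected_errors '5555'    # four bases at Q20
```

Under that reading, a read would pass --maxnee=1 when this sum is at most 1.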


4. Denoising with DADA2 (in R)

cldenoiseseqd

cldenoiseseqd \

--numthreads=12 \

./3_filtered \ # folder containing the input files (merged fastq)

./4_denoised # output folder

-----

Contents of the output folder:

  • denoised.fasta
  • denoised.otu.gz
  • denoised.tsv
  • plotErrors.pdf
  • runDADA2.R
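
To get a quick overview of denoised.fasta, the sequence lengths can be tabulated. A minimal sketch; `fasta_length_hist` is a hypothetical helper that prints one line per length (length, tab, number of sequences) and copes with sequences wrapped over several lines.

```shell
#!/bin/bash
# Length histogram of a FASTA file: "<length><TAB><number of sequences>".
# Hypothetical helper, not part of Claident.
fasta_length_hist() {
    awk '
        /^>/ { if (seen) count[len]++; len = 0; seen = 1; next }   # new record
             { len += length($0) }                                 # sequence line
        END  { if (seen) count[len]++
               for (l in count) print l "\t" count[l] }
    ' "$1" | sort -n
}

# Example (commented out because it needs the cldenoiseseqd output):
# fasta_length_hist ./4_denoised/denoised.fasta
```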

5. Chimera removal with UCHIME3

clremovechimev

clremovechimev \

--mode=BOTH \

--uchimedenovo=3 \

--referencedb=silva138.1SSUref \ # for bacterial 16S (DB=/usr/local/share/claident/uchimedb)

--numthreads=12 \

./4_denoised \ # folder containing the input files (denoised.fasta; denoised.otu.gz; denoised.tsv)

./5_rmchimera # output folder

6. OTU clustering

clclassseqv

clclassseqv \

--minident=0.99 \

--strand=plus \

--numthreads=12 \

./5_rmchimera \ # input: output of the chimera-removal step (the script below passes ./5_rmchimera/nonchimeras.fasta)

./6_OTU99 # output folder
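
To make the 0.99 threshold concrete: two sequences that differ at 1 position in 100 are exactly 99% identical. The sketch below computes the identity of two sequences of equal length, position by position. It is an illustration only: `pairwise_identity` is a hypothetical helper, it ignores gaps, and clclassseqv's own alignment-based identity calculation is not reproduced here.

```shell
#!/bin/bash
# Fraction of identical positions between two sequences of equal length.
# Returns a non-zero exit status if the lengths differ or are zero.
# Hypothetical helper, not part of Claident.
pairwise_identity() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = length(a)
        if (n == 0 || n != length(b)) exit 1
        m = 0
        for (i = 1; i <= n; i++)
            if (substr(a, i, 1) == substr(b, i, 1)) m++
        printf "%.3f\n", m / n
    }'
}

pairwise_identity ACGTACGTAC ACGTACGTAA    # 9 of 10 positions match
```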

II. Taxonomy Identification

Create a folder for the results

% mkdir ./10_ClusteredSeqAnalysis 

Database for taxonomy & blast in claident

  • animals_COX1_genus, animals_COX1_genus_man, animals_COX1_species, animals_COX1_species_man, animals_COX1_species_wosp, animals_COX1_species_wosp_man, animals_COX1_species_wsp, animals_COX1_species_wsp_man
  • animals_mt_genus, animals_mt_genus_man, animals_mt_species, animals_mt_species_man, animals_mt_species_wosp, animals_mt_species_wosp_man, animals_mt_species_wsp, animals_mt_species_wsp_man
  • eukaryota_LSU_genus, eukaryota_LSU_genus_man, eukaryota_LSU_species, eukaryota_LSU_species_man, eukaryota_LSU_species_wosp, eukaryota_LSU_species_wosp_man, eukaryota_LSU_species_wsp, eukaryota_LSU_species_wsp_man
  • eukaryota_SSU_genus, eukaryota_SSU_genus_man, eukaryota_SSU_species, eukaryota_SSU_species_man, eukaryota_SSU_species_wosp, eukaryota_SSU_species_wosp_man, eukaryota_SSU_species_wsp, eukaryota_SSU_species_wsp_man
  • fungi_ITS_genus, fungi_ITS_genus_man, fungi_ITS_species, fungi_ITS_species_man, fungi_ITS_species_wosp, fungi_ITS_species_wosp_man, fungi_ITS_species_wsp, fungi_ITS_species_wsp_man
  • fungi_all_genus, fungi_all_genus_man, fungi_all_species, fungi_all_species_man, fungi_all_species_wosp, fungi_all_species_wosp_man, fungi_all_species_wsp, fungi_all_species_wsp_man
  • overall_class, overall_family, overall_genus, overall_genus_man, overall_order, overall_species, overall_species_man, overall_species_wosp, overall_species_wosp_man, overall_species_wsp, overall_species_wsp_man
  • plants_cp_genus, plants_cp_genus_man, plants_cp_species, plants_cp_species_man, plants_cp_species_wosp, plants_cp_species_wosp_man, plants_cp_species_wsp, plants_cp_species_wsp_man
  • plants_matK_genus, plants_matK_genus_man, plants_matK_species, plants_matK_species_man, plants_matK_species_wosp, plants_matK_species_wosp_man, plants_matK_species_wsp, plants_matK_species_wsp_man
  • plants_rbcL_genus, plants_rbcL_genus_man, plants_rbcL_species, plants_rbcL_species_man, plants_rbcL_species_wosp, plants_rbcL_species_wosp_man, plants_rbcL_species_wsp, plants_rbcL_species_wsp_man
  • plants_trnH-psbA_genus, plants_trnH-psbA_genus_man, plants_trnH-psbA_species, plants_trnH-psbA_species_man, plants_trnH-psbA_species_wosp, plants_trnH-psbA_species_wosp_man, plants_trnH-psbA_species_wsp, plants_trnH-psbA_species_wsp_man
  • prokaryota_16S_genus, prokaryota_16S_genus_man, prokaryota_16S_species, prokaryota_16S_species_man, prokaryota_16S_species_wosp, prokaryota_16S_species_wosp_man, prokaryota_16S_species_wsp, prokaryota_16S_species_wsp_man
  • prokaryota_all_genus, prokaryota_all_genus_man, prokaryota_all_species, prokaryota_all_species_man, prokaryota_all_species_wosp, prokaryota_all_species_wosp_man, prokaryota_all_species_wsp, prokaryota_all_species_wsp_man

7. QCauto method

clmakecachedb

clmakecachedb \

--blastdb=overall_class \

--numthreads=12 \

./6_OTU99nee0silva/clustered.fasta \ # input file (clustered.fasta: representative sequences of the OTUs/ASVs)

./10_ClusteredSeqAnalysis/overall_class_OTU99nee0sil # output folder


clidentseq \

--blastdb=./10_ClusteredSeqAnalysis/overall_class_OTU99nee0sil \ # output folder of clmakecachedb

--numthreads=12 \

./6_OTU99nee0silva/clustered.fasta \ # input file = the same .fasta file as given to clmakecachedb

./10_ClusteredSeqAnalysis/nhqc_overall_class_OTU99nee0sil.txt # output file


classigntax \

--taxdb=overall_class \ # same name as used for clmakecachedb

--minnsupporter=1 \

./10_ClusteredSeqAnalysis/nhqc_overall_class_OTU99nee0sil.txt \ # output file of clidentseq

./10_ClusteredSeqAnalysis/nhqc_tax_overallclass_OTU99nee0sil.tsv # output file
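
The three commands above pass the same names from one step to the next (cache DB folder, clidentseq output, classigntax output), so a single typo breaks the chain. One way to avoid that is to derive every path from a few variables. This is only a sketch: the variable names are hypothetical, and the last file name follows the same pattern as the others, so it differs slightly from the one written in the notes.

```shell
#!/bin/bash
# Hypothetical variable names; the values are the ones used in the notes above.
DB=overall_class
LABEL=OTU99nee0sil
OUTDIR=./10_ClusteredSeqAnalysis

CACHEDB="$OUTDIR/${DB}_${LABEL}"              # clmakecachedb output, clidentseq --blastdb
NHQC="$OUTDIR/nhqc_${DB}_${LABEL}.txt"        # clidentseq output, classigntax input
TAXTSV="$OUTDIR/nhqc_tax_${DB}_${LABEL}.tsv"  # classigntax output

# Show the derived paths before using them in the three commands.
printf '%s\n' "$CACHEDB" "$NHQC" "$TAXTSV"
```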


clcom

#!/bin/bash

###################################################################

# Claident ver.0.9.2022

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

# Pipeline for Illumina-fastq-sequences 

# 0. Preparation: remove primers

#

# 1. Combine Forward and Reverse seq [clconcatpairv]

#    

###################################################################

# Obtain options

if [ $# -ne 4 ]; then

echo "--------------------------------------------------------"

    echo "Usage: "

    echo "$0 [DIR] [NUM2] [PrimerID] [TAG]"

    echo "    [DIR]=Directory where RAW.FASTQ.GZ files are stored"

#    echo "    [NUM1]=Similarities for OTU-clustering    ex) 0.92-0.95-0.97"

    echo "    [NUM2]=Length for sequence-cut-off    ex) 300"

    echo "    [PrimerID]=Primer file name    ex) EUBAC,EAV4,PCYA,18SV4V5, 18S515..."

    echo "    [TAG]=TAG-id (dummy)   ex)WB... "

    echo "Working directory should be a directory where [DIR] is"

echo "--------------------------------------------------------"

exit 1

fi


WD=$(pwd)

DIRraw=$1  #SIM=$2

LEN=$2

PRIMER=$3

TAGID=$4


#CVALS=($(echo $SIM | tr '-' ' ')) #echo ${CVALS[@]}

echo "-----------------------------------------------------------"

echo "Parameters: "

echo "  WD:  $WD"

echo "  Directory where data are: $DIRraw"

echo "  OTU-Similarity:  0.99 (fixed in the clclassseqv call below)"

echo "  Cut-off-length:  $LEN"

echo "-----------------------------------------------------------"


cd "$WD/$DIRraw" || exit 1

mkdir -p ./1_rmprimers

if [ -e "$WD/$DIRraw/2_combined" ]; then

    echo "2_combined had been created !!"

    :

else

    echo "Processing ... clsplitseq, clconcatpairv, clfilterseqv and cldenoiseseqd (DADA2) !!"


##### clsplitseq: Remove primer-seqs

# obtain Primer files

Fprimer=$(ls -1 ~/local/share/primers/ | grep "$PRIMER" | grep "Fprimer")

Rprimer=$(ls -1 ~/local/share/primers/ | grep "$PRIMER" | grep "Rprimer")


cp ~/local/share/primers/$Fprimer $WD/$DIRraw/

cp ~/local/share/primers/$Rprimer $WD/$DIRraw/


# Obtain RunID

FILES=$(ls -1 | grep ".fastq")    # FILES=$(ls -1 | grep ".fastq.gz")


RunIDs0=()

for FASTQ in $FILES; do

    RunID=$(echo ${FASTQ%%L001_*})

    RunIDs0=("${RunIDs0[@]}" $RunID)

done

RunIDs=$(printf '%s\n' "${RunIDs0[@]}" | sort -u)

echo "-----------------------------------------------------------"

echo "RunIDs: "

echo "$RunIDs"

echo "-----------------------------------------------------------"


# clsplitseq: Remove primers each RunID

for ELM1 in $RunIDs; do


FFILE=$(ls | grep "$ELM1" | grep "R1")

RFILE=$(ls | grep "$ELM1" | grep "R2")

echo "-----------------------------------------------------------"

echo "F-FILE: $FFILE"

echo "R-FILE: $RFILE"

echo "-----------------------------------------------------------"


#--indexname=$TAGID \

clsplitseq \

--runname=$ELM1 \

--tagname=$TAGID \

--primerfile=$Fprimer \

--reverseprimerfile=$Rprimer \

--truncateN=ENABLE \

--append \

--numthreads=12 \

$FFILE \

$RFILE \

./1_rmprimers

done


# clconcatpairv: Make combined seq

clconcatpairv \

--mode=OVL \

--numthreads=12 \

--minovllen=10 \

./1_rmprimers \

./2_combined


# clfilterseqv: Remove low-quality seqs

#CMBFQ=$(ls -1 ./2_combined/ | grep ".fastq.gz" | grep -v "filt" | grep -v "undetermined")

#for ELM2 in $CMBFQ; do

#COMBed=$(echo './2_combined/'"$ELM2")

#FILTed=$(echo './2_combined/'${ELM2%.fastq*}'.filt.fastq.gz')

clfilterseqv \

--maxnee=1 \

--minlen=$LEN \

--numthreads=12 \

./2_combined \

./3_filtered


# cldenoiseseqd: Denoising using DADA2

cldenoiseseqd \

--numthreads=12 \

./3_filtered \

./4_denoised


# clremovechimev: Remove chimera using UCHIME3 with silva138.1SSUref

clremovechimev \

--mode=BOTH \

--uchimedenovo=3 \

--referencedb=silva138.1SSUref \

--numthreads=12 \

./4_denoised \

./5_rmchimera


fi


echo "------------------------------"

echo "Finish: Denoising by DADA2 and Chimera remove !!"

echo "------------------------------"



##### 2nd Part #####

# clclassseqv: Clustering

clclassseqv \

--minident=0.99 \

--strand=plus \

--numthreads=12 \

./5_rmchimera/nonchimeras.fasta \

./6_OTU99


echo "------------------------------"

echo "Finish: Clustering !!"

echo "------------------------------"



###### Part3 ######

# Identification1: SINA


# SINA

SINADBF=$(ls -1 ~/local/db/ | grep "SILVA" | grep "NR99" | grep ".arb$")

SINADB_PATH="$HOME/local/db/$SINADBF"


echo "#----------------------------------"

echo "SINA_DB file: $SINADBF"

echo "SINA_DB path: $SINADB_PATH"

echo "#----------------------------------"


#----- SINA for DADA2 ASV

cd $WD/$DIRraw/5_rmchimera

INFILE4SINA=$(echo 'nonchimeras.fasta')

MATRIX=$(echo 'nonchimeras.tsv')

OFILESINA1=$(echo 'asv_alsina.fas')

OFILESINA2=$(echo "${OFILESINA1%.*}.csv")


sina -i $INFILE4SINA -o $OFILESINA1 --meta-fmt csv --threads 12 --db $SINADB_PATH --fs-kmer-len 10 --search --search-min-sim 0.65 --lca-fields tax_slv


#-- Add SINAid to data-matrix

perl ~/local/bin/sinaid2clres.perl $OFILESINA2 $MATRIX


SINAWIDF=$(ls -1 | grep "_wrnmx_trps.csv" | tr -d '\n')


#-- Making ARB data 

perl ~/local/bin/algnfas2gb4arbinport.perl $OFILESINA1 $SINAWIDF

#-- Making a SUMMARY

perl ~/local/bin/length_fasta.perl > length_fas_histgram.csv



#----- SINA for OTU99

cd $WD/$DIRraw/6_OTU99

INFILE4SINA=$(echo 'clustered.fasta')

MATRIX=$(echo 'clustered.tsv')

OFILESINA1=$(echo 'otu99_alsina.fas')

OFILESINA2=$(echo "${OFILESINA1%.*}.csv")


sina -i $INFILE4SINA -o $OFILESINA1 --meta-fmt csv --threads 12 --db $SINADB_PATH --fs-kmer-len 10 --search --search-min-sim 0.65 --lca-fields tax_slv 


#-- Add SINAid to data-matrix

perl ~/local/bin/sinaid2clres.perl $OFILESINA2 $MATRIX


SINAWIDF=$(ls -1 | grep "_wrnmx_trps.csv" | tr -d '\n')


#-- Making ARB data 

perl ~/local/bin/algnfas2gb4arbinport.perl $OFILESINA1 $SINAWIDF

#-- Making a SUMMARY

perl ~/local/bin/length_fasta.perl > length_fas_histgram.csv



echo "------------------------------"

echo "Finish: SINA Search !!"

echo "  Alignment.fasta : $OFILESINA1"

echo "  Summary of SINA : $OFILESINA2"

echo "  Matrix(W-IDs)   : $SINAWIDF"

echo "  Alignment4ARB.gb: "

echo "------------------------------"
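
The RunID step of the script above strips everything from `L001_` onward from each FASTQ file name and then removes duplicates, so that R1 and R2 of one sample give a single RunID. That logic can be sketched as a standalone function for testing file-name patterns before a run. `unique_run_ids` is a hypothetical name, and, as in the script, the trailing underscore before `L001` stays in the RunID.

```shell
#!/bin/bash
# Derive unique RunIDs from Illumina FASTQ file names by cutting at "L001_".
# Hypothetical helper mirroring the RunID loop of the script above.
unique_run_ids() {
    for f in "$@"; do
        echo "${f%%L001_*}"      # e.g. H19M03D10Bac_S285_L001_R1_001.fastq.gz -> H19M03D10Bac_S285_
    done | sort -u
}

unique_run_ids \
    H19M03D10Bac_S285_L001_R1_001.fastq.gz \
    H19M03D10Bac_S285_L001_R2_001.fastq.gz
# prints H19M03D10Bac_S285_
```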