PeakSeq For CHIP-seq v1

Original paper link: http://www.nature.com/nbt/journal/v27/n1/abs/nbt.1518.html Software website http://www.gersteinlab.org/proj/PeakSeq/

I have modified the source code of the software, so that it needs minimum intervention to run the software for different projects. Please follow the steps to analyze CHIP-seq data on Oscar. I used Drosophila melanogaster CHIP-seq data as an example.


Note: You can copy and paste all the text in blue to your Linux command line to run. Anything with "#" is comment, and will be IGNORED by Linux.

#Because some scripts need large memory, also some of them can run run multiple instances in parallel to speed up the analysis, I prefer to use the smp nodes
ssh smp007

#check if the node is busy. If yes, try other smp nodes. The available nodes are from "smp001" to "smp007"
bash
top

# Create a folder to work within. Notice that files in the scratch folder more than 4 weeks old will be automatically deleted
.
mkdir scratch/peakseq-test &&  cd scratch/peakseq-test

#create a subfolder to download the reference genome and prepare mappability files:
mkdir reference && cd reference

#download the fasta reference files  You can take a look at the files in here.
#Since PeakSeq doesn't provide a browser, you should check what version of reference is available on your browser, and download the same version of reference as the browser.
#We need all the chromosome sequences, but each file should only contain one chromosome sequence as required by PeakSeq.
#Here, after I download all the fasta files, I will delete the file dmel-all-chromosome-r5.22.fasta.gz which contains all the sequences for all the chromosomes in the single file.

wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.22_FB2009_09/fasta/*chro* && rm dmel-all-chromosome-r5.22.fasta.gz

#unzip and rename the .fasta files to .fa as required by PeakSeq

gunzip *.gz && rename fasta fa *fasta

# In the directory containing the fa files, do (note the number 36 is the length of the alignment. You should change it to the same as your aligner setting.).
/gpfs/runtime/bioinfo/peakseq/mappability/compile_new.py 36

#open the chromosome list file "chr-list.txt" which was generated in the last step, remove the the chromosomes you don't care about in your CHIP study. In my case, this is my original list:
less ../chr-list.txt

2R 2RHet 3L 3LHet 3R 3RHet 4 dmel_mitochondrion_genome X XHet YHet 2L Uextra 2LHet U

# Here I only care about the following chromosomes. Others are not my focus. So after I edit the file, its content is:
#note: you can also only do one chromosome first, then do others after you are confident about the first chromosome's results.

less ../chr-list.txt


2L 2R 3L 3R 4  X  

#run the script to create the mappability file.
/gpfs/runtime/bioinfo/peakseq/mappability/map_new.py  


#get the list of genome fasta files and prepare bowtie reference files, notice  'dmel5.22' is my reference name which will be needed by bowtie. You can name it by any name you like.
fastas=` ls *.fa | tr '\n' ','`   && /gpfs/runtime/bioinfo/bowtie/bowtie-build -f $fastas dmel5.22

#then move the reference files into the data folder for bowtie
mv *.ebwt ..

#copy .sequence.txt files from Gerald output. In my case, file s_3_sequence contains read sequences from my input DNA and file s_1_sequence.txt contains my CHIP-seq reads.
#It is good idea to start from only one lane for the sample, and one lane for the control. We will do other lanes after getting good results from the first two lanes.

cd .. && cp .../GERALD_09-07-2010_xyz/s_3_sequence.txt . &&  cp ../GERALD_09-07-2010_xyz/s_1_sequence.txt .

#create a lane list which will be used later on by PeakSeq. In the current case, I only have one sample lane 1 and one control lane 3
echo sample lanes: 1 control lane: 3 > lane-list.txt


#run the bowtie aligner, 8 CPU, print time, not allow mismatch, find unique match, first base marked as 1   the job can also submit to cluster
#for the full list of command line parameters, use command: bowtie --help

read_file=s_1_sequence.txt && bowtie_file=`echo $read_file | sed 's/_sequence.txt/.bowtie/g'`
/gpfs/runtime/bioinfo/bowtie/bowtie dmel5.22 -p 8  -t -v 1 -m 1 -B 1 --phred64-quals $read_file  > $bowtie_file

#run the bowtie aligner, 8 CPU, print time, not allow mismatch, find unique match, first base marked as 1   the job can also submit to cluster
#for the full list of command line parameters, use command: bowtie --help
read_file=s_3_sequence.txt && bowtie_file=`echo $read_file | sed 's/_sequence.txt/.bowtie/g'`
/gpfs/runtime/bioinfo/bowtie/bowtie dmel5.22 -p 8  -t -v 1 -m 1 -B 1 --phred64-quals $read_file  > $bowtie_file

#use the following bash script to separate the bowtie output into each chromosome (notice column 3 of the bowtie output file is chromosome id)
#the awk command need bash shell to work correctly
bash
awk -F$'\t' '$3!="*" {print $0 > "split_s_1_"$3".bowtie"}' s_1.bowtie

awk -F$'\t' '$3!="*" {print $0 > "split_s_3_"$3".bowtie"}' s_3.bowtie

#calculate signal map for the sample lane, fragment size is the is the average length of the DNA fragments in the sequenced DNA library. You need set the number according to your library
#for the full list of command line parameters, use command: create_signal_map_new.pl --help
#You can run multiple instances of the script in parallel to speed up the analysis according to the available computing resources
/gpfs/runtime/bioinfo/peakseq/preprocessing/create_signal_map_new.pl --fragment_size 450&

/gpfs/runtime/bioinfo/peakseq/preprocessing/create_signal_map_new.pl --fragment_size 450&
/gpfs/runtime/bioinfo/peakseq/preprocessing/create_signal_map_new.pl --fragment_size 450

#After all the instances all finish successfully, a text file named "create_signal_map_all_done.txt" should exist in the current folder.
ls create_signal_map_all_done.txt
create_signal_map_all_done.txt
#If you don't see this file, run ls start0* to check which one is not done and figure out the problem. Then delete the start0_xxxx file, re-run the script.

#calculate signal map, this will do simulation, estimate false discovery rate and print out results
#for the full list of command line parameters, use command: score_hits_new.pl --help
#You can run multiple instances of the script in parallel to speed up the analysis according to the available computing resources
/gpfs/runtime/bioinfo/peakseq/peakseq/score_hits_new.pl --fragment_size 450 &
/gpfs/runtime/bioinfo/peakseq/peakseq/score_hits_new.pl --fragment_size 450 &
/gpfs/runtime/bioinfo/peakseq/peakseq/score_hits_new.pl --fragment_size 450

#After all the instances finish successfully, a text file named "score_hits_all_done.txt" should exist in the current folder.

ls score_hits_all_done.txt
score_hits_all_done.txt
#If you don't see this file, run ls start1* to check which one is not done and figure out the problem. Then delete the start1_xxxx file, re-run the script.

#find the candidate peaks which are not in control, this will do linear regression, calculate P value from binomial distribution and print out the candidates.
#for the full list of command line parameters, use command: filter_hits_new.pl --help
#You can run multiple instances of the script in parallel to speed up the analysis according to the available computing resources.
/gpfs/runtime/bioinfo/peakseq/peakseq/filter_hits_new.pl --fragment_size 450&
/gpfs/runtime/bioinfo/peakseq/peakseq/filter_hits_new.pl --fragment_size 450&
/gpfs/runtime/bioinfo/peakseq/peakseq/filter_hits_new.pl --fragment_size 450 

#After all the instances finish successfully, a text file named "filter_hits_all_done.txt" should exist in the current folder.
ls filter_hits_all_done.txt
filter_hits_all_done.txt
#If you don't see this file, run ls start2* to check which one is not done and figure out the problem. Then delete the start2_xxxx file, re-run the script.

 #adjust the p value for the candidates
/gpfs/runtime/bioinfo/peakseq/peakseq/adjust_and_select_hits_new.pl

#download the gff files, which tell the position of the genes on  the chromosome
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.22_FB2009_09/gff/*.gff.gz

#unzip all the gff files, find the gene starting points for all the chromosomes  #This is not part of the original PeakSeq,  I still in alpha version.   
gunzip *.gff.gz && cat *.gff | grep 'gene' | /gpfs/runtime/bioinfo/peakseq/peakseq/get_gene_position_from_gff.pl

#find the nearest gene starting points next to the candidate peaks. #This is not part of the original PeakSeq,  I still in alpha version.   
/gpfs/runtime/bioinfo/peakseq/peakseq/query_gene_near_peak.pl

#next, prepare data for IGV from Broad to review the data in browser.
#convert bowtie to sam
/gpfs/runtime/bioinfo/samtools/misc/bowtie2sam.pl s_1.bowtie > s_1.sam &
/gpfs/runtime/bioinfo/samtools/misc/bowtie2sam.pl s_3.bowtie > s_3.sam

#sort and index the files for IGV
# sort the alignment file using igv tools
/gpfs/runtime/bioinfo/igvtools/igvtools sort s_1.sam s_1_sorted.sam &
/gpfs/runtime/bioinfo/igvtools/igvtools sort s_3.sam s_3_sorted.sam

#index the sorted sam alignemnt file
/gpfs/runtime/bioinfo/igvtools/igvtools index s_1_sorted.sam &
/gpfs/runtime/bioinfo/igvtools/igvtools index s_3_sorted.sam


#the sorted and indexed sam files can be viewed using IGV from Broad: http://www.broadinstitute.org/igv/



PeakSeq Article abstract


Nature Biotechnology 27, 66 - 75 (2009)
Published online: 4 January 2009 | doi:10.1038/nbt.1518

PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls

Joel Rozowsky1, Ghia Euskirchen2, Raymond K Auerbach3, Zhengdong D Zhang1, Theodore Gibson1, Robert Bjornson4, Nicholas Carriero4, Michael Snyder1,2 & Mark B Gerstein1,3,4


Abstract

Chromatin immunoprecipitation (ChIP) followed by tag sequencing (ChIP-seq) using high-throughput next-generation instrumentation is fast, replacing chromatin immunoprecipitation followed by genome tiling array analysis (ChIP-chip) as the preferred approach for mapping of sites of transcription-factor binding and chromatin modification. Using two deeply sequenced data sets for human RNA polymerase II and STAT1, each with matching input-DNA controls, we describe a general scoring approach to address unique challenges in ChIP-seq data analysis. Our approach is based on the observation that sites of potential binding are strongly correlated with signal peaks in the control, likely revealing features of open chromatin. We develop a two-pass strategy called PeakSeq to compensate for this. A two-pass strategy compensates for signal caused by open chromatin, as revealed by inclusion of the controls. The first pass identifies putative binding sites and compensates for genomic variation in the 'mappability' of sequences. The second pass filters out sites not significantly enriched compared to the normalized control, computing precise enrichments and significances. Our scoring procedure enables us to optimize experimental design by estimating the depth of sequencing required for a desired level of coverage and demonstrating that more than two replicates provides only a marginal gain in information.


Paper link: http://www.nature.com/nbt/journal/v27/n1/abs/nbt.1518.html Software website http://www.gersteinlab.org/proj/PeakSeq/




Species other than fruit fly

 
For projects other than fruit fly, you can prepare the mappability file and the reference yourself. I modified the source code of the software, so that it needs minimum intervention to run the software for different projects. Please follow the steps to analyze CHIP-seq data on Oscar. I used Drosophila melanogaster CHIP-seq data as an example.

#Because some scripts need large memory, also some of them can run
run multiple instances in parallel to speed up the analysis, I prefer to use the smp nodes
ssh smp007

#check if the node is busy. If yes, try other smp nodes. The available nodes are from "smp001" to "smp007"
top

# Create a folder to work within. Notice that files in the scratch folder more than 4 weeks old will be automatically deleted
.
mkdir scratch/peakseq-test &&  cd scratch/peakseq-test

##work on chromosome mappability file

#create a subfolder to download the reference genome and prepare mappability files:
mkdir reference && cd reference

#download the fasta reference files  You can take a look at the files in here.
#Since PeakSeq doesn't provide a browser, you should check what version of reference is available on your browser, and download the same version of reference as the browser.
#We need all the chromosome sequences, but each file should only contain one chromosome sequence as required by PeakSeq.
#Here, after I download all the fasta files, I will delete the file dmel-all-chromosome-r5.22.fasta.gz which contains all the sequences for all the chromosomes in the single file.

wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.22_FB2009_09/fasta/*chro* && rm dmel-all-chromosome-r5.22.fasta.gz

#unzip and rename the .fasta files to .fa as required by PeakSeq

gunzip *.gz && rename fasta fa *fasta

# In the directory containing the fa files, do (note the number 36 is the length of the alignment. You should change it to the same as your aligner setting.).
/gpfs/runtime/bioinfo/peakseq/mappability/compile_new1.py 36

#run the script to create the mappability file.
/gpfs/runtime/bioinfo/peakseq/mappability/map_new1.py  


#prepare reference file for bowtie
#get the list of genome fasta files and prepare bowtie reference files, notice 'dmel5.22' is my reference name which will be needed by bowtie. You can name it by any name you like.

fastas=` ls *.fa | tr '\n' ','`   && /gpfs/runtime/bioinfo/bowtie/bowtie-build -f $fastas dmel5.22

#then move the reference files into the data folder for bowtie
mv *.ebwt .. && cd ..

#next, I will run bowtie aligner and align my CHIP sample lane (lane 1) and input DNA lane (lane 3) to fruit fly reference genome.
#It is good idea to start from only one lane for the sample, and one lane for the control. We will do other lanes after getting good results from the first two lanes.

#Here I use GERALD ouput file in _sequence.txt format as my read files. If you have raw qseq.txt format from
Bustard, the following command can do the conversion for lane 1:
#       read_files=
/xyz/xyz/Bustard07-10-2010_xyz/s_1_1_*_qseq.txt 
#       cat $read_files | perl /gpfs/runtime/bioinfo/bin/qseq2fastq.pl > s_1_sequence.txt

#run the bowtie aligner, 8 CPU, print time, not allow mismatch, find unique match, first base marked as 1   the job can also submit to cluster
#for the full list of command line parameters, use command: bowtie --help

read_file= .../GERALD_08-01-2010_xyz/s_1_sequence.txt
bowtie_file=s_1.bowtie

/gpfs/runtime/bioinfo/bowtie/bowtie dmel5.22 -p 8  -t -v 1 -m 1 -B 1 --phred64-quals $read_file  > $bowtie_file

#run the bowtie aligner, 8 CPU, print time, not allow mismatch, find unique match, first base marked as 1   the job can also submit to cluster
#for the full list of command line parameters, use command: bowtie --help
read_file= .../GERALD_08-01-2010_xyz/s_3_sequence.txt
bowtie_file=s_3.bowtie

/gpfs/runtime/bioinfo/bowtie/bowtie dmel5.22 -p 8  -t -v 1 -m 1 -B 1 --phred64-quals $read_file  > $bowtie_file

#use the following bash script to separate the bowtie output into each chromosome (notice column 3 of the bowtie output file is chromosome id)
bash
awk -F$'\t' '$3!="*" {print $0 > "split_s_1_"$3".bowtie"}' s_1.bowtie

awk -F$'\t' '$3!="*" {print $0 > "split_s_3_"$3".bowtie"}' s_3.bowtie

#append default configure file to the fonfigure file in current folder
cat /gpfs/runtime/bioinfo/peakseq/data/config-default.txt >> config.txt

#edit the configure file to reflect your analysis need. For me, my sample lane is 1 and control lane is 3. My average fragment size is 500bp. 
#the default chromosome is 2L 2R 3L 3R 4  and X. It is good idea to  use only one small chromosome (like 4) to test the software and before run all the chromosomes. 
vi config.txt


#calculate signal map for the sample lane, fragment size is the is the average length of the DNA fragments in the sequenced DNA library. You need set the number according to your library
#for the full list of command line parameters, use command: create_signal_map_new.pl --help
#You can run multiple instances of the script in parallel to speed up the analysis according to the available computing resources
/gpfs/runtime/bioinfo/peakseq/preprocessing/create_signal_map_new1.pl &

/gpfs/runtime/bioinfo/peakseq/preprocessing/create_signal_map_new1.pl &
/gpfs/runtime/bioinfo/peakseq/preprocessing/create_signal_map_new1.pl

#After all the instances all finish successfully, a text file named "create_signal_map_all_done.txt" should exist in the current folder.
ls create_signal_map_all_done.txt
create_signal_map_all_done.txt
#If you don't see this file, run ls start_crteate_signal_map* to check which one is not done and figure out the problem. Then delete the start_create_signal_map_xxxx file, re-run the script.

#calculate signal map, this will do simulation, estimate false discovery rate and print out results
#for the full list of command line parameters, use command: score_hits_new.pl --help
#You can run multiple instances of the script in parallel to speed up the analysis according to the available computing resources
/gpfs/runtime/bioinfo/peakseq/peakseq/score_hits_new1.pl &
/gpfs/runtime/bioinfo/peakseq/peakseq/score_hits_new1.pl &
/gpfs/runtime/bioinfo/peakseq/peakseq/score_hits_new1.pl

#After all the instances finish successfully, a text file named "score_hits_all_done.txt" should exist in the current folder.

ls score_hits_all_done.txt
#If you don't see this file, run ls start_score_hits* to check which one is not done and figure out the problem. Then delete the start_score_hits_xxxx file, re-run the script.

#find the candidate peaks which are not in control, this will do linear regression, calculate P value from binomial distribution and print out the candidates.
#for the full list of command line parameters, use command: filter_hits_new.pl --help
#You can run multiple instances of the script in parallel to speed up the analysis according to the available computing resources.
/gpfs/runtime/bioinfo/peakseq/peakseq/filter_hits_new1.pl &
/gpfs/runtime/bioinfo/peakseq/peakseq/filter_hits_new1.pl &
/gpfs/runtime/bioinfo/peakseq/peakseq/filter_hits_new1.pl  

#After all the instances finish successfully, a text file named "filter_hits_all_done.txt" should exist in the current folder.
ls filter_hits_all_done.txt
filter_hits_all_done.txt
#If you don't see this file, run ls start_filter_hits* to check which one is not done and figure out the problem. Then delete the start_filter_hits_xxxx file, re-run the script.

 #adjust the p value for the candidates
/gpfs/runtime/bioinfo/peakseq/peakseq/adjust_and_select_hits_new1.pl

#download the gff files, which tell the position of the genes on  the chromosome
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.22_FB2009_09/gff/*.gff.gz

#unzip all the gff files, find the gene starting points for all the chromosomes  #This is not part of the original PeakSeq,  I still in alpha version.   
gunzip *.gff.gz && cat *.gff | grep 'gene' | /gpfs/runtime/bioinfo/peakseq/peakseq/get_gene_position_from_gff.pl

#find the nearest gene starting points next to the candidate peaks. #This is not part of the original PeakSeq,  I still in alpha version.   
/gpfs/runtime/bioinfo/peakseq/peakseq/query_gene_near_peak.pl

#take a look at the results. You can also use MS Excel or other spreadsheet software to check the data
#the last column could be directly copied and pasted to IGV as the chromosome region
less candi_final_with_gene.txt

#to make a chart to present where the peaks located relative to genes on each chromosom
#this is a R script, which will create a pdf file. It contains one chart for each chromosome.
#There is four curves on each chart. Each curve represents a quarter of total peaks, according peak height.
#open the pdf file: peak_location_by_height_per_chr.pdf using Adobe Reader to see the result.
Rscript /gpfs/runtime/bioinfo/peakseq/peakseq/chip-seq_peak_location_by_peak_height_per_chr.r


#to summarize the stantistic of the candidate peaks. #This is not part of the original PeakSeq,  I still in alpha version.   

/gpfs/runtime/bioinfo/peakseq/peakseq/summarize_peak1.pl

#take a look at the summary.
You can also use MS Excel or other spreadsheet software to check the data
less summmarize_peak_result.txt

#next, prepare data for IGV from Broad to review the data in browser.
#convert bowtie to sam
/gpfs/runtime/bioinfo/samtools/misc/bowtie2sam.pl s_1.bowtie > s_1.sam &
/gpfs/runtime/bioinfo/samtools/misc/bowtie2sam.pl s_3.bowtie > s_3.sam

#sort and index the files for IGV
# sort the alignment file using igv tools
/gpfs/runtime/bioinfo/igvtools/igvtools sort s_1.sam s_1_sorted.sam &
/gpfs/runtime/bioinfo/igvtools/igvtools sort s_3.sam s_3_sorted.sam

#index the sorted sam alignemnt file
/gpfs/runtime/bioinfo/igvtools/igvtools index s_1_sorted.sam &
/gpfs/runtime/bioinfo/igvtools/igvtools index s_3_sorted.sam


#the sorted and indexed sam files can be viewed using IGV from Broad: http://www.broadinstitute.org/igv/