PeakSeq For CHIP-seq v2

Original paper link: http://www.nature.com/nbt/journal/v27/n1/abs/nbt.1518.html Software website http://www.gersteinlab.org/proj/PeakSeq/

PeakSeq For Drosophila melanogaster CHIP-seq data (for other species, click this link)

For fruit fly, I have created the mappability file for 36 bp read mapping based on genome reference dmel5.22 from www.flybase.org. I also modified the source code of the software, so that it needs minimum intervention to run the software. Please follow the steps to analyze CHIP-seq data on Oscar.

Note: You can copy and paste all the text in blue to your Linux command line to run. Anything with "#" is comment, and will be IGNORED by Linux.

# Request an interactive node to run the software. Here I request, 4 CPU core, 20G memory and 2 hours computing time. You may need wait some time if someone else is using the nodes. 
# You can adjust the number according to the size of your data set.
# Please note that it is a good idea to request a little more time, otherwise, you job may get terminated before it finishes.  
interact -n 4 -m 20g -t 2:00:00 -q timeshare

# Create a folder to work within. Notice that files in the scratch folder more than 4 weeks old will be automatically deleted.
mkdir scratch/peakseq-test &&  cd scratch/peakseq-test

1) Run aligner/mapper "Bowtie" on CHIP and INPUT

# Next, I will run bowtie aligner and align my CHIP sample lane (lane 1) and input DNA lane (lane 3) to fruit fly reference genome.
# It is good idea to start from only one lane for the sample, and one lane for the control. We will do other lanes after getting good results from the first two lanes.

# Here I use fastq files as read files. If you have raw qseq.txt format from
Bustard, the following command can do the conversion for lane 1:
#       read_files=
/xyz/xyz/Bustard07-10-2010_xyz/s_1_1_*_qseq.txt 
#       cat $read_files | perl /gpfs/runtime/bioinfo/bin/qseq2fastq.pl > s_1.fq 

# Run the bowtie aligner, 8 CPU, print time, not allow mismatch, find unique match, first base marked as 1   the job can also submit to cluster
# For the full list of command line parameters, use command: bowtie --help

read_file= ~/my/path/to/fastq_files/s_1.fq
bowtie_file=s_1.bowtie
/gpfs/runtime/bioinfo/bowtie/bowtie dmel5.22 -p 8  -t -v 1 -m 1 -B 1 --phred33-quals $read_file  > $bowtie_file

# Run the bowtie aligner, 8 CPU, print time, not allow mismatch, find unique match, first base marked as 1   the job can also submit to cluster
# For the full list of command line parameters, use command: bowtie --help
read_file= ~/my/path/to/fastq_files/s_3.fq
bowtie_file=s_3.bowtie
/gpfs/runtime/bioinfo/bowtie/bowtie dmel5.22 -p 8  -t -v 1 -m 1 -B 1 --phred33-quals $read_file  > $bowtie_file

2) separate alignment results into smaller files according to the reference sequence

# Use the following bash script to separate the bowtie output into each chromosome (notice column 3 of the bowtie output file is chromosome id)

# The awk command need bash shell to work correctly, if you your default shell is not bash, please run command "bash" first.
awk -F$'\t' '$3!="*" {print $0 > "split_s_1_"$3".bowtie"}' s_1.bowtie
awk -F$'\t' '$3!="*" {print $0 > "split_s_3_"$3".bowtie"}' s_3.bowtie

3) Create signal map (pileups) according to the alignment results

# Copy the mappability file for fruit fly 36bp read mapping to the the current folder
cp /gpfs/runtime/bioinfo/peakseq/data/Mappability-dmel5.22-36bp.txt Mappability.txt

# Copy configure file, which contain the chromosome sizes, chromosome ID of your interest, the CHIP-seq lane ID and input DNA lane ID.
cp /gpfs/runtime/bioinfo/peakseq/data/config-dmel5.22.txt config.txt

# Edit the configure file to reflect your analysis need. For me, my sample lane is 1 and control lane is 3. My average fragment size is 500bp. 
# The default chromosome is 2L 2R 3L 3R 4  and X. It is good idea to  use only one small chromosome (like 4) to test the software and before run all the chromosomes. 
vi config.txt
 
# Calculate signal map for the sample lane, fragment size is the is the average length of the DNA fragments in the sequenced DNA library. You need set the number according to your library
# For the full list of command line parameters, use command: create_signal_map_new.pl --help
# You can run multiple instances of the script in parallel to speed up the analysis according to the available computing resources
/gpfs/runtime/bioinfo/peakseq/preprocessing/create_signal_map_new1.pl &
/gpfs/runtime/bioinfo/peakseq/preprocessing/create_signal_map_new1.pl &
/gpfs/runtime/bioinfo/peakseq/preprocessing/create_signal_map_new1.pl

# After all the instances all finish successfully, a text file named "create_signal_map_all_done.txt" should exist in the current folder.
ls create_signal_map_all_done.txt
create_signal_map_all_done.txt
# If you don't see this file, run ls start_create_signal_map* to check which one is not done and figure out the problem. Then delete the corresponding start_create_signal_map_xxxx file, re-run the script.

4) Identify peaks and score them

#calculate signal map, this will do simulation, estimate false discovery rate and print out results

#for the full list of command line parameters, use command: score_hits_new.pl --help
#You can run multiple instances of the script in parallel to speed up the analysis according to the available computing resources
/gpfs/runtime/bioinfo/peakseq/peakseq/score_hits_new1.pl &
/gpfs/runtime/bioinfo/peakseq/peakseq/score_hits_new1.pl &
/gpfs/runtime/bioinfo/peakseq/peakseq/score_hits_new1.pl


# After all the instances finish successfully, a text file named "score_hits_all_done.txt" should exist in the current folder.
ls score_hits_all_done.txt
score_hits_all_done.txt
# If you don't see this file, run ls start_score_hits* to check which one is not done and figure out the problem. Then delete the corresponding start_score_hits_xxxx file, re-run the script.

5) Filter peaks and adjust p values

# Find the candidate peaks which are not in control, this will do linear regression, calculate P value from binomial distribution and print out the candidates.

# For the full list of command line parameters, use command: filter_hits_new.pl --help
# You can run multiple instances of the script in parallel to speed up the analysis according to the available computing resources.
/gpfs/runtime/bioinfo/peakseq/peakseq/filter_hits_new1.pl &
/gpfs/runtime/bioinfo/peakseq/peakseq/filter_hits_new1.pl &
/gpfs/runtime/bioinfo/peakseq/peakseq/filter_hits_new1.pl  


# After all the instances finish successfully, a text file named "filter_hits_all_done.txt" should exist in the current folder.
ls filter_hits_all_done.txt
filter_hits_all_done.txt
# If you don't see this file, run ls start_filter_hits* to check which one is not done and figure out the problem. Then delete the corresponding start_filter_hits_xxxx file, re-run the script.

# Adjust the p value for the candidates
/gpfs/runtime/bioinfo/peakseq/peakseq/adjust_and_select_hits_new1.pl

6) Identify peaks near genes and summarize the result

# Copy gene position file to current folder

cp /gpfs/runtime/bioinfo/peakseq/data/gene_location-dmel5.22.txt gene_location.txt

# Find the nearest gene starting points next to the candidate peaks. #This is not part of the original PeakSeq,  It still in alpha version.   
/gpfs/runtime/bioinfo/peakseq/peakseq/query_gene_near_peak.pl

# Take a look at the results. You can also use MS Excel or other spreadsheet software to check the data
# The last column could be directly copied and pasted to IGV as the chromosome region
less candi_final_with_gene.txt

# To summarize the stantistic of the candidate peaks. #This is not part of the original PeakSeq,  I still in alpha version.   
/gpfs/runtime/bioinfo/peakseq/peakseq/summarize_peak1.pl

# Take a look at the summary. You can also use MS Excel or other spreadsheet software to check the data
less summmarize_peak_result.txt

7) prepare sam files to be reviewed by IGV

# Next, prepare data for IGV from Broad to review the data in browser.
# convert bowtie to sam  
# Please ignore the warnings "Warning: use uninitialized value in xxx"
/gpfs/runtime/bioinfo/samtools-0.1.8/misc/bowtie2sam.pl s_1.bowtie > s_1.sam &
/gpfs/runtime/bioinfo/samtools-0.1.8/misc/bowtie2sam.pl s_3.bowtie > s_3.sam


# sort and index the files for IGV
# sort the alignment file using igv tools
/gpfs/runtime/bioinfo/igvtools/igvtools sort s_1.sam s_1_sorted.sam &
/gpfs/runtime/bioinfo/igvtools/igvtools sort s_3.sam s_3_sorted.sam

# index the sorted sam alignemnt file
/gpfs/runtime/bioinfo/igvtools/igvtools index s_1_sorted.sam &
/gpfs/runtime/bioinfo/igvtools/igvtools index s_3_sorted.sam


8) Review "sam" file, "wig" files by IGV

# The sorted and indexed sam files can be viewed using IGV from Broad: http://www.broadinstitute.org/igv/
# The step: /gpfs/runtime/bioinfo/peakseq/peakseq/adjust_and_select_hits_new1.pl also create two wig files for reviewing the result in IGV together with the sam files side by side.
# wig_file_for_igb_with_adjust_P_sample_s_1.wig contains enrichment data for the final candidate peaks.
# wig_file_for_igb_without_adjust_P_sample_xxx.wig contains enrichment data for the peaks passed all the criteria except that the p value is not adjusted. So this file have more peaks.

9) Plot peak locations relative to gene grouped by peak height using R script

# To make a chart to present where the peaks located relative to genes on each chromosom

# This is a R script, which will create a pdf file. It contains one chart for each chromosome.
# There is four curves on each chart. Each curve represents a quarter of total peaks, according peak height.
# Open the pdf file: peak_location_by_height_per_chr.pdf using Adobe Reader to see the result.

Rscript /gpfs/runtime/bioinfo/peakseq/peakseq/chip-seq_peak_location_by_peak_height_per_chr.r

10) Plot a segment of chromosome with gene structure and sample data

# To plot a chromosome segment together with chip-seq and control data, you can run the following command to create a pdf file "chromosome region on chr X from xxx to xxx.pdf".
# You can change the values sample, control, chr, from and to to plot the chromosome region of your interest.


export sample=s_1 control=s_3 chr=X from=5671000 to=5691000

Rscript /gpfs/runtime/bioinfo/peakseq/peakseq/chip-seq_chromosome_region_chart_with_sample_data.r


# Take a look at the figure. Change the parameters if needed.


11) run MEME to identify motifs on top 10 high peaks

# To prepare the fasta format sequence file as MEME's input, run the following command. This will get the sequence for the top 5 high peaks for each chromosome
Rscript /gpfs/runtime/bioinfo/peakseq/peakseq/chip-seq_prepare_fasta_for_meme_to_find_motif_5_peaks_per_chr.r

# To get the top 10 peaks for each chromosome, run this command. This will be longer for meme to look for mortifs because there are 10 sequences per chromosome.
Rscript /gpfs/runtime/bioinfo/peakseq/peakseq/chip-seq_prepare_fasta_for_meme_to_find_motif_10_peaks_per_chr.r


# Load meme module
module load meme

# Call meme to look for motifs, for a full list of available parameters, run 
meme  --help 
meme genome_seq_for_meme_to_look_for_motif.fasta -dna -maxsize 4000000

# MEME will create a folder called meme_out, which contains a file name meme.html. Open it with a browser and check the results.

Species other than fruit fly

 
Coming soon or refer the bottom part of version 1.

PeakSeq Article abstract


Nature Biotechnology 27, 66 - 75 (2009)
Published online: 4 January 2009 | doi:10.1038/nbt.1518

PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls

Joel Rozowsky1, Ghia Euskirchen2, Raymond K Auerbach3, Zhengdong D Zhang1, Theodore Gibson1, Robert Bjornson4, Nicholas Carriero4, Michael Snyder1,2 & Mark B Gerstein1,3,4


Abstract

Chromatin immunoprecipitation (ChIP) followed by tag sequencing (ChIP-seq) using high-throughput next-generation instrumentation is fast, replacing chromatin immunoprecipitation followed by genome tiling array analysis (ChIP-chip) as the preferred approach for mapping of sites of transcription-factor binding and chromatin modification. Using two deeply sequenced data sets for human RNA polymerase II and STAT1, each with matching input-DNA controls, we describe a general scoring approach to address unique challenges in ChIP-seq data analysis. Our approach is based on the observation that sites of potential binding are strongly correlated with signal peaks in the control, likely revealing features of open chromatin. We develop a two-pass strategy called PeakSeq to compensate for this. A two-pass strategy compensates for signal caused by open chromatin, as revealed by inclusion of the controls. The first pass identifies putative binding sites and compensates for genomic variation in the 'mappability' of sequences. The second pass filters out sites not significantly enriched compared to the normalized control, computing precise enrichments and significances. Our scoring procedure enables us to optimize experimental design by estimating the depth of sequencing required for a desired level of coverage and demonstrating that more than two replicates provides only a marginal gain in information.


Paper link: http://www.nature.com/nbt/journal/v27/n1/abs/nbt.1518.html Software website http://www.gersteinlab.org/proj/PeakSeq/