SPP for Chip-seq

SPP is a R package especially designed for the analysis of Chip-Seq data from Illummina platform. The package was developed by Peter Park's group from Harvard Medical School. Here is the the original nature paper:
http://www.nature.com/nbt/journal/v26/n12/full/nbt.1508.html

Here is the original tutorial created by Peter Kharchenk from the same group:
http://compbio.med.harvard.edu/Supplements/ChIP-seq/tutorial.html

Here is an "introduction to Chip-seq Analysis using the SPP Package" created by Stefan Bekiranov of University of Virginia:
http://prometheus.cshl.org/twiki/bin/viewfile/Main/LabStuff?rev=1;filename=CSHL_SB_SPP.pdf

Based paper and tutorials above, I have created a guide to use the package. Please notice: part of this guild was from the material mentioned above.


We will calculate smoothed read enrichment profile and identify Pol II site sites that significantly enriched compared to input control. IGB can be used to browse the data. We will also use the biomaRt package to download annotation and identify genes within 2kb of the transcription start site (TSS).

Note: You can copy and paste all the text in blue to your Linux command line to run. Anything with "#" is comment, and will be IGNORED by Linux.


1) Run aligner "Bowtie" on CHIP and INPUT (other aligners, such as eland, MAP and Arachne, can also be used.)

Note: you should make sure that ensembl (see below) has the same version of reference as you used in bowtie aligner.


# Because some scripts need large memory, also some of them can run multiple instances in parallel to speed up the analysis, I prefer to use the smp nodes.
# Please you should not use the head nodes to run these analysis
interact -n 2 -t 2:00:00  -m 10g -q timeshare -X   

# the -X parameter will allow linux to forward graphic output from R

# Create a folder to work within. Notice that files in the scratch folder more than 4 weeks old will be automatically deleted.
mkdir
scratch/spp-test &&  cd scratch/spp-test

# load related modules
module load bowtie/0.12.9 samtools/0.1.18 R/2.15.2

# Next, I will run bowtie aligner and align my CHIP sample lane (lane 1) and input DNA lane (lane 2) to fruit fly reference genome.
# It is good idea to start from only one lane for the sample, and one lane for the control. We will do other lanes after getting good results from the first two lanes.

# Run the bowtie aligner, 2 CPU, print time, not allow mismatch, find unique match, first base marked as 1.
# For the full list of command line parameters, use command: bowtie --help
read_file=
/gpfs/data/shared/biomed/example_fastq/rna_seq/s1_r1.fastq
bowtie dmel5 -p 2  -t -v 1 -m 1 -B 1 $read_file  -S s1.sam
samtools view -bS s1.sam > s1.bam 

read_file=/gpfs/data/shared/biomed/example_fastq/rna_seq/s2_r1.fastq
bowtie dmel5 -p 2  -t -v 1 -m 1 -B 1 $read_file  -S s2.sam
samtools view -bS s2.sam > s2.bam 

2) Start R and load R libraries.

# Start R
R

# After R starts, load R libraries
library(spp)
library(biomaRt)


3) Read in bowtie alignments

chip.data <- read.bam.tags("s1.bam")
input.data <- read.bam.tags("s2.bam")

# Use this command to save the R objects into a RData file, so that you can load them back into R later on using command: load("8.11.RData")
save.image(file="
8.11.RData")

4) Identify binding characteristics

# This command will uses cross-correlation profile to calculate binding peak separation distance,
# and assess whether inclusion of tags with non-perfect alignment quality improves the cross-correlation peak.
# If you would like to accept all aligned tags, specify accept.all.tags=T argument to save time.

# srange gives the possible range for the size of the protected region
# It should be higher than tag length; making the upper boundary too high will increase calculation time
# bin - bin tags within the specified number of base pairs to speed up calculation
# increasing bin size decreases the accuracy of the determined parameters
binding.characteristics <- get.binding.characteristics(chip.data,srange=c(50,500),bin=5, accept.all.tags = T)

# Print out binding peak separation distance
print(paste("binding peak separation distance =",binding.characteristics$peak$x))

# Plot cross-correlation profile, you can use pdf reader to review the plot once it is created.
pdf(file="s_1.crosscorrelation.pdf",width=5,height=5)
par(mar = c(3.5,3.5,1.0,0.5), mgp = c(2,0.65,0), cex = 0.8)
plot(binding.characteristics$cross.correlation,type='l',xlab="strand shift",ylab="cross-correlation")
abline(v=binding.characteristics$peak$x,lty=2,col=2)
dev.off()


save.image(file="8.11-0.RData")    # you can use command load("8.11-0.RData") to load all the objects into R later on.

5) select informative tags based on the binding characteristics

# The following calls will select tags with acceptable alignment quality:
# The chip.data and input.data will be a simple list of tag coordinate vectors. The analysis of binding characteristics described above could have been skipped by accepting
# all of the tag alignments after reading the Eland files (i.e. chip.data.info <- chip.data$tags).
# Since version 1.2, the steps can also be skipped by simply not passing the binding.characteristics argument to the select.informative.tags call.
chip.data.info <- select.informative.tags(chip.data,binding.characteristics)
input.data.info <- select.informative.tags(input.data,binding.characteristics)


save.image(file="8.11-1.RData")   # you can use command load("8.11-1.RData") to load all the objects into R later on.   

6) remove singular positions with very high tag counts

# The method calls below will scan along the chromosomes calculating local density of region (can be specified using window.size parameter, default is 200bp), removing or restricting singular positions with extremely high tag count relative to the neighborhood.
# restrict or remove singular positions with very high tag counts
chip.data.qua <- remove.local.tag.anomalies(chip.data.info)
input.data.qua <- remove.local.tag.anomalies(input.data.info)


save.image(file="8.11-2.RData")  # you can use command load("8.11-2.RData") to load all the objects into R later on.

7) Calculate genome-wide tag density and tag enrichment/depletion profiles

# This will calculate genome-wide smoothed tag density (subtracting re-scaled input).
# Note that the tags are shifted by half of the peak separation distance
tag.shift <- round(binding.characteristics$peak$x/2)
smoothed.density <- get.smoothed.tag.density(chip.data.qua,control.tags=input.data.qua,bandwidth=200,step=100,tag.shift=tag.shift)
writewig(smoothed.density,"s_1.smoothed.density.wig","s_1 smoothed, background-subtracted tag density")

# The WIG file can be read with genome browsers, such as IGB.

# The following calls will scan ChIP and signal tag density to estimate lower bounds of tag enrichment (and upper bound of tag depletion if it is significant) along the genome.
# The resulting profile gives conservative statistical estimates of log2 fold-enrichment ratios along the genome.
# The example below uses a window of 500bp (and background windows of 1, 5, 25 and 50 times that size) and a confidence interval corresponding to 1%.
# alpha specifies significance level at which confidence intervals will be estimated
enrichment.estimates <- get.conservative.fold.enrichment.profile(chip.data.qua,input.data.qua,fws=500,step=100,alpha=0.01)
writewig(enrichment.estimates,"s_1.enrichment.estimates.wig","s_1 conservative fold-enrichment/depletion estimates shown on log2 scale")

# The WIG file can be read with genome browsers, such as IGB.

# Broad regions of enrichment for a specified scale can be quickly identified using the following command:
broad.clusters <- get.broad.enrichment.clusters(chip.data.qua, input.data.qua, window.size=1e3, z.thr=3,tag.shift=round(binding.characteristics$peak$x/2))

# Write out in broadPeak format
write.broadpeak.info(broad.clusters,"s_1.broadPeak")

# We'll also write a bed file that contains broad regions of enrichment on chromosome X for browsing in IGB.
write.table(cbind(rep("X", length(broad.clusters$X$s)), broad.clusters$X$s, broad.clusters$X$e), file="s_1_enrich_broad_chrX.bed", quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")

8) Detecting point binding positions

# The example below will use WTD method to call binding positions, using FDR of 5% and a window size estimated by the binding.characteristics:

# binding detection parameters
# desired FDR (5%). Alternatively, an E-value can be supplied to the method calls below instead of the fdr parameter
fdr <- 5e-2
# The binding.characteristics contains the optimized half-size for binding detection window
detection.window.halfsize <- binding.characteristics$whs;
 
# Determine binding positions using wtd method
bp <- find.binding.positions(signal.data=chip.data.qua,control.data=input.data.qua,fdr=fdr,whs=detection.window.halfsize)

print(paste("detected",sum(unlist(lapply(bp$npl,function(d) length(d$x)))),"peaks"))
 
# Output detected binding positions
output.binding.results(bp,"s_1.binding.positions.txt");

# We'll also write a graph file (similar to wig) that contains the binding sites and their associated 10log10(FDR) for browsing in IGB.
write.table(cbind(bp$npl$X$x, -10*log(bp$npl$X$fdr)), file="binding_sites_fdr_chrX.gr", quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")

# Upload this bed file into IGB by going to File ! Open and browse the results.

#  Note: the high-lighted part could be ignored if you don't want to try the 'MCT' method. For detail about the method, please check the original paper

# Alternatively, the binding positions can be determined using MTC method (referred here as lwcc):

bp = find.binding.positions(signal.data=chip.data.qua, control.data=input.data.qua, fdr=fdr, method=tag.lwcc, whs=detection.window.halfsize)
# By default, the methods exclude large (> 104bp) regions that exhibit significantly higher input tag density than expected.
# To prevent such exclusion use tec.filter=F option.

save.image(file="8.11-3.RData") # can use load("8.11-3.RData") to load all the objects into R later on.

# Broader regions of enrichment associated with the determined peaks can be determined using the following:
bp = add.broad.peak.regions(chip.data.qua, input.data.qua, bp, window.size=1000,z.thr=3)

# Output using narrowPeak format
write.narrowpeak.binding(bp,"s_1.narrowPeak")

# We'll also write a graph file (similar to wig) that contains the binding sites and their associated 10log10(FDR) as well as
# The start and end positions for the broad enriched regions in bed format for browsing in IGB
write.table(cbind(bp$npl$X$x,-10*log(bp$npl$X$fdr)), file="s_1_MTC_binding_sites_fdr_X.gr", quote=FALSE, row.names=FALSE, col.names=FALSE, sep="nt")
write.table(cbind(rep("X", length(rm.na(bp$npl$X$rs))), rm.na(bp$npl$X$rs), rm.na(bp$npl$X$re)), file="s_1_enrich_broad_X.bed", quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")

# Upload these files into IGB by going to File -> Open and browse the results.

save.image(file="8.11-4.RData")    # you can use command load("8.11-4.RData") to load all the objects into R later on.

9) Comparing Binding Sites to Annotations Using the biomaRt package

Note: you should make sure that ensembl has the same version of reference as you used in bowtie aligner.


# We will download the ENSEMBLE fruit fly genome annotations and generate a list of ENSEMBLE gene information on chromosome X including start position, end position, strand and description
ensembl = useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
genes.chrX = getBM(attributes = c("chromosome_name", "start_position", "end_position", "strand", "description"), filters = "chromosome_name", values= "X", mart = ensembl)


# Next, we're going to take our binding sites from the bp list and use it to determine the set of genes that contain significantly enriched Pol II within 2kb of their TSS.

# In order to compare PolII sites to TSS sites, we need to write an overlap function where bs represents a binding site position, ts is the annotated TSS and l is the allowed distance of the binding site from the TSS.
overlap = function(bs, ts, l)
{
    if ((bs > ts - l) && (bs < ts + l)) {
        TRUE;
    } else {
        FALSE;
    }
}

# Now we'll write a function that takes a vector of binding site values, start positions, end positions and strands of the genes on chromosome X
# as well as our distance cutoff. l and outputs a logical vector of the genes that contain a Pol II site within l bp (i.e., TRUE value)
# or do not contain a Pol II site (i.e., FALSE value).
fivePrimeGenes = function(bs, ts, te, s, l) {
    fivePrimeVec = logical();
    for (i in 1:length(ts)) {
            fivePrime = FALSE;
            for (j in 1:length(bs)) {
                if (s[i] == 1) {
                    fivePrime = fivePrime || overlap(bs[j], ts[i], l);
                } else {
                    fivePrime = fivePrime || overlap(bs[j], te[i], l);
                }
             }
            fivePrimeVec = c(fivePrimeVec, fivePrime);
    }
     fivePrimeVec;
}

# Using the fivePrimeGenes function, generate a vector of the TSSs and genes that contain Pol II within .2kb of their TSS (i.e., l = 2000).
fivePrimeGenesLogical = fivePrimeGenes(bp$npl$X$x, genes.chrX$start_position, genes.chrX$end_position, genes.chrX$strand, 2000)

# Find the gene located on the plus strand
fivePrimeStartsPlus = genes.chrX$start_position[fivePrimeGenesLogical & genes.chrX$strand == 1]

# Find the gene located on the minus strand
fivePrimeStartsMinus = genes.chrX$end_position[fivePrimeGenesLogical & genes.chrX$strand == -1]

# Combine the start positions together
fivePrimeStarts = sort(c(fivePrimeStartsPlus, fivePrimeStartsMinus))

# Get all the gene names
fivePrimeGenes = genes.chrX$description[fivePrimeGenesLogical]

save.image(file="8.11-5.RData")     # you can use command load("8.11-5.RData") to load all the objects into R later on.

# Explore the vectors fivePrimeStarts and fivePrimeGenes.
# Use IGB to check if the genes/starts are the correct ones.
# You can use annotations in the biomaRt package to access other tools including Bioconductor to
# learn more about the set of genes that contain Pol II or whatever regulatory factor you're studying. Have fun!