Part IIIa: Analysis of conserved regions and putative regulatory elements
Coding regions have been extensively analyzed for positive, neutral, or relaxing selection, primarily through the study of nonsynonymous vs synonymous nucleotide substitutions (omega = dN/dS) rates. Non-coding regions, on the other hand, have often been neglected and not thoroughly explored. Whole genome alignment (WGA) provides a unique opportunity to investigate these genomic regions in various ways. In this section, we will compute two metrics: the PhyloP CONACC and the identification of Conserved Elements (CEs) via PhastCons scores using a reference genome. These metrics aim to identify functional elements - candidate regulatory elements. Additionally, I am supplementing you with previously identified ATAC (Assay for Transposase-Accessible Chromatin) peaks, for the study of chromatin accessibility, which in turn provides insights into the regulatory landscape of the genome.
The PhyloP CONACC metric employs the likelihood ratio test (LRT) method to calculate conservation scores for each site in the alignment. The scores are then outputted in the fixed-step wiggle format. Positive values (log p-values or scores) indicate conservation, while negative values indicate acceleration. On the other hand, PhastCons scores are computed using a hidden Markov model-based method that estimates the probability of each nucleotide belonging to a conserved element (CE), based on the multiple alignments. This method takes into account not only the alignment of each individual nucleotide but also its flanking regions. Therefore, the PhastCons score represents the posterior probability of purifying selection and ranges between 0 and 1.
Both metrics rely on the nucleotide substitution model, which needs to be calculated. The first step is to estimate a neutral model. One way to compote this is considering the 4-fold degenerate sites. The genetic code is redundant, meaning multiple codons can code for the same amino acid. Consequently, certain changes in the DNA sequence do not alter the amino acid sequence of the protein. For example, changes in the first or second position of a codon may result in a different amino acid being incorporated, while changes in the third position may not have any impact. Specifically, changes at the third position of a codon can lead to "synonymous" or "silent" mutations that do not affect the protein's amino acid sequence. Since four different codons can code for the same amino acid at these four-fold degenerate sites, mutations at these positions are more likely to be neutral or silent and less likely to have a functional impact on the protein. Due to their neutrality, we can utilize these sites to estimate the rate of neutral mutations.
I provided a gene annotation of H. melpomene (Hmel.Chr2.annotation.gff3) in gff3 format our reference species, to extract at first coding regions and subsequently the 4-fold degenerative sites using hal4dExtract from hal tools.
> Get the coding regions (hint: use a combination of gffread -h and grep -P)
Solution >
gffread --no-pseudo -C -T -o Hmel.CDS.gtf Hmel.Chr2.annotation.gff3
--
grep -P "\tCDS\t" Hmel.CDS.gtf > Hmel_filt.CDS.gtf
--
> Convert to GenePred format (hint: use gtfToGenePred)
Solution >
gtfToGenePred Hmel_filt.CDS.gtf Hmel.CDS.gp
> And finally, to bed format (hint: use genePredToBed)
Solution >
genePredToBed Hmel.CDS.gp Hmel.CDS.bed
> Extract the coordinates of the 4-fold degenerative sites with hal4dExtract
Solution >
hal4dExtract HelicChr2.hal Hmel Hmel.CDS.bed Hmel.4d.bed
Once we extracted the coordinates of those regions, we have to extract the relative alignment in a converted MAF format using hal2maf. MAF is a format for storing multiple sequence alignments. It represents an alignment as a series of blocks, each of which contains one or more sequences and their alignment positions.
> Extract those regions removing duplicated and ancestral genomes and using only ortholog sequences.
Solution >
hal2maf --onlyOrthologs --noDupes --refGenome Hmel --refTargets Hmel.4d.bed HelicChr2.hal HelicChr2.4d.maf
> Now, using the phylogenetic tree and the extracted regions, build the neutral model using phyloFit. We can specify few options, such as the substitution model. There are a few, usually, REV and SSREV model works well enough.
(Hint: You can also try the python wrapper halPhyloPTrain.py which should be faster as parallelize the job).
Solution >
phyloFit --tree newick.tree --subst-mod SSREV --sym-freqs --out-root neutralModel.4d HelicChr2.4d.maf
--
halPhyloPTrain.py --substMod REV --noAncestors HelicChr2.hal Hmel Hmel.4d.bed neutralModel.mod --numProc 10
--
halPhyloPMP.py --numProc 12 HelicChr2.hal Hmel neutralModel.4d.mod Hmel.phyloP.wig
--
> Finally using the neutral model, we can run phyloP on the whole alignment converted in MAF format. We can select whichever reference genome; internal branches are also accepted. Here you can use H. melpomene or H. erato. Remember, you can compress the output converting the wig file into a big wig.
Solution >
hal2maf --refGenome Hmel --noAncestors --noDupes HelicChr2.hal HelicChr2.maf
--
(Alternative: cactus-hal2maf ./js HelicChr2.hal HelicChr2.maf --refGenome Hmel --chunkSize 100000 --noAncestors)
--
phyloP --mode CONACC --wig-scores --method LRT neutralModel.4d.mod HelicChr2.maf | sed 's/HelicChr2/Hmel202001o/' > Hmel.CONACC.phyloP.wig
--
wigToBigWig Hmel.CONACC.phyloP.wig Hmel.Chr2.fasta.fai Hmel.CONACC.phyloP.bw
In the same fashion we can also generate PhastCons scores and the relative CEs using phastCons. But before that, instead of a neutral model phastCons needs a model of the "conserved" state and another for "non-conserved" state. The model for conserved regions is optional. If it is not given, then this model is defined indirectly by "scaling" the model for non-conserved regions by a factor called rho (see the --estimate-rho option to phastCons). The basic idea of the program is to scan along the alignment for regions that are better "explained" by the conserved model than by the non-conserved model; such regions will be output as CEs, and the probability that each base is in such a region will be output as the conservation score for that base.
> Compute rho
Solution >
phastCons HelicChr2.maf neutralModel.4d.mod --estimate-rho chr2 --expected-length 12 --target-coverage 0.25 > HelicChr2.phastCons.log
> Use the two models obtained to generate wiggle file and a bed file for the CEs (hint: use the option --most-conserved).
Solution >
phastCons --score --most-conserved HelicChr2.mostcons.bed HelicChr2.maf chr2.cons.mod,chr2.noncons.mod | sed 's/HelicChr2/Hmel202001o/' > Chr2.scores.wig
--
sed -i.BK 's/HelicChr2/Hmel202001o/' HelicChr2.mostcons.bed
--
wigToBigWig Chr2.scores.wig Hmel.Chr2.fasta.fai Chr2.scores.bw
--
> These regions can be very close to each other, sometimes the distance is of just a few nucleotides. You can choose to merge these regions, let's say 5nt apart. Since the format of PhastCons is a bed file check if bedtools can be of any help.
> How many regions did you merge?
Solution >
bedtools merge -o first,mean -c 4,5 -d 5 -i HelicChr2.mostcons.bed > HelicChr2.5b.Merged.mostcons.bed
Now load all the data on IGV and have a look. Do you see any pattern? Any region where you see lesser CEs in H. melpomene?
Part IIIc: ATAC peak enrichment
If you examine the chromosome closely, you will notice that there is a significant amount of overlap between the conserved elements (CEs), phastCons scores, and PhyloP scores with coding regions. This is expected since coding exons are typically the most conserved regions of the genome. However, upon closer inspection, you will also find CEs within intergenic and intronic regions. In this particular case, these CEs span approximately 40 million years within the phylogenetic framework. It is important to note that most of these intergenic regions do not possess a valid open reading frame (ORF). Currently, it is unknown whether these regions represent functional elements such as enhancers or promoters. However, we can perform an assay for transposase-accessible chromatin (ATAC) to investigate whether there is an enrichment of chromatin accessibility with CEs. Indicating possible role as cis-regulatory elements (CREs). important "switches" or "control switches" that influence when and how genes are turned on or off.
I have provided you with two annotations for ATAC peaks: one for H. melpomene (melp_ros.MACS2.peaks.all.Chr2.counts) and another for H. erato demophoon (dem_hyd.MACS2.peaks.all.Chr2.counts). These annotations were generated by Dr. Steven Van Belleghem and include the start and end positions of all ATAC-seq peaks obtained through the use of Macs2. The data was derived from 5th instar heads and 5th instar, day 1, and day 2 pupal wings. The counts presented in the table represent the number of reads within those peaks (similar to gene expression counts in RNA-seq) for each sample/tissue. Please note that peaks are only included in the dataset if they were identified in at least two samples.
> Convert the ATAC peaks files in a valid bed file
Solution >
grep Hmel202001o melp_ros.MACS2.peaks.all.Chr2.counts > Hmel.Chr2.ATACpeaks.bed
> Get the CEs and ATAC peaks that sit in intergenic regions [or in non-Exonic regions]. In order to do that you need to break the problems into smaller ones (hint: use bedtools intersect).
Solution >
echo -e "Extract intergenic regions first"
grep -w gene Hmel.Chr2.annotation.gff3 | grep -v transcript | bedtools complement -i - -g Hmel.Chr2.fasta.fai > Hmel.IntergenicReg.bed
--
echo -e "Extract the ATAC peaks that are located in those regions"
bedtools intersect -u -wa -a Hmel.Chr2.ATACpeaks.bed -b Hmel.IntergenicReg.bed > Hmel.Chr2.ATACpeaks.InterGenic.bed
--
echo -e "Do the same for intergenic CEs"
bedtools intersect -u -wa -a HelicChr2.5b.Merged.mostcons.bed -b Hmel.IntergenicReg.bed > HelicChr2.5b.Merged.intergCEs.bed
> Now check how many intergenic ATAC peaks overlap with intergenic CEs.
Solution >
bedtools intersect -u -wb -a Hmel.Chr2.ATACpeaks.InterGenic.bed -b HelicChr2.5b.Merged.intergCEs.bed | wc -l
> What is their proportion?
Solution >
Of the ~1292 intergenic ATAC peaks ~871 overlap with CEs, the ~67%
Extra assignment
Is this overlap significant in any way? That is, there is a significant enrichment in CEs where the chromatin is opened?
How could you test if there is a significant enrichment in CEs?
Solution >
You could run a permutation test. You can reshuffle the ATAC peaks n times and check the expected overlap. That would give you the expected assumption that chromatin is randomly opened in intergenic regions. This is your null hypothesis (Ho). If your n is large enough, you could generate a distribution of the overlap and check where the observed overlap is placed. If the frequency of the observed enrichment is lower than the 5% or the randomly distributed ATAC peaks, your observation has a P-value lower than 0.05. You can also apply a binomial distribution test.
The bedtools have an implemented operator, shuffle, that randomly replaces the elements' coordinates of a bed file.
> Have a look at it and see if you can come up with a configuration that suits our needs.
> For each replica compute the overlap and store it in a file.
> Plot the results and what is the enrichment, if there is one.
Solution >
mkdir Permutation
for REP in $(seq 1 1000); do
echo -e "Rep: $REP"
bedtools shuffle -excl Hmel.GeneRegions.bed -chrom -chromFirst -noOverlapping -i Hmel.Chr2.ATACpeaks.InterGenic.bed \
-g Hmel.Chr2.fasta.fai | bedtools sort -i - > Permutation/Hmel.Chr2.ATACpeaks.ReShuf.$REP.bed
Nline=`wc -l Permutation/Hmel.Chr2.ATACpeaks.ReShuf.$REP.bed | cut -f1 -d' '`
bedtools intersect -u -wa -a Permutation/Hmel.Chr2.ATACpeaks.ReShuf.$REP.bed -b HelicChr2.5b.Merged.mostcons.bed | wc -l | awk '{ print "'"$REP"'""\t"$0/"'"$Nline"'"}' >> Overlaps.dat
done
Solution >
FracOvl <- read.table("Overlaps.dat", header=TRUE, sep="\t")
summary(FracOvl$FracOvl)
plot(density(FracOvl$FracOvl, adjust = 0.1))
lines(density(FracOvl$FracOvl, adjust = 2, add = TRUE), col = 'green')
x <- 871
n <- 1292
binom.test(x, n, p=0.5381)