Raw data is generated by the platforms SamTools, Picard, and GATK duing read-mapping of paired-end WGS sequencing data to reference genome. Code below is in R https://www.r-project.org/ Publication is at https://f1000research.com/articles/5-2644/v3
This approach allows users to speed-up many package installation steps. The packages used in this script are standard for data formatting, basic calculations and graphics in large data-sets. The package 'tidyverse' now encompasses the first three packages listed.
package.list = c('tidyr', 'dplyr', 'ggplot2', 'cowplot')
new.packages = package.list[!(package.list %in% installed.packages()[,"Package"])]
if (length(new.packages)) install.packages(new.packages)
rm(package.list, new.packages)
I normaly use ggplot for generating plots, The coding is fluid and the options match large multi-dimensional data-sets. Unfortunately, the available plot themes are unsatisfactory, so I define my own:
theme_f1 <- function(){
theme_bw(base_size = 9) +
theme(
plot.margin = unit(c(.25, .5, .25, .25), "cm"),
axis.line.y = element_line(size = .3, colour = "black"),
axis.line.x = element_line(size = .3, colour = "black"),
legend.title = element_blank(),
legend.background = element_rect(fill = "white", size = 0.25, linetype = 1, colour = "grey"),
legend.direction = "vertical",
legend.position = c(.75, .2),
legend.key = element_rect(colour = "white"),
legend.key.size = unit(1.2, "line"),
panel.grid = element_line(colour = "light grey", size = 0.1))}
attr(theme_f1(), "complete")
The quality control data generated by of programs such as Picard, SamTools and GATK is well-organised in unix-format flat text files although there are differences in the number of header rows etc. Another minor challenge is that there is of course, a different file for each sample, which need to be combined for analysis. Note that all of the data is in 'long' format, i.e. that each value has its own row.
This custom function aims to load all the files for each sample and data type into a single R object. pat is the search term for the input file, subc is what to substitute in the text string when binding the file name as a column. 'skp' is how many of the header rows to skip. seper is how the gap between columns in the input data file are defined. e.g. tab or space.
The function loads all the files ending in the specifed prefix into a list object, and adds the name of the file as an extra column. It then binds all the components of the list into a single table, and edits the string variables to be practically readable. (Derived mostly from StackOverflow)
readem = function(pat, subc, skp, khead, seper) {
a = lapply(list.files(pattern = pat, recursive = TRUE),
function(x) cbind(run_name = gsub(subc, "", x),
read.table(x, skip = skp, header = khead, sep = seper)))
b = do.call("rbind", a)
rm(a)
b$run_name = sub(".*/", "", b$run_name)
b$run_name = gsub("1265-","", b$run_name)
return(b)
}
Here, the value is Count, i.e. how many sequence reads that were in each Quality group for each sample and direction.
perseq = readem(pat = "*.fastq.perSeqQual.txt", subc = ".fastq.perSeqQual.txt", skp = 1, khead = TRUE, seper = "\t")
perseq_data = perseq %>%
separate(run_name, into = c("fly", "adaptor", "lane", "direction", "thing"), sep = "_", remove = FALSE) %>%
select(run_name, fly, direction, Quality, Count ) %>%
arrange(fly)
squal.distr =
ggplot(perseq_data, aes(Quality, Count, group = run_name, colour = direction)) +
theme_f1() +
geom_vline(xintercept = 20, linetype = 2, size = 0.25, colour = "dark grey") +
geom_line(size = .1, alpha = .1) +
scale_x_continuous(name = "Sequence read quality score", labels = seq(0,40,10)) +
scale_y_log10(expand = c(0,0), limits = c(10, 1e+08), name = "Sequence read counts (log scale)",
breaks = scales::trans_breaks("log10", function(x) 10 ^ x),
labels = scales::trans_format("log10", scales::math_format(10 ^ .x))) +
scale_colour_manual(values = c("black", "red"), labels = c("First of pair", "Second of pair")) +
guides(colour = guide_legend(override.aes = list( size = .7, alpha = .8))) +
annotation_logticks(sides = "l", size = 0.1) +
annotate("text", angle = 90, x = 19, y = 9.5 ^ 7, label = "DISCARDED", size = 2) +
annotate("text", angle = 90, x = 21, y = 9.5 ^ 7, label = "KEPT", size = 2)
perbase = readem("*.perBaseQual.txt", ".perBaseQual.txt", 1, TRUE, "\t")
Remove awkward characters from position column, extract fly names & read direction
Convert very high and very low numbers to human readable, for graphing.
perbase_data = perbase %>%
mutate(start = as.numeric(gsub("-.*","", perbase$Base))) %>%
separate(run_name, into = c("fly", "adaptor", "lane", "direction", "thing"), sep = "_", remove = FALSE) %>%
select(run_name, fly, direction, start, Mean)
bcd$milcount = bcd$Count / 1.0e+06
er$thouErr = er$errorrate * 1.0e+03
bqual.distr =
ggplot(perbase_data, aes(start, Mean, group = run_name, colour = direction)) +
theme_f1() +
geom_line(size = .1, alpha = .1) +
scale_y_continuous( name = "Nucleotide base quality\n(Phred-scaled, max=40)", expand = c(0,0), limits = c(15,41)) +
scale_x_continuous( name = "Read position", expand = c(0,0), limits = c(0,100)) +
scale_colour_manual(values = c("black", "red"),
labels = c("First of pair", "Second of pair")) +
guides(colour = guide_legend(override.aes = list( size = .7, alpha = .8)))
This is the number of bases in the entire genome which are covered at each depth level. I've calculated this in millions. Also note that for calculating the mean or median coverage, it's probably best to omit the zero group.
bcd = readem("*.BasCovDis.txt", ".BasCovDis.txt", 3, TRUE, "")
omitted genomic regions with zero (as zero regions are the same for all samples an thus uninformative basically)
max.covs = bcd %>%
filter(Coverage != 0) %>%
group_by(run_name) %>%
slice(which.max(milcount))
cov.maxes = data.frame(max.covs$run_name, max.covs$Coverage)
cov.maxes$measure = "Modal depth"
mean.cov = median(cov.maxes$max.covs.Coverage)
iqr.cov = IQR(cov.maxes$max.covs.Coverage)
Although this is essentially a histogram, the data have been pre-calculated and so can be plotted as a line graph, with one series per sample. The area under the graph is equivalent to the total number of reads per sample.
cov.distr =
ggplot(bcd, aes(Coverage, milcount, group = run_name, colour = run_name)) +
theme_f1() +
geom_line(lwd = .1, alpha = .8) +
coord_cartesian(xlim = c(0,110), ylim = c(0,9), expand = FALSE) +
scale_x_continuous(name = "Depth of coverage bin", breaks = seq(0,110,10)) +
scale_y_continuous(name = expression("Bases covered," %*% 10 ^ {6}), breaks = seq(0,10,2)) +
theme(legend.position = "none",
axis.title.x = element_text(colour = "white"),
axis.text.x = element_text(colour = "white"))
cov.box =
ggplot(cov.maxes, aes(measure, max.covs.Coverage)) +
stat_boxplot(geom = 'errorbar', size = .4, colour = "grey5", width = 0.667) +
geom_boxplot(outlier.size = 0, outlier.colour = "white", size = .3, colour = "grey5", fill = "white") +
geom_jitter(width = 0.7, shape = 17, size = 1.5, aes(measure, max.covs.Coverage, colour = max.covs.run_name)) +
coord_flip() +
theme_f1() +
scale_y_continuous(name = "Modal depth of coverage per fly", expand = c(0,0), limits = c(0,110), breaks = seq(0,110,10), labels = seq(0,110,10)) +
scale_x_discrete(name = "dummy \n dummy", breaks = NULL) +
theme(legend.position = "none",
panel.grid.major.x = element_line(colour = "grey90", size = .5),
panel.grid.major.y = element_line(colour = "white"),
panel.grid.minor = element_line(size = 0),
axis.title.y = element_text(colour = "white"),
plot.margin = unit(c(0, .5, .25, .25), "cm"))
Differences in the sequenced sample from the reference sequence. Amazingly the sequencing reads can be traced to which PCR cycle they were generated in. Gives an indication of real errors.
er = readem("*.error_rates.gatkreport.txt", ".error_rates.gatkreport.txt", 3, TRUE, "")
First across LHM hemiclone samples, then across the reference strain samples.
lhm.er = er %>% filter(run_name != "RGfi", run_name != "RGil") %>%
group_by(run_name) %>%
summarise(med = median(errorrate), iqr = IQR(errorrate)) %>%
summarise(mean.er = mean(med), mean.iqr = mean(iqr))
rg.er = er %>% filter(run_name == "RGil" | run_name == "RGfi") %>%
group_by(run_name) %>%
summarise(med = median(errorrate))
Mis-matches are differences in the sequenced genome from the reference geneome. They can be real biological differences or introduced by error in the sequencing chemistry. The data generated by SamTools on the sequence alignment files (bam) is therefore useful for detecting problems with certain samples.
err.rat =
ggplot(er, aes(cycle, thouErr)) +
theme_f1() +
geom_line(aes(group = run_name, colour = run_name), lwd = .1, alpha = 1) +
coord_cartesian(xlim = c(0, 100), ylim = c(0, 10)) +
scale_y_continuous(expand = c(0, 0), name = expression("Mismatches to reference genome," %*% 10 ^ {-3})) +
scale_x_continuous(expand = c(0, 0), name = "PCR cycle number", breaks = seq(0, 100, 10)) +
theme(legend.position = "none")
Here, I'm forced to use the draw_plot() function which requires inelegant manual specification of plot co-ordinates. This is only really necessitated because I'm using a custom marginal boxplot. If a single plotting function is used for this, then it's probably possible to plot all four plots using glot_grid.
fig1 = ggdraw() +
draw_plot(squal.distr, x = 0, y = .5, width = 0.5, height = .5) +
draw_plot(cov.distr, 0, .145, 0.5, .355) +
draw_plot(cov.box, 0, 0, 0.5, .19) +
draw_plot(bqual.distr, 0.5, 0.5, 0.5, 0.5) +
draw_plot(err.rat, 0.5, 0, .5, .5) +
draw_plot_label(c("A", "C", " ", "B", "D"), x = c(0,0,0, 0.5, 0.5), y = c(1, .51, .21, 1, .51), size = 10)
save_plot("fig.seqQual.pdf", fig1, base_aspect_ratio = 1.5, base_height = 6)
Input data generated from code here https://zenodo.org/record/809328
Log files and further explaination for the published data available here https://zenodo.org/record/159282
Publication is at https://f1000research.com/articles/5-2644/v3
25th March 2018