William Gilks - QC of NGS data using R

Quality-control analysis of next-generation sequencing data

Raw data is generated by the platforms SamTools, Picard, and GATK duing read-mapping of paired-end WGS sequencing data to reference genome. Code below is in R https://www.r-project.org/ Publication is at https://f1000research.com/articles/5-2644/v3

Package installation if not present

This approach allows users to speed-up many package installation steps. The packages used in this script are standard for data formatting, basic calculations and graphics in large data-sets. The package 'tidyverse' now encompasses the first three packages listed.

package.list = c('tidyr', 'dplyr', 'ggplot2', 'cowplot')

new.packages = package.list[!(package.list %in% installed.packages()[,"Package"])]

if (length(new.packages)) install.packages(new.packages)

rm(package.list, new.packages)

Define a common plot theme

I normaly use ggplot for generating plots, The coding is fluid and the options match large multi-dimensional data-sets. Unfortunately, the available plot themes are unsatisfactory, so I define my own:

theme_f1 <- function(){

  theme_bw(base_size = 9) +

    theme(

      plot.margin = unit(c(.25, .5, .25, .25), "cm"),

      axis.line.y = element_line(size = .3, colour = "black"),

      axis.line.x = element_line(size = .3, colour = "black"),

      legend.title = element_blank(),

      legend.background =  element_rect(fill = "white", size = 0.25, linetype = 1, colour = "grey"),

      legend.direction = "vertical",

      legend.position = c(.75, .2),

      legend.key = element_rect(colour = "white"),

      legend.key.size = unit(1.2, "line"),

      panel.grid = element_line(colour = "light grey", size = 0.1))}

attr(theme_f1(), "complete")

Define a custom function to load the different types of data formats.

The quality control data generated by of programs such as Picard, SamTools and GATK is well-organised in unix-format flat text files although there are differences in the number of header rows etc. Another minor challenge is that there is of course, a different file for each sample, which need to be combined for analysis. Note that all of the data is in 'long' format, i.e. that each value has its own row.

This custom function aims to load all the files for each sample and data type into a single R object. pat is the search term for the input file, subc is what to substitute in the text string when binding the file name as a column. 'skp' is how many of the header rows to skip. seper is how the gap between columns in the input data file are defined. e.g. tab or space.

The function loads all the files ending in the specifed prefix into a list object, and adds the name of the file as an extra column. It then binds all the components of the list into a single table, and edits the string variables to be practically readable. (Derived mostly from StackOverflow)

readem = function(pat, subc, skp, khead, seper) {

  a = lapply(list.files(pattern = pat, recursive = TRUE),

             function(x) cbind(run_name = gsub(subc, "", x),

                    read.table(x, skip = skp, header = khead, sep = seper)))

  b = do.call("rbind", a)

  rm(a)

  b$run_name = sub(".*/", "", b$run_name)

  b$run_name = gsub("1265-","", b$run_name)

  return(b)

Data Loading

Load per-sequence-quality data

Here, the value is Count, i.e. how many sequence reads that were in each Quality group for each sample and direction.

perseq = readem(pat = "*.fastq.perSeqQual.txt", subc = ".fastq.perSeqQual.txt", skp = 1, khead = TRUE, seper = "\t")

perseq_data = perseq %>%

  separate(run_name, into = c("fly", "adaptor", "lane", "direction", "thing"), sep = "_", remove = FALSE) %>%

  select(run_name, fly, direction, Quality, Count ) %>%

  arrange(fly)

Plot distribution of per sequence quality

squal.distr =

  ggplot(perseq_data, aes(Quality, Count, group = run_name, colour = direction)) +

    theme_f1() +

      geom_vline(xintercept = 20, linetype = 2, size = 0.25, colour = "dark grey") +

      geom_line(size = .1, alpha = .1) +

      scale_x_continuous(name = "Sequence read quality score", labels = seq(0,40,10)) +

      scale_y_log10(expand = c(0,0), limits = c(10, 1e+08), name = "Sequence read counts (log scale)",

                     breaks = scales::trans_breaks("log10", function(x) 10 ^ x),

                     labels = scales::trans_format("log10", scales::math_format(10 ^ .x))) +

  scale_colour_manual(values = c("black", "red"), labels = c("First of pair", "Second of pair")) +

  guides(colour = guide_legend(override.aes = list( size = .7, alpha = .8))) +

  annotation_logticks(sides = "l", size = 0.1) +

  annotate("text", angle = 90, x = 19, y = 9.5 ^ 7, label = "DISCARDED", size = 2) +

  annotate("text", angle = 90, x = 21, y = 9.5 ^ 7, label = "KEPT", size = 2)

Load per-base-quality data

perbase = readem("*.perBaseQual.txt", ".perBaseQual.txt", 1, TRUE, "\t")

Format Quality-per-base data

Remove awkward characters from position column, extract fly names & read direction

Convert very high and very low numbers to human readable, for graphing.

perbase_data = perbase %>%

  mutate(start = as.numeric(gsub("-.*","", perbase$Base))) %>%

  separate(run_name, into = c("fly", "adaptor", "lane", "direction", "thing"), sep = "_", remove = FALSE) %>%

  select(run_name, fly, direction, start, Mean)

bcd$milcount = bcd$Count / 1.0e+06

er$thouErr = er$errorrate * 1.0e+03

Plot quality per base

bqual.distr =

  ggplot(perbase_data, aes(start, Mean, group = run_name, colour = direction)) +

  theme_f1() +

  geom_line(size = .1, alpha = .1) +

  scale_y_continuous( name = "Nucleotide base quality\n(Phred-scaled, max=40)", expand = c(0,0), limits = c(15,41)) +

  scale_x_continuous( name = "Read position", expand = c(0,0), limits = c(0,100)) +

  scale_colour_manual(values = c("black", "red"),

                      labels = c("First of pair", "Second of pair")) +

  guides(colour = guide_legend(override.aes = list( size = .7, alpha = .8)))

Load 'base coverage distribution' data

This is the number of bases in the entire genome which are covered at each depth level. I've calculated this in millions. Also note that for calculating the mean or median coverage, it's probably best to omit the zero group.

bcd = readem("*.BasCovDis.txt", ".BasCovDis.txt", 3, TRUE, "")

Calculate modal depth for each fly

omitted genomic regions with zero (as zero regions are the same for all samples an thus uninformative basically)

max.covs = bcd %>%

  filter(Coverage != 0) %>%

  group_by(run_name) %>%

  slice(which.max(milcount))

cov.maxes = data.frame(max.covs$run_name, max.covs$Coverage)

cov.maxes$measure = "Modal depth"

mean.cov = median(cov.maxes$max.covs.Coverage)

iqr.cov = IQR(cov.maxes$max.covs.Coverage)

Plot sequencing coverage distribution for each sample, and make a marginal boxplot

Although this is essentially a histogram, the data have been pre-calculated and so can be plotted as a line graph, with one series per sample. The area under the graph is equivalent to the total number of reads per sample.

cov.distr =

  ggplot(bcd, aes(Coverage, milcount, group = run_name, colour = run_name)) +

  theme_f1() +

  geom_line(lwd = .1, alpha = .8) +

  coord_cartesian(xlim = c(0,110), ylim = c(0,9), expand = FALSE) +

  scale_x_continuous(name = "Depth of coverage bin", breaks = seq(0,110,10)) +

  scale_y_continuous(name = expression("Bases covered," %*% 10 ^ {6}), breaks = seq(0,10,2)) +

  theme(legend.position = "none",

        axis.title.x = element_text(colour = "white"),

        axis.text.x = element_text(colour = "white"))

cov.box =

  ggplot(cov.maxes, aes(measure, max.covs.Coverage)) +

  stat_boxplot(geom = 'errorbar', size = .4, colour = "grey5", width = 0.667) +

  geom_boxplot(outlier.size = 0, outlier.colour = "white", size = .3, colour = "grey5", fill = "white") +

  geom_jitter(width = 0.7, shape = 17, size = 1.5, aes(measure, max.covs.Coverage, colour = max.covs.run_name)) +

  coord_flip() +

  theme_f1() +

  scale_y_continuous(name = "Modal depth of coverage per fly", expand = c(0,0), limits = c(0,110), breaks = seq(0,110,10), labels = seq(0,110,10)) +

  scale_x_discrete(name = "dummy \n dummy", breaks = NULL) +

  theme(legend.position = "none",

          panel.grid.major.x = element_line(colour = "grey90", size = .5),

          panel.grid.major.y = element_line(colour = "white"),

          panel.grid.minor = element_line(size = 0),

          axis.title.y = element_text(colour = "white"),

          plot.margin = unit(c(0, .5, .25, .25), "cm"))

Load error rate (mis-match) data

Differences in the sequenced sample from the reference sequence. Amazingly the sequencing reads can be traced to which PCR cycle they were generated in. Gives an indication of real errors.

er = readem("*.error_rates.gatkreport.txt", ".error_rates.gatkreport.txt", 3, TRUE, "")

Calculate mean mis-match rate per sample

First across LHM hemiclone samples, then across the reference strain samples.

lhm.er = er %>% filter(run_name != "RGfi", run_name != "RGil") %>%

  group_by(run_name) %>%

  summarise(med = median(errorrate), iqr = IQR(errorrate)) %>%

  summarise(mean.er = mean(med), mean.iqr = mean(iqr))

rg.er = er %>% filter(run_name == "RGil" | run_name == "RGfi") %>%

  group_by(run_name) %>%

  summarise(med = median(errorrate))

Plot mis-match rates per PCR cycle

Mis-matches are differences in the sequenced genome from the reference geneome. They can be real biological differences or introduced by error in the sequencing chemistry. The data generated by SamTools on the sequence alignment files (bam) is therefore useful for detecting problems with certain samples.

err.rat =

  ggplot(er, aes(cycle, thouErr)) +

  theme_f1() +

  geom_line(aes(group = run_name, colour = run_name), lwd = .1, alpha = 1) +

  coord_cartesian(xlim = c(0, 100), ylim = c(0, 10)) +

  scale_y_continuous(expand = c(0, 0), name = expression("Mismatches to reference genome," %*% 10 ^ {-3})) +

  scale_x_continuous(expand = c(0, 0), name = "PCR cycle number", breaks = seq(0, 100, 10)) +

  theme(legend.position = "none")

Combine plots into one figure

Here, I'm forced to use the draw_plot() function which requires inelegant manual specification of plot co-ordinates. This is only really necessitated because I'm using a custom marginal boxplot. If a single plotting function is used for this, then it's probably possible to plot all four plots using glot_grid.

fig1 = ggdraw() +

  draw_plot(squal.distr, x = 0, y = .5, width = 0.5, height = .5) +

  draw_plot(cov.distr, 0, .145, 0.5, .355) +

  draw_plot(cov.box, 0, 0, 0.5, .19) +

  draw_plot(bqual.distr, 0.5, 0.5, 0.5, 0.5) +

  draw_plot(err.rat, 0.5, 0, .5, .5) +

  draw_plot_label(c("A", "C", " ", "B", "D"), x = c(0,0,0, 0.5, 0.5), y = c(1, .51, .21, 1, .51), size = 10)

save_plot("fig.seqQual.pdf", fig1, base_aspect_ratio = 1.5, base_height = 6)

Input data generated from code here https://zenodo.org/record/809328

Log files and further explaination for the published data available here https://zenodo.org/record/159282

Publication is at https://f1000research.com/articles/5-2644/v3

25th March 2018

Google Sites

Report abuse