Distinguishing Somatically Mutated Antibodies from 454 Pyrosequencing Error

Brandon DeKosky
Marcotte CH391L
Georgiou Lab
University of Texas
May 2011


This report aims to use an in silico model of the 454 DNA sequencer to differentiate somatic mutation from sequencing error in immune repertoire data.  A program was constructed to model error rates of 454 sequencing and simulate sequencing the variable region of a known antibody variable region.  Then, simulation results were compared to actual 454 data to distinguish somatic mutation from sequencing error in immune repertoires.  Somatically mutated antibodies have a higher probability of displaying high affinity or high avidity to a given antigen, so the ability to confidently determine which antibodies have undergone somatic mutation will help us invest resources effectively and decide which antibody sequences to express and characterize at the laboratory bench.  Differentiating between somatic mutation and sequencing error will also be immensely valuable as we investigate changes to the mammalian immune repertoire over time.


Our research group has developed a novel method of antibody discovery using high throughput gene sequencing.1  This technique has proved useful and robust in many cases, but expressing the obtained gene sequences in bacteria to check for binding is expensive and time consuming.  Somatic mutation is one marker of antibody affinity maturation, so the ability to effectively determine which sequences have true somatic mutations and which sequences have sequencing error could save us significant amounts of time and money in this process.  Figure 1 shows data from a recent experiment.  The data has been filtered for the highest read-count CDR3 and shows all corresponding CDR1 and CDR2 variants.  (Data is sorted in this manner because the CDR3 is the most important for antigen-antibody interaction.)  We believe that Figure 1 shows a highly somatically mutated antibody, as the CDR1 and CDR2 pairings are quite diverse.  The five variants in excess of 10% are likely accurate reads.  Likewise, the lower count sequences probably originate from sequencing error, but where exactly does the transition occur?  To what extent does the observed diversity come from sequencing error, and to what extent does it originate in true somatic mutation?  I have addressed this question by constructing an in silico model of the 454 sequencer to generate simulated data for a known monoclonal population and observe output sequence population characteristics.

Figure 1 Histogram plot of various CDR1 and CDR2 combinations within the highest read-count CDR3 from a recent immune repertoire study.  Each bar represents a single variant, expressed as percentage of total sequence counts for this CDR3.

Understanding the various types, frequencies, and causes of errors in the pyrosequencing process is critical to constructing an accurate model of the 454 system.  Random point mutation, or the insertion of an incorrect base, occurs in approximately 0.17% of sequenced bases.  Base addition is denoted by a flash of color recorded by a camera, and homopolymer bases insert in rapid succession.  Thus, homopolymer runs are manifest as a slightly more intense flash and a much higher percentage of homopolymer bases have insertion or deletion errors because of the inherent difficulty quantifying flash intensity.  Homopolymer insertion and deletions are estimated to occur in approximately 0.34% of the total reads in standard samples, concentrated mainly at homopolymer runs, and probability of insertion-deletion (in-del) error increases significantly at long runs (4-5 or more consecutive bases.)  These issues have been acknowledged by pyrosequencing manufacturers2 and empirical studies have established approximate error frequencies that we can use as targets for program training.3


A Perl script was written to model 454 sequencing reads and insert point and in-del mutations in a similar pattern and frequency to the 454 machine.  Figure 2 displays a summary flowchart of the program architecture, and the complete Perl script is available as supplemental data.  The following assumptions were made:

  • Error rates should be approximately 0.5% overall, with approximately two thirds of observed error generated by homopolymer insertions and deletions.
  • Single polymer error occurs only in the form of point mutations.
  • Homopolymer error occurs only in the form of insertions and deletions.
  • The probability of insertion and deletion error increases substantially with longer homopolymer runs.

Figure 2
  A flow chart of program structure.  Red arrows correspond to a point mutation error, while blue arrows correspond to insertion-deletion errors.

The results of a recent study (mouse bone marrow plasma cell variable region sequences) were used to compare simulated reads to actual data.  A fully functional sequence was selected at random from the highest read-count variant of each of the top four ranked CDR3 families.  For instance, a fully functional sequence was selected randomly from the 17% variant population in Figure 1.  Then, sequence reads were simulated in the 454 model for approximately the same number of read counts as the number of counts for the actual data (e.g. 500 artificial reads for the top ranked CDR3 family).  Model output was then processed identically to experimental data:  it was submitted to the ImMunoGeneTics System (http://www.imgt.org/) HighV-Quest for CDR1, -2, and -3 analysis, sorted into a delimited text file using an in-house Perl script, filtered for the top-ranked CDR3, and graphed into a histogram of CDR1 and CDR2 variant combinations within the top CDR3 family.  Thus, experimental 454 data and simulated 454 data were compared directly.


Results & Discussion

Final program output yielded approximately 0.52% overall mutation rates (min 0.47%, max 0.58%, n=4), with approximately 0.34% due to homopolymer in-del error.  The returned mutation rates corresponded well to empirical observations3 and the model output thus served as a useful tool to compare with experimental data.  Figures 3-6 diplay juxtaposed real and simulated data.  Additionally, Table 1 shows the number of reads for each of the four simulations, along with the percentage of simulated reads which returned the correct CDR3 amino acid sequence after IGMT data analysis.


Figure 3 Experimental data and simulation results from the top ranked CDR3 of Mouse 23 bone marrow plasma cells, expressed as percentage variant representation in this CDR3 population.


Figure 4 Experimental data and simulation results from the 2nd ranked CDR3 of Mouse 23 bone marrow plasma cells, expressed as percentage variant representation in this CDR3 population.

Figure 5 Experimental data and simulation results from the 3rd ranked CDR3 of Mouse 23 bone marrow plasma cells, expressed as percentage variant representation in this CDR3 population.


 Figure 6 Experimental data and simulation results from the 4th ranked CDR3 of Mouse 23 bone marrow plasma cells, expressed as percentage variant representation in this CDR3 population.


Read Count

CDR3 Correct

% CDR3 Correct

















Table 1  Simulation accuracy in identifying the correct CDR3 amino acid sequence


Simulation data show great divergence from experimental data in samples that we theorize to be highly somatically mutated antibodies (CDR3 ranks 1 and 3, Figures 3 and 5). The 454 model also replicated the output population characteristics of what we theorize to be a non-somatically mutated antibody (see Figure 6), which combines with observed error rates to confirm that the simulation is a reasonable model of the 454 sequencer.  After analyzing experimental data alongside modeling results, three of the four antibody sequences appear to have some degree of somatic mutation.  As high frequency CDR3 families are expected to be more somatically mutated, this high degree of observed somatic mutation is an intuitively expected result for the top 4 CDR3 families.  A curious pattern was also observed in CDR3 read accuracy:  Simulations 2-4 obtained the correct CDR3 amino acid sequence in approximately 80% of reads (or 20% of the time a mutation was inserted which altered the CDR3 sequence), however, Simulation 1 only returned the correct CDR3 amino acid sequence in 64% of reads.  Because the model incorporates errors preferentially at homopolymer runs, it can be concluded that the difference between Simulation1 and Simulations 2-4 originates in a high-probability error at a homopolymer run inside or close to Simulation 1's CDR3 region, which becomes translated to a variant CDR3 amino acid sequence in HighV-Quest analysis.  To observe this trend directly, Simulations 1 and 2 were reanalyzed according to CDR2 families and CDR2 accuracy was computed (Tables 2 and 3, Figure 7).


Read Count

CDR2 Correct

% CDR2 Correct









Table 2  Simulation accuracy in reading CDR2 epitopes




% expression













Table 3  The top three variants of Rank 1, after sorting for the correct CDR2 counts


Figure 6  Results from Simulations 1 and 2 after filtering for the highest ranked CDR2.

As expected, filtering Simulation 1 for correct CDR2 shows that a high-probability homopolymer insertion/deletion occurred that altered the CDR3 region (Table 3), thus causing a higher CDR3 error rate.  In fact, the top erroneous read sequence accounted for nearly 8% of total reads after CDR2 filtering!  Multiple sequence alignment of the correct Simulation 1 sequence with the top 3 erroneous CDR3’s showed that the top variant originates as an insertion error changing four consecutive thymines to five at base 265.  The next-highest erroneous sequence (4.4%) occurred from in-del errors causing changes from three to four consecutive guanidines at bases 188, 301 and 316 (see multiple alignment file in supplemental data).  The multiple sequence alignment demonstrated that while main sequences should stand out significantly, separating the lower-ranked somatic mutations from the high-probability insertion and deletion errors can be difficult.  One way to circumvent this issue would be to analyze potential DNA sequences for homopolymer runs before continuing synthetic gene synthesis and cloning into bacteria.  If a sequence has a read percentage greater than 10% and contains no homopolymer runs greater than 4, then gene synthesis and cloning may be pursued without doubts.  If an antibody sequence has a large number of consecutive bases or a low frequency count, it may be analyzed by sequence alignment with the top-ranked antibodies to search for high probability insertion and deletion errors.  Then, the correct antibody sequence can be reconstructed to a high probability by determining whether an observed sequence change was likely caused by somatic mutatios (e.g. point mutations, non-homopolymer insertions or short run insertions, addition of multiple heterogeneous bases) or sequencing error (in-del changes in homopolymer runs of 3-4 or greater).


The 454 sequencing model replicated the behavior of real 454 data with reasonable accuracy throughout simulation of monodisperse antibody sequences.  This analysis revealed that it is highly probably that top ranked variants (12-20% expression and up) are truly somatic mutations.  In most cases, it is probable that middle-ranked variants (6-12%) are also somatic mutations, but the existence of homopolymer runs was observed to skew data in simulated reads.  Finally, while many lower ranked variants (0-3%) may be somatic mutations, it will be very difficult to separate somatically mutated low-ranked variant from gene sequencing error.  This report proposes a simple method for checking middle-ranked antibody sequences for potential errors before bacterial expression and characterization:  a multiple sequence alignment comparing the sequence of interest to the top-ranked sequences can be used to identify insertion/deletion errors that are likely caused by gene sequencing, or identify mutations that are more likely induced somatically.  This work has lent a greater degree of confidence to our data analysis and will conserve laboratory resources by enhancing our ability to select winning antibodies.



1.   Reddy, S.T., et al., Monoclonal antibodies isolated without screening by analyzing the variable-gene repertoire of plasma cells. Nature biotechnology, 2010. 28(9): p. 965-U20.
2.   Margulies, M., et al., Genome sequencing in microfabricated high-density picolitre reactors. Nature, 2006. 441(7089): p. 120-120.
3.   Huse, S.M., et al., Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology, 2007. 8(7)

Supplemental Data


Rank 1 CDR3 error multiple alignment file