Determination of antigen-specific antibody design criteria using cluster analysis

 

Project for CH391L

 

Kam Hon Hoi and Zhen Xia

Introduction

The immune system is an indispensable host defense against harmful pathogens. The cooperative protection provided by innate and adaptive immunity contributes to a robust immune response against harmful antigens. In particular, adaptive immunity provides memory and long-term protection, allowing a swift response upon pathogen re-exposure [1]. A major component of this protection is carried out by antibodies. The dual ability of an antibody to recognize antigens and to confer effector functions makes it a versatile and effective component of adaptive immunity. Antibodies occur in several isotypes, of which the most abundant in blood circulation is immunoglobulin G (IgG) [2]. A full-length IgG typically has a molecular weight of about 150 kDa and is composed of two types of polypeptide chains, the heavy chain and the light chain, as shown in Figure 1.

Figure 1. Full-length IgG structure [2].

As depicted in Figure 1, an antibody has two functionally distinct regions. The constant region conveys the effector functions and provides the supporting structure of the antibody. The variable region comprises the variable light (VL) and variable heavy (VH) portions of the antibody. This region is the interface for non-covalent antigen binding and recognition; hence, a diverse variable region allows interaction with a wide range of antigenic domains, or epitopes. Sequence variability analysis reveals that VL and VH each contain three highly variable regions called complementarity-determining regions (CDR1, CDR2, and CDR3), as shown in Figure 2. The remaining regions are relatively constant, suggesting a structural support role as opposed to the recognition role of the CDRs.

Figure 2. Highly variable regions in VL and VH [2].

The diversity found in VL and VH arises from several cellular mechanisms, namely V(D)J recombination and somatic hypermutation [3]. These mechanisms are extensively reviewed in immunology textbooks [2] and other seminal papers [4-6], so they are not discussed further here. With the help of these mechanisms, the immune system can generate a diversity of up to 10^12 different CDRs. Therefore, one can almost certainly find an antibody specific to any antigen that the immune system has been exposed to. Thus, the physicochemical, structural, and sequence information intrinsically recorded in the CDRs can reveal the determinant criteria for the specific recognition of antigenic epitopes. The immune system has been used for decades to obtain therapeutically relevant antibodies, for example in vaccination and biomarker detection. Hybridoma technology [7] exploits the antibody-producing cells generated by the immune system to identify antigen-specific clones, from which valuable antigen-specific antibodies are later recovered for use as biotherapeutics. With the recent reduction in sequencing costs, Reddy et al. developed a practical next-generation sequencing platform to mine the immune repertoire for antigen-specific antibodies [8]. This platform takes advantage of the ontogeny and development of antibody-producing immune cells (plasma cells) to screen for highly antigen-specific antibodies. As shown in Figure 3, B cells pass through several developmental stages of selection before ultimately differentiating into antibody-producing plasma cells.

Figure 3. Schematic diagram of B cell development [5].

These developmental stages optimize the antigen specificity of the antibody expressed by the B cells via V(D)J recombination and somatic hypermutation. Hence, by the time these B cells terminally differentiate into plasma cells, their antibodies should be highly antigen-specific. As a result, deep sequencing of the antibody genes from these cell populations should, in theory, yield highly antigen-specific antibody sequences.

In this CH391L project, we analyzed plasma cell antibody VH sequences obtained from three female BALB/c mice individually immunized with ovalbumin (OVA), C1 complement complex component (C1s), or human B-cell regulator of IgH transcription (Br). Only the top 20% most highly polarized VH sequences were considered, as these should mainly represent antigen-specific antibody VH sequences. VH sequences, especially the CDR3 regions, have been shown to have a dominant effect on antigen specificity [9]. Therefore, to limit the project to a manageable scope, only VH sequences were analyzed as a proof-of-concept of our approach. To determine the underlying determinant criteria for each antigen, the pooled sequences were clustered using various parameters. Parameters that deconvoluted the pooled sequences into their respective antigen groups were considered determinant criteria. Different combinations of parameters were tested to determine their relevance to antigen specificity. These determinant criteria would provide guidance for the de novo design of antibodies against the tested antigens. This proof-of-concept could lead to future work in which more antigen sets or families are analyzed for their determinant criteria, with the goal of generating more general guidance for the de novo design of antibodies against families of antigens. Once sufficient rules have accumulated, we envision a database that supports de novo, in silico design of highly antigen-specific antibodies whose affinity could be further improved via rational design.

 

Materials and Methods

Animal model and immunization

Female BALB/c mice at 6 weeks of age were obtained from The Jackson Laboratory (Bar Harbor, ME). Mice were housed at the Animal Resource Center at the University of Texas at Austin. Immunization protocols followed the guidelines of the Institutional Animal Care and Use Committee at the University of Texas at Austin. Purified ovalbumin [OVA] (Sigma), purified C1s [C1s] (CalBiochem), or human B-cell regulator of IgH transcription [Br] (Georgiou Lab) was reconstituted in sterile PBS at 1 mg/mL. Primary immunizations were performed by backpad subcutaneous injection of 25 µg of antigen emulsified with 25 µL of complete Freund's adjuvant (CFA, Pierce Biotechnology). On day 21 after the primary immunization, secondary immunizations were performed by intraperitoneal injection of 25 µg of antigen emulsified with 50 µL of incomplete Freund's adjuvant (IFA, Pierce Biotechnology). Five days after the secondary immunization, mice were euthanized by CO2 asphyxiation. Femurs and tibiae were collected from each mouse for plasma cell isolation.


Plasma cell preparation and total RNA isolation

Harvested femurs and tibiae were stored in buffer number 1 solution as described in a previous study [8]. Bone marrow was flushed out using a 27-gauge needle connected to a 10 cc syringe, and a single-cell suspension was prepared by filtering through a 70 µm filter followed by gentle mixing. Plasma cells were enriched using a biotin-conjugated CD138 antibody (eBioscience) and streptavidin-conjugated Dynabeads (Invitrogen) for magnetic-activated cell sorting (MACS, Miltenyi). Total RNA was isolated from the enriched plasma cells using the Ribopure RNA isolation kit (Ambion) according to the manufacturer's protocol. Isolated total RNA was stored at -20 °C.

 

cDNA generation and next-generation sequencing

First-strand cDNA was generated from 500 ng of isolated total RNA using the Retroscript kit (Ambion). After first-strand synthesis, the VH genes were amplified by PCR with the primer mixture and temperature profiles described by Reddy et al. [8]. Gel-purified PCR products were submitted to the Genomic Sequencing and Analysis Center at the University of Texas at Austin for Roche GS-FLX 454 DNA sequencing.

 

Bioinformatics analysis

Sequences retrieved from 454 sequencing were analyzed using the ImMunoGeneTics (IMGT) database [10]. The IMGT results were parsed with Perl scripts to compile the information relevant to the analysis, such as CDR identification. After determining the frequencies of the CDR3s, the VH sequences of the top 20% most prevalent CDR3s (about 2,000 sequences from each antigen) were retrieved for the clustering analysis. All VH sequences were aligned using the MUSCLE program [11]. The alignment was then refined to improve it in the highly variable regions: CDR1, CDR2, and CDR3. To ensure equal contributions from each antigen, 500 sequences were randomly selected from each antigen-specific sequence pool (1,500 sequences total). The phylogenetic relationships of these randomly selected sequences were inferred using the MEGA package [12]. Because of limitations of the clustering program, the number of sequences had to be reduced further: 200 sequences were randomly selected from each antigen-specific sequence pool (600 sequences total) and used for the clustering analysis with combinations of key features. The key features were the amino acid frequency of each VH sequence, the amino acid lengths of the CDR1 and CDR3 regions, the hydrophobicity, and the numbers of arginine (R) and lysine (K) residues in the CDR3 region. The clustering analysis and display of results were carried out using the Cluster and TreeView packages by Eisen et al. [13].
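The selection and subsampling steps described above can be sketched as follows. This is a minimal illustration in Python, not the Perl scripts used in the project. The record layout (pairs of VH sequence and CDR3), the function names, and the reading of "top 20%" as the top fifth of distinct CDR3s ranked by frequency are all assumptions made for the example.

```python
import random
from collections import Counter


def top_cdr3_sequences(records, fraction=0.2):
    """Keep records whose CDR3 ranks in the top `fraction` of distinct CDR3s.

    `records` is a list of (vh_sequence, cdr3) tuples; this layout is
    hypothetical and stands in for the parsed IMGT output.
    """
    counts = Counter(cdr3 for _, cdr3 in records)
    ranked = [cdr3 for cdr3, _ in counts.most_common()]
    n_keep = max(1, int(len(ranked) * fraction))
    keep = set(ranked[:n_keep])
    return [rec for rec in records if rec[1] in keep]


def subsample(records, n, seed=0):
    """Draw up to n records without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


# Toy records (hypothetical); five distinct CDR3s, one of them dominant.
records = (
    [("VH_a", "ARSDRYDGYFDY")] * 5
    + [("VH_b", "ARDDYGNYFDY")] * 3
    + [("VH_c", "AREYGNYFDY"), ("VH_d", "ARGYFDY"), ("VH_e", "ARWYFDV")]
)
selected = top_cdr3_sequences(records, fraction=0.2)
sampled = subsample(selected, n=3)
```

Drawing the same number of records from each antigen pool, as `subsample` does for one pool, keeps any single antigen from dominating the pooled data.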

 

Results

Phylogenetic relationships

A phylogenetic tree was built to investigate the phylogenetic relationships among all the sequences, as shown in Figure 4. The Neighbor-Joining method was used to generate the tree. In Figure 4, sequences with the same antigen specificity are shown in the same color, and the sizes of the triangles reflect the number of sequences in the subtrees. A majority of the sequences (~80% of all sequences) cluster into three independent triangles, shown in red, green, and olive green in Figure 4. However, not all sequences group onto the same branch; several sequences are segregated from their major group. These smaller groups tend to stay relatively close to their respective major groups, with the exception of the HEL-specific sequences (olive green), which are scattered throughout the tree. This result led us to believe that the dataset contains inherent parameters that describe the essential features of antigen specificity. Hence, we proceeded to explore clustering algorithms and parameters that best deconvolute the sequences into their respective groups.
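Neighbor-Joining operates on a matrix of pairwise distances between aligned sequences. The sketch below shows one simple way to build such a matrix, using the proportion of differing sites (p-distance). The distance model actually used in MEGA is not stated in this report, so the choice of p-distance, the gap handling, and the function names are assumptions for illustration. The toy input is the aligned CDR3 consensus segment of each antigen taken from Table 1, with "-" marking gaps.

```python
def p_distance(a, b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns where both sequences have a gap are skipped; a gap opposite
    a residue counts as a difference (an assumption of this sketch).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    compared = 0
    diffs = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue
        compared += 1
        if x != y:
            diffs += 1
    return diffs / compared if compared else 0.0


def distance_matrix(aligned):
    """Symmetric matrix of pairwise p-distances."""
    n = len(aligned)
    return [[p_distance(aligned[i], aligned[j]) for j in range(n)]
            for i in range(n)]


# Consensus CDR3 segments (positions 99 to 112 of Table 1).
aligned = [
    "ARSDRYDGY--FDY",  # C1s
    "ARDDY--GNY-FDY",  # Br
    "ARE----YGNYFDY",  # HEL
]
dist = distance_matrix(aligned)
```

A tree-building routine would then repeatedly join the pair of sequences or subtrees that minimizes the Neighbor-Joining criterion computed from `dist`.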

Figure 4. Phylogenetic tree of all 3,000 antigen-specific sequences. Sequences belonging to the same antigen are grouped together and shown in the same color. The sizes of the triangles reflect the number of sequences in the subtrees. The figure was plotted with the MEGA program [12].

 

Amino acid frequency

Amino acid frequency is arguably the most fundamental parameter that one can use for clustering analysis. Therefore, we began with a cluster analysis based on the amino acid frequency of each VH sequence. Agglomerative hierarchical clustering (AHC) was used to assess the overall tree structure of all VH sequences. k-means clustering was then performed using the AHC results as the initial estimate of k. Euclidean distance was used as the distance metric in both clustering algorithms. Broadly, AHC divided the VH sequences well into their respective antigen-specific subgroups: most sub-clusters are dominated by a single antigen-specific family of sequences, as shown in Figure S1. According to the distance between merged clusters in the AHC results, a small number of sequences (~8% of the total) group into a separate cluster. Therefore, we chose 4 instead of 3 as the initial cluster number for k-means clustering. Other k values (k=3 to 10) were also used for comparison. However, the best result was obtained with k=4, and these data are shown in Figure 5. The C1s-, Br-, and HEL-specific VH sequences dominate clusters I (84.1% C1s), II (88.3% Br), and III (96.9% HEL), respectively. The HEL-specific sequences show the strongest tendency to form a group. The fourth cluster (IV) contains a smaller number of sequences, and its populations of each antigen-specific sequence type are similar, so no dominant group can be discerned in cluster IV. The detailed clustering results are shown in Figure S2.
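To make the k-means step concrete, the sketch below converts each sequence into a 20-dimensional amino acid frequency vector and clusters the vectors with a plain Lloyd's k-means under Euclidean distance. It is a stand-in written for illustration, not the Cluster program used in the project. The function names, the random initialization, and the toy sequences are assumptions, and the AHC step used to choose k is not reproduced.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequency(seq):
    """20-dimensional frequency vector; gaps and unknown symbols are ignored."""
    residues = [c for c in seq.upper() if c in AMINO_ACIDS]
    total = len(residues) or 1
    return [residues.count(a) / total for a in AMINO_ACIDS]


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def kmeans(vectors, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [-1] * len(vectors)
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy sequences (hypothetical): two compositionally distinct groups.
toy = ["AAAAAAAG", "AAAAAAGG", "AAAAAAAA", "WWWWWWWY", "WWWWWWYY", "WWWWWWWW"]
labels = kmeans([aa_frequency(s) for s in toy], k=2)
```

Because k-means depends on its starting centroids, running it with several seeds and several values of k, as was done here for k=3 to 10, is a reasonable way to check that the clusters are stable.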

Figure 5. The sequence composition of each sub-cluster. Clustering was performed with the k-means algorithm with the number of sub-clusters k equal to 4. The sequences collected for clustering are of three antigen-specific types: C1s (blue), Br (red), and HEL (green).

 

VH positional amino acid usage

As shown above, a majority of the sequences in each cluster shared the same antigen specificity. These strong groupings may indicate unique antigen recognition functionalities. Given knowledge of the antibody fold, one can determine which segments of the VH sequence belong to the framework regions and which to the CDRs. Therefore, by identifying regions such as the CDRs, one can reason about what makes the antigen recognition unique. From the sequence alignment, a total of 123 positions of the VH sequence (including the CDR1 to CDR3 segments) were determined, and the most frequent amino acid at each position was reported. As expected, the regions with the most diversity are generally the CDRs. Table 1 shows a portion of the alignment for the three antigen-specific VH sequence sets, reporting the most frequent amino acid at each position. As observed in Table 1, positions 101 to 109, which belong to a CDR, contain the most diversity. The consensus CDR3 is ARSDRYDGYFDY for the C1s VH sequences, ARDDYGNYFDY for the Br VH sequences, and AREYGNYFDY for the HEL VH sequences.

 

Table 1. The amino acid type of maximum frequency in each position.

Position  96   97   98   99   100  101  102  103  104  105  106  107  108  109  110  111  112  113  114  115  116
C1s seq   Y    Y    C    A    R    S    D    R    Y    D    G    Y    GAP  GAP  F    D    Y    W    G    Q    G
Br seq    Y    Y    C    A    R    D    D    Y    GAP  GAP  G    N    Y    GAP  F    D    Y    W    G    Q    G
HEL seq   Y    Y    C    A    R    E    GAP  GAP  GAP  GAP  Y    G    N    Y    F    D    Y    W    G    Q    G
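The per-position entries in Table 1 amount to taking the most frequent symbol in each alignment column. A minimal sketch of that computation is given below. The function names and the toy alignment are assumptions for illustration, gaps are written as "-" rather than GAP, and ties are broken by first occurrence, which may differ from how the original analysis handled them.

```python
from collections import Counter


def column_consensus(aligned):
    """Most frequent symbol (gaps included) in each alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to equal length")
    consensus = []
    for column in zip(*aligned):
        symbol, _count = Counter(column).most_common(1)[0]
        consensus.append(symbol)
    return "".join(consensus)


def ungapped(seq, gap="-"):
    """Remove gap characters to recover the plain consensus sequence."""
    return seq.replace(gap, "")


# Toy alignment (hypothetical): three aligned CDR3 segments.
aligned = [
    "ARSDRYDGY--FDY",
    "ARSDRYDGY--FDY",
    "ARGDRYDGYA-FDY",
]
consensus = column_consensus(aligned)
cdr3 = ungapped(consensus)
```

Columns where the consensus symbol is a gap drop out of the final CDR3 string, which is how a consensus shorter than the alignment width arises.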

 

Additionally, the physical properties of the amino acids, divided into four subgroups, were investigated as a potential descriptor of the uniqueness of the VH sequences. The four subgroups are the non-polar group (A, F, G, I, L, M, P, V, and W), the polar group (C, N, Q, S, T, and Y), the positively charged group (H, K, and R), and the negatively charged group (D and E). The distribution of the four subgroups was determined for each set of antigen-specific VH sequences. The distributions were very similar among the sequence sets, and no distinctive deviation could serve as a descriptor of the uniqueness of the VH sequences.
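The subgroup distribution can be computed per sequence as the fraction of residues falling into each of the four groups. The sketch below uses the group memberships listed above; the function name and the handling of gaps and unknown symbols (they are ignored) are assumptions for illustration.

```python
GROUPS = {
    "non-polar": set("AFGILMPVW"),
    "polar": set("CNQSTY"),
    "positive": set("HKR"),
    "negative": set("DE"),
}


def group_distribution(seq):
    """Fraction of residues in each physicochemical subgroup."""
    residues = [
        c for c in seq.upper()
        if any(c in members for members in GROUPS.values())
    ]
    total = len(residues)
    if total == 0:
        return {name: 0.0 for name in GROUPS}
    return {
        name: sum(c in members for c in residues) / total
        for name, members in GROUPS.items()
    }


# Example input: the C1s consensus CDR3 reported in the text.
dist = group_distribution("ARSDRYDGYFDY")
```

Averaging these fractions over all sequences of one antigen gives the per-antigen distribution that was compared across C1s, Br, and HEL.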

 

Other characteristics of antigen-specific VH sequences

Additional characteristics pertaining to the antigen specificity of VH sequences were explored as parameters for efficiently clustering the pooled sequences into their respective antigen-specific groups. Five parameters were investigated: 1) the length of CDR1 in amino acids, 2) the length of CDR3 in amino acids, 3) hydrophobicity, 4) the CDR3 arginine (R) count, and 5) the CDR3 lysine (K) count. These parameters were compiled for clustering, and k-means clustering (with k=3, 4, and 5) was performed after data normalization. To determine which parameters were effective, various combinations of the five parameters were used as input (expression) vectors and their clustering efficiencies were compared. The clustering results using parameter set A (1 to 5), parameter set B (2 to 5), and parameter set C (3 to 5) are shown in Figure S3, Figure S4, and Figure S5, respectively. The composition of each cluster is shown in Figure 6. With parameter set A, the C1s and HEL sequences did not cluster effectively, as shown in Figure 6a. With parameter set B, on the other hand, distinct clusters each containing a majority of the correct antigen-specific VH sequences were obtained, as shown in Figure 6b. With parameter set C, slightly less distinct clusters were obtained, as shown in Figure 6c. Although clustering was less than ideal in this case, only cluster I appeared to be affected; clusters II, III, and IV performed similarly to those obtained with parameter set B. Hence, parameter set B appears to allow effective clustering of antigen-specific VH sequences. The average values of each parameter in set B are summarized in Table 2.
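The feature compilation, normalization, and parameter-set selection can be sketched as follows. Several details are not specified in this report and are assumed here: hydrophobicity is taken as the mean Kyte-Doolittle hydropathy of the CDR3, normalization is a per-column z-score, and the CDR1 strings in the toy input are invented. The function and variable names are likewise hypothetical.

```python
import statistics

# Kyte-Doolittle hydropathy scale (assumed; the report does not name a scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}

# Column indices of parameters 1) to 5) in each parameter set.
PARAMETER_SETS = {"A": [0, 1, 2, 3, 4], "B": [1, 2, 3, 4], "C": [2, 3, 4]}


def cdr_features(cdr1, cdr3):
    """Parameters 1) to 5): CDR1 length, CDR3 length, hydrophobicity, R, K."""
    hydro = sum(KD[a] for a in cdr3) / len(cdr3)
    return [len(cdr1), len(cdr3), hydro, cdr3.count("R"), cdr3.count("K")]


def zscore_columns(rows):
    """Scale each column to zero mean and unit variance (constant columns -> 0)."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in rows
    ]


def select_parameters(rows, set_name):
    """Keep only the columns belonging to parameter set A, B, or C."""
    idx = PARAMETER_SETS[set_name]
    return [[row[i] for i in idx] for row in rows]


# Toy input: hypothetical CDR1 strings paired with the consensus CDR3s.
pairs = [
    ("GYTFTSYW", "ARSDRYDGYFDY"),
    ("GFTFSSYA", "ARDDYGNYFDY"),
    ("GYSITSGYY", "AREYGNYFDY"),
]
rows = [cdr_features(c1, c3) for c1, c3 in pairs]
normalized = zscore_columns(rows)
set_b = select_parameters(normalized, "B")
```

Normalizing before clustering matters because the raw parameters live on different scales: CDR3 length is around 11, while the lysine count is close to zero, so without scaling the length would dominate the Euclidean distance.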


Figure 6. The sequence composition of each sub-cluster. Clustering was performed with the k-means algorithm with the number of sub-clusters k equal to 4. The sequences collected for clustering are of three antigen-specific types: C1s (blue), Br (red), and HEL (green). a) Result from parameter set A: 1) the length of the CDR1 region in amino acids, 2) the length of the CDR3 region, 3) hydrophobicity, 4) the number of Arg in the CDR3 region, and 5) the number of Lys in the CDR3 region. b) Result from parameter set B: parameters 2) to 5) above. c) Result from parameter set C: parameters 3) to 5) above.


Table 2. Average values of each parameter in the best-performing parameter set B.

Parameters          C1s          Br           HEL
CDR3 region length  11.33±1.57   11.99±2.43   10.89±0.90
Hydrophobicity      -1.16±0.88   -1.27±0.50   -1.22±0.33
Arginine count      1.67±0.57    0.99±0.39    1.63±0.62
Lysine count        0.018±0.13   0.075±0.47   0.076±0.26
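Entries of the form mean±SD, as in Table 2, can be produced per antigen with a few lines of Python. The sketch below uses the sample standard deviation; whether the original table used the sample or the population form is not stated, and the function name and toy values are assumptions.

```python
import statistics


def summarize(values):
    """Format a list of numbers as 'mean±sd' with two decimals."""
    mean = statistics.mean(values)
    # Sample standard deviation (an assumption); needs at least two values.
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    return f"{mean:.2f}±{sd:.2f}"


# Toy CDR3 lengths per antigen (hypothetical values, not project data).
cdr3_lengths = {"C1s": [12, 11, 10], "Br": [11, 14, 11], "HEL": [10, 11, 12]}
table_row = {antigen: summarize(v) for antigen, v in cdr3_lengths.items()}
```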


Discussion

Various clustering parameters were used to identify the underlying components that allow effective clustering. Such critical parameters can serve as design guidance for the de novo generation of antigen-specific antibodies. As this project has shown, several combinations of parameters can effectively cluster the sequences into their antigen-specific groups. These relationships can be interpreted as signature criteria necessary for antigen recognition. The results suggest that no single parameter yields perfect clustering; rather, a combination is needed to approach it. The significance of each step in the analysis is discussed below.

Sequences with the same antigen specificity are more closely related phylogenetically. From that point of view, the close phylogenetic relationship could be the result of convergence driven by antigen specificity. This seems logical, as the sequences are inherently related to the structure and function of the antibodies. To recognize a given antigen, the antibody structure needs to fit the shape of that antigen. This selection criterion would contribute to the convergence of the antigen-specific sequences. Yet, as described in the Results section, some scattered clusters could not be properly grouped. Although the phylogenetic relationships indicated intrinsic features embedded in the amino acid frequency, a more detailed analysis of sequence-based parameters should improve the clustering efficiency. Having established that antigen-specific VH sequences can be clustered together, we considered other sequence-based parameters worthwhile candidates for further testing.

Amino acid frequency was investigated as the initial test parameter for clustering performance. Surprisingly, amino acid frequency effectively classified the antigen-specific VH sequences into their respective groups, as shown in Figure 5. We had originally expected the amino acid frequencies to be indistinguishable among the sequences, because a major portion of each sequence represents conserved framework regions; clustering was therefore expected to be ineffective owing to insufficient sensitivity. Contrary to this expectation, the results indicate that the CDRs alone provide enough sensitivity for the groupings. Proceeding with amino acid sequence-based parameters should therefore be a viable approach. Moreover, such parameters should be readily usable as design guidelines for the de novo generation of antigen-specific antibodies. It is noteworthy that the distribution of physical properties, in terms of hydrophobicity and charge, did not result in effective clustering. This is expected: the overall structures of different antibodies are very similar, and assembling a similar structure plausibly requires a similar distribution of physical properties. Therefore, the four physical property subgroups described in the Results section did not yield effective groupings. Nevertheless, the amino acid frequency experiment suggested that the CDRs were sufficient to classify the sequences, so proceeding with CDR sequence-based parameters was the logical next step.

A total of five CDR-related parameters were investigated. To recall, these were 1) the length of CDR1 in amino acids, 2) the length of CDR3 in amino acids, 3) hydrophobicity, 4) the CDR3 arginine (R) count, and 5) the CDR3 lysine (K) count. As Figure 6 shows, the best-performing combination of parameters is set B, which comprises the CDR3-related parameters summarized in Table 2. This agrees with the fact that CDR3 is the main region of contact, namely the antigen binding site. CDR1 and CDR2 play some role in recognition but mainly contribute to the structural stability of the variable regions. Additionally, the CDR3 length determines the allowable recognition breadth of the antigen binding site; thus, it should relate closely to the antigen for which the antibody is specific. Hydrophobicity and the R and K counts all pertain to the physicochemical properties of the CDR3, which would have a significant effect on the interactions between antigen and antibody. Thus, these parameters contribute substantially to antigen specificity and to the uniqueness of the sequences. Hence, the parameters shown in Table 2 can serve as design guidelines for antibodies against the antigens described in this report. To improve the affinity of an antibody, permutations of amino acid substitutions within the design guidelines can potentially serve as an affinity maturation strategy. This should significantly reduce the time required for affinity maturation as compared to maturation in biological systems.
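One way to picture how the values in Table 2 might act as a design filter is to test whether a candidate CDR3 falls within a chosen number of standard deviations of each antigen's mean. The sketch below does this for CDR3 length, arginine count, and lysine count, using the means and standard deviations from Table 2. The acceptance rule (within one standard deviation), the function names, and the omission of hydrophobicity (whose scale is not stated in this report) are assumptions for illustration, not a validated design procedure.

```python
# (mean, sd) pairs copied from Table 2.
GUIDELINES = {
    "C1s": {"length": (11.33, 1.57), "arginine": (1.67, 0.57), "lysine": (0.018, 0.13)},
    "Br": {"length": (11.99, 2.43), "arginine": (0.99, 0.39), "lysine": (0.075, 0.47)},
    "HEL": {"length": (10.89, 0.90), "arginine": (1.63, 0.62), "lysine": (0.076, 0.26)},
}


def within(value, mean, sd, n_sd=1.0):
    """True if value lies within n_sd standard deviations of the mean."""
    return abs(value - mean) <= n_sd * sd


def matches_guideline(cdr3, antigen, n_sd=1.0):
    """Check a candidate CDR3 against one antigen's Table 2 parameters.

    The mean ± n_sd window is a hypothetical acceptance rule.
    """
    guideline = GUIDELINES[antigen]
    observed = {
        "length": len(cdr3),
        "arginine": cdr3.count("R"),
        "lysine": cdr3.count("K"),
    }
    return all(
        within(observed[name], mean, sd, n_sd)
        for name, (mean, sd) in guideline.items()
    )


# Candidates: the consensus CDR3s reported for C1s and Br.
c1s_ok = matches_guideline("ARSDRYDGYFDY", "C1s")
br_ok = matches_guideline("ARDDYGNYFDY", "Br")
c1s_vs_br = matches_guideline("ARSDRYDGYFDY", "Br")
```

Candidate substitutions that keep a CDR3 inside the window for the target antigen could then be prioritized for experimental affinity testing.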

This methodology can be applied to multiple common antigens, and the resulting information can be archived to construct a database. With a sufficiently large database, general rules might be determined that allow in silico design of antibodies against novel antigens. Nevertheless, this project serves as a proof-of-concept, and it generated a set of design guidelines for antibodies against three antigens: C1s, HEL, and Br. Thus, this methodology is worth expanding to other antigens and to the exploration of other potential parameters.


References

1. McHeyzer-Williams, L.J., Malherbe, L.P. & McHeyzer-Williams, M.G. Checkpoints in memory B-cell evolution. Immunol. Rev. 211, 255-268 (2006).

2. Murphy, K.P. Janeway’s Immunobiology. (Garland Science: New York, 2008).

3. Schroeder, H.W. et al. Developmental Regulation of the Human Antibody Repertoire. Annals of the New York Academy of Sciences 764, 242-260 (2008).

4. Sanz, I. Multiple mechanisms participate in the generation of diversity of human H chain CDR3 regions. The Journal of Immunology 147, 1720-1729 (1991).

5. Schroeder, J. Similarity and divergence in the development and expression of the mouse and human antibody repertoires. Developmental & Comparative Immunology 30, 119-135 (2006).

6. Glanville, J. et al. Precise determination of the diversity of a combinatorial antibody library gives insight into the human immunoglobulin repertoire. Proceedings of the National Academy of Sciences 106, 20216-20221 (2009).

7. Köhler, G. & Milstein, C. Continuous cultures of fused cells secreting antibody of predefined specificity. 1975. J. Immunol. 174, 2453-2455 (2005).

8. Reddy, S.T. et al. Monoclonal antibodies isolated without screening by analyzing the variable-gene repertoire of plasma cells. Nat Biotech 28, 965-969 (2010).

9. Davis, M.M. The evolutionary and structural “logic” of antigen receptor diversity. Semin. Immunol. 16, 239-243 (2004).

10. Brochet, X., Lefranc, M.-P. & Giudicelli, V. IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized V-J and V-D-J sequence analysis. Nucleic Acids Res. 36, W503-W508 (2008).

11. Edgar, R. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113 (2004).

12. Tamura, K., Dudley, J., Nei, M. & Kumar, S. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) Software Version 4.0. Molecular Biology and Evolution 24, 1596-1599 (2007).

13. Eisen, M.B., Spellman, P.T., Brown, P.O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95, 14863-14868 (1998).