HMM
Data sources
The HMM was built using profiles of Immunoglobulins (Ig) acquired from the Pfam database Immunoglobulin V-set domain. (PF07686) | Source: https://pfam.xfam.org/family/PF07686
The human exome database was acquired from the Human Genome Resources at NCBI. | Source: https://www.ncbi.nlm.nih.gov/genome/guide/human/
Building HMM
V-set domains are Ig-Like domains that resemble the antibody variable domain. Utilizing HMMER software and a multiple-peptide sequence alignment of Ig-V set domains provided by Pfam; a HMM was created that could predict the peptide sequences of Ig-V set domains. Predictions were made by searching through a database of human peptide sequences (hg38). Validation of the HMM was performed by attempting to identify Ig-V set d the full representative proteomes of Ig-V set domains also supplied by Pfam. Our HMM matched with the representative proteome domains as expected.
Example Code using HMMER:
hmmbuild Ig-Vset.hmm PF07686_seed.txt
[Function] [HMM Output file] [Training Data]
Extracting features from HMM
Using the previously built HMM (Ig-Vset.hmm), we searched through the Human exome (acquired from NCBI) to predict Ig-V domains present in expressed proteins found in humans. Only significant hits of predicted Ig-V domains were returned.
Example Code using HMMER:
hmmsearch Ig-Vset.hmm GRCh38_latest_genomic.faa
[Function] [HMM] [Database]
Antigen Target Prediction
To understand how the predicted domains from the HMM may bind to different targets, we generated probabilistic predictions of potential antigen targets. To do this, the back-end programs that contribute to the Sequence-based protein-protein partner search (SPPS) model and the Kangaroo biological sequence database pattern-matching program.
SPPS support vector machine model
SPPS used a machine learning technique called support vector machine (SVM), which determines classes using the vector space of training data and hyperplanes going through the vector space. If we have time, here is a cool visualization of the basic idea of an SVM:
https://cs.stanford.edu/people/karpathy/svmjs/demo/
This particular SVM was trained on data from many proteins using amino acid grouped by property, such as dipole interactions and volume of the side chains. This was done to reduce the high dimensionality of the vector space, avoiding overfitting and correcting for synonymous mutations. At a very high level, the protein-protein interactions are determined via concatenating the vector space for two proteins and evaluating support vectors within this space. To only represent the important information of each protein interaction, amino acids were abstracted to a conjoint triad--regarding a single amino acid and its adjacent amino acids as a single unit. This helps extract information about interactions from just sequence rather than structure.
After calibrating parameter of the model to optimize the tradeoff between margins and training error (bias-variance tradeoff), their model reached about 88% accuracy. This is shown titled "SVM calibration".
SVM calibration
Kangaroo pattern matching program for biological sequences
After a model was identified for predicting protein-protein interactions based on sequence, it needed to be implemented in a high-throughput way for query sequences against a database of antigen sequences.
Kangaroo is a program that can do ad-hoc that performs basic regular expression screening for queries against biological databases. This program can use the predicted interactions outputted from the SPPS support vector machine as a regular expression to search through a database to evaluate probability of binding for each sequence in the database. This was done on an antigen database pulled from SwissProt.
Kangaroo can only screen one sequence against one database at a time. Here is our solution for doing this in a high-throughput way.
#!/bin/bash
for ((i=1;i<50;i+=2))
do
sed "${i}q;d" ig_topIsoForms.txt > ${i}_sequence.fasta
j=$(($i+1))
sed "${j}q;d" ig_topIsoForms.txt >> ${i}_sequence.fasta
output=$(head -1 ${i}_sequence.fasta)
output=$(echo "$output" | awk '{print $2}')
echo $output
#./Kangaroo -f 3 -p 1 -m human_model.pmodel -q ${i}_sequence.fasta -d antigens.fasta -o ${output}_binding
done
wait
Data sources
SwissProt annotated protein database was used to identify target antigens with highly studied sequences.
Sample input data
Data from HMM was pulled in FASTA format. Antigen database was also in FASTA format.
> XP_011511011.1_isoform0
D-----KSGKLELQMADSFDTGVYHCIS-SNY
> NP_056234.2_isoform0
QRDQTVLEGGPCQLSCNVKAS---ESPSIFWVLPDG------SILKA----PMDDPDSKFSILS-----SGWLRIKSMEPSDSGLYQCIA
> XP_011511010.1_isoform0
D-----KSGKLELQMADSFDTGVYHCIS-SNY
Sample output
Output was given in a format where the name of antigen was given along with a probability of binding to the tested sequence. A file exists for each predicted immunoglobulin domain against the same database.
sp|P18577|RHCE_HUMAN Blood group Rh(CE) polypeptide OS=Homo sapiens OX=9606 GN=RHCE PE=1 SV=2 -1 0.311159 0.688841
sp|P04233|HG2A_HUMAN HLA class II histocompatibility antigen gamma chain OS=Homo sapiens OX=9606 GN=CD74 PE=1 SV=3 -1 0.215687 0.784313
sp|P04440|DPB1_HUMAN HLA class II histocompatibility antigen, DP beta 1 chain OS=Homo sapiens OX=9606 GN=HLA-DPB1 PE=1 SV=1 -1 0.0635055 0.936495
sp|P01920|DQB1_HUMAN HLA class II histocompatibility antigen, DQ beta 1 chain OS=Homo sapiens OX=9606 GN=HLA-DQB1 PE=1 SV=2 -1 0.190846 0.809154
sp|Q5Y7A7|2B1D_HUMAN HLA class II histocompatibility antigen, DRB1-13 beta chain OS=Homo sapiens OX=9606 GN=HLA-DRB1 PE=1 SV=1 -1 0.115225 0.884775
sp|Q9GIY3|2B1E_HUMAN HLA class II histocompatibility antigen, DRB1-14 beta chain OS=Homo sapiens OX=9606 GN=HLA-DRB1 PE=1 SV=1 -1 0.116424 0.883576
sp|P79483|DRB3_HUMAN HLA class II histocompatibility antigen, DR beta 3 chain OS=Homo sapiens OX=9606 GN=HLA-DRB3 PE=1 SV=1 -1 0.15028 0.84972
sp|Q30154|DRB5_HUMAN HLA class II histocompatibility antigen, DR beta 5 chain OS=Homo sapiens OX=9606 GN=HLA-DRB5 PE=1 SV=1 -1 0.18106 0.81894
sp|Q95IE3|2B1C_HUMAN HLA class II histocompatibility antigen, DRB1-12 beta chain OS=Homo sapiens OX=9606 GN=HLA-DRB1 PE=1 SV=
The antigen with the highest probability of binding to each domain was extracted.
Phylogeny Tree/ Clustering
An initial clustering was performed with the predicted gene segments amino acid frequencies from the HMM model using Morpheus. Amino acid frequencies were calculated using a python script. This hierarchical clustering yielded no productive insights pertaining to the gene segments. Another clustering software was then utilized, Clustal Omega. The gene segments were cleaned then run through Clustal Omega to create a multiple sequence alignment. Apart of the cleaning was adding "isoform" identifiers to each gene name. This allows for differentiation between segments in the same gene. Lower numbered isoforms are closer to the N terminus of the protein than higher isoforms. A predicted phylogenetic tree was computed using the Clustal Omega alignment.
Another tree generator, PHYLIP, was used to compare clustering and phyla creations to previously used software. PHYLIP and Java was installed and test trials were run to ensure proper installations. A Clustal Omega alignment was used to generate a file that was formatted correctly for PHYLIP. This was then run through the proml application within the PHYLIP package. Proml generates a tree output file that can view and manipulated in a different, Java-based application, DrawGram.
Supplemental code
Parsing output from both HMMer and Kangaroo
#######
import os
from Bio import SearchIO
def parse_HMM(infile_path, outfile_path, mode = 'hmmer3-text'):
"""
Function to parse through a HMMer file using Biopython.
"""
records = list(SearchIO.parse(infile_path, mode))
outfile = open(outfile_path, "w")
for i in range(0,len(records[0])):
for j in range(0, len(records[0][i])):
# print(str(">" + " " + str(records[0][i][j].hit_id)))
temp_string = ""
for elt in list(records[0][i][j].hit):
temp_string += str(elt)
outfile.write(str(">" + " " + str(records[0][i][j].hit_id)))
outfile.write("\n")
outfile.write(temp_string)
outfile.write("\n")
outfile.close()
def parse_Kangaroo(infile_path):
"""
Function to parse through Kangaroo biosequence screen outputself.
"""
infile = open(infile_path, "r")
infile.readline()
hit_list = []
hit = ""
highest = 0
for line in infile:
line = line.strip()
line_list = line.split(" ")
# determine if hit and then append to hit file
try:
prob = float(line_list[-2])
except ValueError:
pass
hit_item = []
hit_list_temp = []
if (prob > .8):
hit_item.append(line_list[0])
hit_item.append(line_list[-2])
hit_list_temp.append(hit_item)
hit_list_temp = sorted(hit_list_temp, key=lambda x: x[1])
for elt in hit_list_temp:
hit_list.append(elt[0])
# find the highest binding target
if (prob > float(highest)):
highest = prob
# make a string describing the antigens
hit = line_list[0]
infile.close()
return hit, hit_list
def write_binding_file(directory_path, output_path):
"""
Function to write a file describing binding of each protein.
"""
outfile = open(output_path, "w")
outfile.write(str("predicted" + "\t" + "highest" + "\t" + "targets" ))
outfile.write("\n")
for file in os.listdir(directory_path):
full_path = str(directory_path + "/" + file)
hit, hit_list = parse_Kangaroo(full_path)
hit_string = str(hit_list)
# for elt in hit_list:
# hit_string += str(elt + ", ")
binding_description = str(file + "\t" + hit + "\t" + hit_string)
outfile.write(binding_description)
outfile.write("\n")
outfile.write("\n")
outfile.close()