We began our analysis with the protein sequences of the serine proteases.
To investigate variation in the active-site environment of serine proteases, hierarchical clustering and principal component analysis (PCA) were performed on a residue presence/absence matrix derived from the 8 Å neighborhood surrounding the catalytic Ser195. In this matrix, each structure is represented by a binary feature vector indicating which residues are present near the catalytic site. PCA was applied to reduce dimensionality and visualize the overall similarity between structures, while hierarchical clustering was used to group proteases with similar active-site compositions.
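The workflow described above can be illustrated with a minimal sketch. It is not the analysis code used for this work: the structure names and neighbor sets are invented toy data, features are assumed to be residue types, PCA is computed through an SVD of the centered matrix, and the Jaccard distance with average linkage is an assumed choice, since the metric and linkage method are not stated in the text.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy input: residue types found within 8 Å of the catalytic
# serine in each structure (names and contents invented for illustration).
neighborhoods = {
    "struct_A": {"HIS", "ASP", "GLY", "ALA"},
    "struct_B": {"HIS", "ASP", "GLY", "CYS"},
    "struct_C": {"HIS", "ASP", "GLY", "ALA", "VAL"},
    "struct_D": {"HIS", "ASN", "THR", "LEU", "MET"},
}


def presence_matrix(neighborhoods):
    """Build a binary structures-by-features presence/absence matrix."""
    names = sorted(neighborhoods)
    features = sorted(set().union(*neighborhoods.values()))
    rows = [[1.0 if f in neighborhoods[n] else 0.0 for f in features]
            for n in names]
    return names, features, np.array(rows)


def pca_scores(X, n_components=2):
    """PCA via SVD of the column-centered matrix."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)  # variance fraction per component
    return scores, explained[:n_components]


names, features, X = presence_matrix(neighborhoods)
scores, explained = pca_scores(X)

# Assumed choices: Jaccard distance suits binary vectors; average linkage.
dist = pdist(X.astype(bool), metric="jaccard")
tree = linkage(dist, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")  # cut into two groups

for name, (pc1, pc2), lab in zip(names, scores, labels):
    print(f"{name}: PC1={pc1:+.2f} PC2={pc2:+.2f} cluster={lab}")
```

In this toy example the three structures that share the HIS, ASP and GLY features fall into one cluster and the fourth is separated. With real data, the choice of distance metric and linkage method can change the grouping and should be reported alongside the results.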
Hierarchical clustering of the active-site environments reveals that serine proteases share a highly conserved catalytic core while exhibiting variability in their surrounding residue composition and spatial organization.
In the heatmap, most residue count features remain consistently low across structures, indicating that only a limited subset of residues frequently occurs near the catalytic Ser195. Greater variation is observed in the distance-based features, suggesting that while similar residues are present, their spatial arrangement relative to the catalytic serine differs across structures.
Glycine (GLY) appears more frequently in the vicinity of the catalytic serine than most other residue types, suggesting that it plays an important structural role in the active-site environment. Despite the conservation of key catalytic residues, the overall number of residues such as glycine, aspartate (ASP), alanine (ALA) and histidine (HIS) within the 8 Å region remains relatively low. This pattern is consistent across the analyzed families: the catalytic core is maintained by a small number of essential residues, while the surrounding environment remains sparse and structurally constrained.
An aligned numbering system was created so that proteins can be compared even if they do not follow the conventional Ser195 numbering. In this unified system, the catalytic serine (N) is labeled position 0, a residue X positions upstream is labeled N+X, and a residue Y positions downstream is labeled N-Y. This allowed us to determine which residues are conserved at specific positions relative to the catalytic serine.
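The serine-centered numbering can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: "upstream" is taken to mean toward the N-terminus (lower sequence index), the function names are hypothetical, and the sequence fragments are invented examples built around the GDSGGP motif rather than entries from the dataset.

```python
from collections import Counter, defaultdict


def relative_positions(sequence, ser_index, upstream_positive=True):
    """Map offsets from the catalytic serine (offset 0) to residues.

    Assumption: "upstream" means toward the N-terminus, so with
    upstream_positive=True a residue X positions before the serine
    receives offset +X, following the convention in the text.
    """
    if sequence[ser_index] != "S":
        raise ValueError("ser_index does not point at a serine")
    sign = -1 if upstream_positive else 1
    return {sign * (i - ser_index): aa for i, aa in enumerate(sequence)}


def conservation_by_offset(entries, window=3):
    """Most common residue at each offset and the fraction of covering
    sequences that carry it."""
    columns = defaultdict(Counter)
    for sequence, ser_index in entries:
        for offset, aa in relative_positions(sequence, ser_index).items():
            if abs(offset) <= window:
                columns[offset][aa] += 1
    result = {}
    for offset in sorted(columns):
        residue, count = columns[offset].most_common(1)[0]
        result[offset] = (residue, count / sum(columns[offset].values()))
    return result


# Hypothetical fragments; the second value is the 0-based index of the
# catalytic serine within each fragment.
entries = [
    ("CQGDSGGPLV", 4),
    ("GDSGGPVT", 2),
    ("ATKGDSGGAL", 5),
]

result = conservation_by_offset(entries, window=3)
for offset, (residue, fraction) in result.items():
    print(f"N{offset:+d}: {residue} ({fraction:.2f})")
```

Because every sequence is indexed from its own catalytic serine, fragments of different lengths and with different native numbering line up without a full alignment. This only holds where no insertions or deletions occur between the serine and the position being compared.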