To assemble a library of proteins from the PA, SB, and SC families, a Python script was used to find and download protein structures from the RCSB Protein Data Bank.
QUERY:
First, the script defines three configuration dictionaries (one for each family), each specifying a Pfam accession, a resolution cutoff, a sequence identity threshold, and an exclusion for mutations. The code then builds a JSON query that applies these filters and groups the results by sequence identity to ensure a non-redundant set. Finally, it compiles all of the resulting PDB IDs and downloads their .ent files.
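The query-building step described above can be sketched as follows. This is a minimal illustration, not the script used in this work: the attribute names, operators, and grouping options are assumed from the public RCSB Search API (v2) and should be checked against its documentation, and the values in FAMILY_CONFIG are placeholders, not the cutoffs used in the study.

```python
import json

# Hypothetical configuration for one family; all values are placeholders.
FAMILY_CONFIG = {
    "PA": {"pfam": "PF00089", "max_resolution": 2.0, "identity_cutoff": 70},
}


def terminal(attribute, operator, value):
    """Build one attribute filter node (layout assumed from the RCSB Search API)."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def build_query(pfam, max_resolution, identity_cutoff):
    """Combine the filters with AND and group hits by sequence identity."""
    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [
                # Assumed attribute names for the Pfam annotation, resolution,
                # and mutation-count filters.
                terminal("rcsb_polymer_entity_annotation.annotation_id", "exact_match", pfam),
                terminal("rcsb_polymer_entity_annotation.type", "exact_match", "Pfam"),
                terminal("rcsb_entry_info.resolution_combined", "less_or_equal", max_resolution),
                terminal("entity_poly.rcsb_mutation_count", "equals", 0),
            ],
        },
        "return_type": "polymer_entity",
        "request_options": {
            # One representative per sequence-identity cluster keeps the set non-redundant.
            "group_by": {
                "aggregation_method": "sequence_identity",
                "similarity_cutoff": identity_cutoff,
            },
            "group_by_return_type": "representatives",
            "return_all_hits": True,
        },
    }


def query_for_family(name):
    """Look up a family's configuration and return its query as a JSON string."""
    cfg = FAMILY_CONFIG[name]
    return json.dumps(build_query(cfg["pfam"], cfg["max_resolution"], cfg["identity_cutoff"]))
```

The resulting JSON string would be sent to the search endpoint, and the returned identifiers would then be used to download the corresponding .ent files. Keeping the query construction separate from the network call makes the filters easy to inspect before any request is made.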
To locate the active-site serine, a "default position" of 195 was set for the PA family because of that family's residue-numbering convention. The remaining active-site serine positions were compiled manually in a dictionary titled "exceptions". The script then opens the previously downloaded .ent file for each protein and searches within 8.0 Å of the alpha carbon of the catalytic serine. For each neighbor found, it records the catalytic residue it lies near, the residue type, the spatial coordinates, and the distance to the catalytic serine's alpha carbon in an Excel spreadsheet. This analysis was repeated for all protein families.
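The neighbor search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original script: it reads the fixed-column ATOM records of the PDB format directly, the function names are hypothetical, the EXCEPTIONS dictionary is left empty because the manually compiled positions are not listed here, and the chain identifier "A" is an assumed default.

```python
import math

DEFAULT_SERINE_POSITION = 195  # default catalytic serine position (PA family)
EXCEPTIONS = {}  # hypothetical layout: PDB ID -> catalytic serine position


def serine_position(pdb_id):
    """Return the catalytic serine position, falling back to the default."""
    return EXCEPTIONS.get(pdb_id.upper(), DEFAULT_SERINE_POSITION)


def parse_ca_atoms(lines, chain_id):
    """Collect alpha carbons of one chain from fixed-column ATOM records."""
    atoms = []
    for line in lines:
        if len(line) < 54 or line[0:6] != "ATOM  ":
            continue
        if line[12:16].strip() != "CA" or line[21] != chain_id:
            continue
        if line[16] not in (" ", "A"):  # keep only the first alternate location
            continue
        atoms.append({
            "resname": line[17:20].strip(),
            "resseq": int(line[22:26]),
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        })
    return atoms


def neighbors_within(lines, serine_resseq, chain_id="A", radius=8.0):
    """Return the serine's alpha carbon and all alpha carbons within `radius` Å."""
    atoms = parse_ca_atoms(lines, chain_id)
    center = next(
        (a for a in atoms if a["resseq"] == serine_resseq and a["resname"] == "SER"),
        None,
    )
    if center is None:
        raise ValueError(f"No SER alpha carbon at position {serine_resseq}")
    rows = []
    for atom in atoms:
        if atom is center:
            continue
        distance = math.dist(center["xyz"], atom["xyz"])
        if distance <= radius:
            rows.append({
                "catalytic_residue": f"SER{serine_resseq}",
                "resname": atom["resname"],
                "resseq": atom["resseq"],
                "xyz": atom["xyz"],
                "distance": distance,
            })
    return center, rows
```

Each returned row holds the fields recorded in the spreadsheet, so the list could be handed to a spreadsheet writer of choice. A library such as Biopython offers equivalent structure parsing and neighbor searching; plain string slicing is used here only to keep the sketch dependency-free.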
As a result, we accumulated the following data:
Catalytic serine alpha carbon spatial coordinates
Alpha carbons within an 8 Å radius, and their corresponding amino acids
Their distances to the catalytic serine
Their spatial coordinates