Project idea

10 June 2022

MDAnalysis handles many topology formats, each with its own conventions for presenting data. Not every piece of data is stated explicitly in a topology file, so MDAnalysis has the extra task of guessing the missing attributes. The main difficulty is that the formats don't speak the same language: the same data may be represented in completely different ways. That's why we need a guesser class that is aware of the language of the universe it belongs to.

Dealing with each format individually lets us tailor each attribute guesser to the nature of the universe and to the available attributes that can lead us to the desired one.

Here are two graphs representing the attribute relations for PDB and the Martini force field, respectively.

PDB attribute relations

Martini force field attribute relations

Project deliverables

The project has three main deliverables:

• Implement a context-aware guesser API to provide the guessing context to the universe.

• Build a robust guesser for the PDB file format.

• Build a robust guesser for the Martini force field.

1. Add a convenient way to provide the guessing context to the universe:

We need a more reliable guessing methodology that avoids the errors arising from the current guesser class (e.g., issues #2348, #3218, #2331, and #2265).

The problem with the current guesser is its generic nature, which makes it challenging to fit with different contexts and get the right output for all of them. To solve this problem, we can implement a context-aware guesser that is tailored for each force field/format system.

This new guesser package must fulfill the following properties:

• Guessing is not automatic: by default it is off, and the user chooses which properties to guess.
• The guesser should emit a warning or message when it succeeds or fails in guessing a property (one single message for the whole guessed attribute).
• Guessed properties should be easily modifiable.
• Passing the context to the universe should be possible both at initialization and afterwards.
• Modifying and maintaining guessers should be convenient.

Implementation:

The BaseGuesser class will be the parent class of every context-specific guesser. It will hold no guessing methods itself, leaving room for customized child guessers, but it will define the behavior that child guessers must follow, which keeps the guesser structure organized.
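As a rough illustration of this parent/child split, here is a minimal sketch. Everything except the name BaseGuesser is an assumption made for the example: the `_guessers` registry, the `is_guessable()` and `guess_attr()` methods, and the toy child class are hypothetical and not the final MDAnalysis API.

```python
class BaseGuesser:
    """Hypothetical parent class: defines the contract, holds no guess_* methods."""

    context = "base"
    # Maps an attribute name to the name of the method that guesses it.
    # Empty here; each child guesser fills in its own entries.
    _guessers = {}

    def __init__(self, atom_names):
        # Stand-in for the universe's topology data.
        self.atom_names = list(atom_names)

    def is_guessable(self, attr):
        return attr in self._guessers

    def guess_attr(self, attr):
        # Common dispatch logic shared by all children.
        if not self.is_guessable(attr):
            raise ValueError(
                f"{type(self).__name__} cannot guess attribute {attr!r}"
            )
        return getattr(self, self._guessers[attr])()


class ToyPDBGuesser(BaseGuesser):
    """Hypothetical child guesser with one context-specific method."""

    context = "pdb"
    _guessers = {"elements": "guess_elements"}

    def guess_elements(self):
        # Toy rule for illustration only: first letter of the atom name.
        return [name[0] for name in self.atom_names]
```

With this layout, adding a new context means writing a new child class and registering its methods; the dispatch and error handling stay in one place.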

At universe initialization, the context is passed as an argument (either a name or a guesser object). In addition, a to_guess list specifies the attributes to be guessed.

The guesser can also be called after the universe is created, through the guess_topology_attr() API, in the same spirit as the implementation above.

If the user doesn't specify a context, a DefaultGuesser is used; this guesser class holds the current generic guessing methods.

The universe will then check the validity of the passed arguments and raise the appropriate errors.

The guessing methods will emit warnings or error messages depending on their results. The output messages should describe precisely how the universe was updated and warn about any failed guesses.
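The flow described so far (context given by name or as an object, a to_guess list, a default fallback, and argument validation) could look roughly like the sketch below. `ToyUniverse`, `resolve_guesser()`, the `guessable` sets, and the exact signature of `guess_topology_attr()` are assumptions for illustration; only the names DefaultGuesser, to_guess, and guess_topology_attr come from the proposal.

```python
class DefaultGuesser:
    context = "default"
    guessable = {"masses", "types"}  # illustrative set


class PDBGuesser:
    context = "pdb"
    guessable = {"masses", "types", "elements"}  # illustrative set


# Hypothetical registry: context name -> guesser class.
_GUESSERS = {cls.context: cls for cls in (DefaultGuesser, PDBGuesser)}


def resolve_guesser(context=None):
    """Accept None, a context name, or a ready-made guesser object."""
    if context is None:
        return DefaultGuesser()
    if isinstance(context, str):
        try:
            return _GUESSERS[context.lower()]()
        except KeyError:
            raise ValueError(
                f"Unknown guessing context {context!r}; "
                f"available: {sorted(_GUESSERS)}"
            ) from None
    if hasattr(context, "guessable"):
        return context
    raise TypeError("context must be a name or a guesser object")


class ToyUniverse:
    """Stand-in for a universe; it only records what was guessed."""

    def __init__(self, context=None, to_guess=()):
        self.guesser = resolve_guesser(context)
        self.guessed = []
        self.guess_topology_attr(to_guess)

    def guess_topology_attr(self, to_guess, context=None):
        # Can also be called after creation, optionally with a new context.
        if context is not None:
            self.guesser = resolve_guesser(context)
        unsupported = [a for a in to_guess if a not in self.guesser.guessable]
        if unsupported:
            raise ValueError(
                f"{self.guesser.context} guesser cannot guess {unsupported}"
            )
        self.guessed.extend(to_guess)
```

In this sketch, `ToyUniverse(context="pdb", to_guess=["elements"])` guesses at creation time, while `ToyUniverse()` guesses nothing and falls back to the default guesser until `guess_topology_attr()` is called.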
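One possible way to keep the output to a single message per attribute is sketched below. The helper name `report_guess()`, the convention that a failed guess is stored as None, and the choice of logging for success versus warnings for failure are all assumptions for the example.

```python
import logging
import warnings

logger = logging.getLogger("guesser_sketch")  # hypothetical logger name


def report_guess(attr, values):
    """Emit one summary message for a whole guessed attribute.

    `values` holds one guessed value per atom; None marks a failed guess
    (an assumption made for this sketch).
    """
    n_total = len(values)
    n_failed = sum(value is None for value in values)
    if n_failed:
        message = (
            f"Guessed {attr} for {n_total - n_failed} of {n_total} atoms; "
            f"{n_failed} could not be guessed."
        )
        warnings.warn(message, UserWarning, stacklevel=2)
    else:
        message = f"Guessed {attr} for all {n_total} atoms."
        logger.info(message)
    return message
```

Summarizing once per attribute avoids flooding the user with one warning per atom on large systems.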

Deciding how an attribute is guessed is the job of the corresponding guesser method in each class. This way, the user doesn't have to worry about how a guesser works. In the spirit of context-specific guessing, we gain abstraction by having smart, context-aware guessers that know exactly how an attribute should be guessed in a specific environment. For example, guessing masses for PDB relies on the element property, while for Martini it relies on the bead type (atom type in MDAnalysis).

2. PDB guesser

The PDB format is a highly systematic file format that mainly represents the structures of biological macromolecules. The current guessing methods don't handle PDB files optimally, which makes guessing slow and unreliable. A PDB-aware guesser could improve this significantly, especially since PDB has a huge archive called the Chemical Component Dictionary (CCD), which describes every residue that can exist in a PDB file (its atom names, atom elements, bonds, bond orders, charges, aromaticity, etc.). With it, we don't need to assume any property of a PDB residue. Data can be retrieved from this dictionary in two ways: either build a local database in the guesser module, or use an API to get the data from the online CCD.

Build a local database

This can be done by building dictionary databases like the ones in tables.py; the different property guessers then look up data in those tables through the appropriate relations.
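A minimal sketch of such a lookup is shown here, assuming a two-level dictionary keyed by residue name and atom name. The table names, their layout, and the tiny excerpt of entries are hypothetical; a real table would be generated from the CCD.

```python
# Hypothetical excerpt: residue name -> atom name -> element symbol.
CCD_ELEMENTS = {
    "ALA": {"N": "N", "CA": "C", "C": "C", "O": "O", "CB": "C", "OXT": "O"},
    "HOH": {"O": "O"},
}

# Element symbol -> standard atomic mass (amu).
ELEMENT_MASSES = {"C": 12.011, "N": 14.007, "O": 15.999}


def guess_element(resname, atom_name):
    """Look up the element of an atom; return None if it is not in the table."""
    return CCD_ELEMENTS.get(resname.upper(), {}).get(atom_name.upper())


def guess_mass(resname, atom_name):
    """Chain two lookups: (residue, atom name) -> element -> mass."""
    element = guess_element(resname, atom_name)
    return ELEMENT_MASSES.get(element)
```

Because the lookup is keyed by residue, the atom name CA in alanine resolves to carbon rather than being mistaken for calcium, which is the kind of ambiguity a generic guesser cannot resolve.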

The database might be something like the tables below:

Advantages of a local database:

a. Fast to look up
b. Fewer dependencies on external packages

Disadvantages:

a. Need regular maintenance and design optimization
b. The database size could be huge

Use an API to get data from the online chemical component dictionary

pdbeccdutils is an API built by PDBe to retrieve data from the online CCD. It downloads and reads the mmCIF data format, then stores the residue data in a nested dictionary called component. With the appropriate key, we can get the data we want.

Advantages of API:

a. No need to store or maintain data

Disadvantages:

a. The API is still under development, which might result in frequent instability.
b. The downloaded files for each residue contain lots of data the guesser doesn't need, which may slow down the guessing process.

3. Martini guesser

The Martini force field deals with coarse-grained particles rather than all-atom models, so guessing the properties of its particles needs special care. In the Martini guesser implementation, we will focus on the newly released version, Martini 3. In this version, there are seven bead types based on chemical properties, further subdivided by numbers and labels according to polarity and hydrogen donor/acceptor properties. In addition, there are three bead sizes, determined by the number of non-hydrogen atoms forming the bead.

The bead's name carries lots of chemical information that can be useful in various analyses. For the first version of the Martini guesser, we will initially focus on guessing bead types (atom type in MDAnalysis), masses, and charges.

Bead types

Martini has a well-established database for almost all the beads from which most polymers (lipids, proteins, carbohydrates, etc.) are built. We can build a local database to store these data and, with proper mapping, guess the bead types for known polymers. If the residue is unknown to our database, we can check whether the bead name is a valid bead type; if it is, we can safely use it as the bead type.
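The two-step strategy (table lookup first, then validating the bead name itself) might be sketched as follows. The table entries are illustrative placeholders to be checked against the official Martini topology files, and the regular expression is a simplified assumption about what a valid Martini 3 bead type looks like, not a complete specification.

```python
import re

# Hypothetical excerpt: (residue name, bead name) -> bead type.
# Entries are placeholders for illustration only.
BEAD_TYPES = {
    ("W", "W"): "W",
    ("ION", "NA"): "TQ5",
}

# Simplified pattern (an assumption): optional size prefix S/T,
# a chemical class with its level, then optional lowercase labels.
_VALID_BEAD = re.compile(r"[ST]?(?:[PNC][1-6]|X[1-4]|Q[1-5]|D|W)[a-z]*")


def guess_bead_type(resname, bead_name):
    """Return the bead type, or None if it cannot be guessed."""
    known = BEAD_TYPES.get((resname.upper(), bead_name.upper()))
    if known is not None:
        return known
    # Fallback: an unknown residue whose bead name is itself a valid type.
    if _VALID_BEAD.fullmatch(bead_name):
        return bead_name
    return None
```

Returning None for names that match neither step leaves the decision about unguessable beads to the caller.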

Masses

Martini has three bead sizes, each with a default mass regardless of the underlying atom types. The default masses of R, S, and T beads are 72, 54, and 36 amu, respectively; this information is encoded in the bead type. However, if the molecule has a virtual site, its mass must be distributed over the other beads in the molecule, which makes the virtual site an important attribute to store.
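A small sketch of the size-based lookup is given below. It assumes that small and tiny beads carry an S or T prefix in the bead type and that regular beads carry no prefix; the function names are hypothetical, and virtual-site handling is left out.

```python
# Default masses (amu) for regular, small, and tiny beads, as stated above.
BEAD_SIZE_MASSES = {"R": 72.0, "S": 54.0, "T": 36.0}


def bead_size(bead_type):
    """Infer the size class from the bead type prefix.

    Assumption: 'S' and 'T' prefixes mark small and tiny beads;
    anything else is treated as a regular (R) bead.
    """
    prefix = bead_type[:1]
    return prefix if prefix in ("S", "T") else "R"


def guess_mass(bead_type):
    """Return the default mass for a bead type based on its size class."""
    return BEAD_SIZE_MASSES[bead_size(bead_type)]
```

For example, `guess_mass("P1")` falls in the regular class, while `guess_mass("SP1")` and `guess_mass("TC3")` pick up the small and tiny defaults.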

Charges

Monovalent and divalent ions are distinguished by the bead types Q and D. For the sign of the charge, there are two clues. The first is the positive and negative labels (p/n), which are given to some charged molecules that possess hydrogen-bonding capabilities (p for a positive acceptor molecule and n for a negative donor molecule). The second is the Martini bead assignment table for charged molecules, which we can use as a dictionary database for our guess_charge() method. However, some anions and cations share common names, so this table alone is not sufficient; we can use residue names or the neighboring beads as hints to make a better guess.
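Combining these clues, a guess_charge() method could be sketched as follows. The bead type pattern, the `name_hint` parameter, the placeholder table, and the choice to return 0.0 for uncharged classes and None for undecidable signs are assumptions for the example; the neighboring-bead hint is not covered.

```python
import re

# Hypothetical excerpt of an assignment table: name hint -> charge.
# Names and values are illustrative placeholders.
KNOWN_CHARGES = {"NA": 1.0, "CL": -1.0, "CA": 2.0}

# Charge magnitude by bead class: Q is monovalent, D is divalent.
_MAGNITUDE = {"Q": 1.0, "D": 2.0}

# Simplified pattern (an assumption): optional size prefix, class Q or D,
# optional level digit, optional lowercase labels.
_CHARGED_BEAD = re.compile(r"[ST]?([QD])\d?([a-z]*)")


def guess_charge(bead_type, name_hint=None):
    """Guess a bead's charge; return None when the sign is undecidable."""
    match = _CHARGED_BEAD.fullmatch(bead_type)
    if match is None:
        return 0.0  # not a Q or D bead: treated as neutral in this sketch
    magnitude = _MAGNITUDE[match.group(1)]
    labels = match.group(2)
    # First clue: the p/n labels.
    if "p" in labels:
        return magnitude
    if "n" in labels:
        return -magnitude
    # Second clue: the assignment table, keyed by a residue or bead name.
    if name_hint is not None:
        return KNOWN_CHARGES.get(name_hint.upper())
    return None
```

In this sketch the same bead type can resolve to opposite charges depending on the name hint, and with no hint the function returns None rather than guessing a sign.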