Manual and Notes

Chemical Curation

Chemical curation covers both the identification and the correction of structural errors in a set of compounds. The steps involved depend on the type of analysis and/or modeling study. Here we describe the most common steps (especially recommended for QSAR studies), which have also been incorporated in the KNIME workflow available here:

1) Reading and storing the information present in the Structure-Data File (SDF)

When a dataset is downloaded from a database (such as BindingDB), it usually comes as an SDF file. This file holds the structural information for each molecule: the molecule name; the atom and bond counts; the 2D or 3D coordinates of the atoms; and the connectivity, bond types, stereo information, etc. Along with this requisite structural information, it may optionally contain various properties such as a unique ID, molecular weight, molecular formula, biological property (IC50/EC50/Ki), charge, SMILES notation, and InChI key. For chemical curation, the structural information is usually enough to find and then correct or remove erroneous and duplicate chemical structures. If chemical curation is followed by biological curation, however, one should also store the other properties that biological curation requires, such as the unique ID and the biological property. Molecules with inappropriate or incomplete structural information, or with missing values for the requisite properties, should be discarded. Reading the SDF file and storing the required information correctly is therefore the first step of chemical curation. In the KNIME workflow, we have used the “SDF Reader” node to read the input SDF file and store the structural information. This node also provides an option to read and store the properties available in the input SDF file along with the structural information.
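Outside KNIME, the record layout described above can be illustrated with a few lines of plain Python. This is only a minimal sketch, not the “SDF Reader” node: the names `read_sdf` and `parse_record` and the sample record (with made-up property values) are hypothetical, a V2000 molfile block is assumed, and only the counts line and the `> <NAME>` data items are interpreted.

```python
import re

# Data header lines look like "> <IC50>" (optionally with extra fields).
TAG = re.compile(r">.*?<([^>]+)>")

def parse_record(lines):
    """Parse one SDF record (list of lines); return None if it is incomplete."""
    try:
        # Counts line (4th line): atom and bond counts in 3-character fields.
        n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    except ValueError:
        return None
    if len(lines) < 4 + n_atoms + n_bonds:
        return None  # atom/bond block is truncated: discard the molecule
    props, i = {}, 4 + n_atoms + n_bonds
    while i < len(lines):
        match = TAG.match(lines[i])
        if match:
            i += 1
            value = []
            while i < len(lines) and lines[i].strip():
                value.append(lines[i])  # value lines run until a blank line
                i += 1
            props[match.group(1)] = "\n".join(value)
        i += 1
    return {"name": lines[0].strip(), "n_atoms": n_atoms,
            "n_bonds": n_bonds, "props": props}

def read_sdf(text):
    """Split SDF text on '$$$$' delimiters and keep the parseable records."""
    blocks, block = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            blocks.append(block)
            block = []
        else:
            block.append(line)
    parsed = (parse_record(b) for b in blocks if len(b) >= 4)
    return [record for record in parsed if record is not None]

# Hypothetical two-atom record with made-up property values.
SAMPLE = "\n".join([
    "methanol",                                   # line 1: molecule name
    "  sketch",                                   # line 2: program / timestamp
    "",                                           # line 3: comment
    "  2  1  0  0  0  0  0  0  0  0999 V2000",    # line 4: counts line
    "    0.0000    0.0000    0.0000 C   0  0",
    "    1.2000    0.0000    0.0000 O   0  0",
    "  1  2  1  0",
    "M  END",
    "> <ID>",
    "CMPD-1",
    "",
    "> <IC50>",
    "12.5",
    "",
    "$$$$",
])

records = read_sdf(SAMPLE)
print(records[0]["name"], records[0]["props"])
```

Records whose counts line cannot be read, or whose atom and bond block is shorter than the counts line promises, are dropped here, mirroring the advice to discard molecules with incomplete structural information.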

2) Removal of Inorganics, Organo-metallics and Mixtures

Most cheminformatics software cannot handle inorganic molecules. In QSAR, for instance, most molecular descriptors are computed for organic compounds, so most software handles only organic compounds. Inorganics, and even organo-metallics, may therefore yield wrong descriptor values or simply be rejected by the software. It is thus recommended to remove all inorganic compounds, and organo-metallics where possible, before descriptors are calculated. For similar reasons, mixtures should be removed before descriptor calculation. Treating mixtures is not straightforward, so unless the active component is known, removing the mixture is advised. For more details, one can go through this paper [Reference: click here]
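A rough way to flag such records, assuming each one carries a molecular formula and a SMILES string, is sketched below. The function names, the short `METALS` list, and the category labels are all illustrative assumptions. The check is deliberately crude: any carbon-free formula is called inorganic, carbon-containing inorganics such as carbonates are not caught, and a metal-containing record may be either an organo-metallic or a metal salt.

```python
import re

# Illustrative, incomplete list of metal symbols (an assumption of this sketch).
METALS = {"Li", "Na", "K", "Mg", "Ca", "Al", "Ti", "Cr", "Mn", "Fe", "Co",
          "Ni", "Cu", "Zn", "Ru", "Rh", "Pd", "Ag", "Sn", "Pt", "Au", "Hg", "Pb"}

# An element symbol (capital letter, optional lowercase letter) and its count.
ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")

def element_counts(formula):
    """Count atoms per element in a formula such as 'C8H20Pb'."""
    counts = {}
    for symbol, number in ELEMENT.findall(formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts

def flag_compound(formula, smiles):
    """Return a coarse label used to decide whether a record is kept."""
    elements = element_counts(formula)
    if "C" not in elements:
        return "inorganic"
    if METALS.intersection(elements):
        return "metal-containing"
    if "." in smiles:  # '.' separates disconnected components in SMILES
        return "multi-component"
    return "organic"

for formula, smiles in [("NaCl", "[Na+].[Cl-]"),
                        ("C8H20Pb", "CC[Pb](CC)(CC)CC"),
                        ("C2H6O.C6H6", "CCO.c1ccccc1"),
                        ("C6H6", "c1ccccc1")]:
    print(formula, flag_compound(formula, smiles))
```

Only records labelled `organic` would pass straight to descriptor calculation; the other labels mark records to remove or inspect by hand.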

3) Removal of salts

A data set can also contain salts, which are a common form of many drugs. Like inorganics and organo-metallics, salts are not handled properly by descriptor-calculating software, so their presence can cause errors in descriptor calculation. Removing salts is usually advised, unless it is feasible to strip the metal counter ions and neutralize the remaining charged species. That in turn requires precise information about the experimental conditions under which the compounds were tested, or about the physicochemical environment within the cells where the compound is active.
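Where stripping counter ions is judged acceptable, one common simplification is to keep only the largest component of each record. The sketch below does this directly on SMILES strings. It is an approximation under stated assumptions: the helper names are hypothetical, atoms are counted with a simple pattern rather than a real SMILES parser, and no neutralization is attempted, so the retained fragment keeps its charge.

```python
import re

# One token per atom: a bracket atom, a two-letter halogen, or a
# single-letter organic-subset atom (aromatic atoms are lowercase).
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def atom_count(fragment):
    """Approximate number of atoms written out in a SMILES fragment."""
    return len(ATOM.findall(fragment))

def strip_counter_ions(smiles):
    """Keep the largest '.'-separated component of a SMILES string."""
    fragments = smiles.split(".")
    return max(fragments, key=atom_count)

print(strip_counter_ions("CC(=O)[O-].[Na+]"))   # sodium acetate
print(strip_counter_ions("C[NH+](C)C.[Cl-]"))   # trimethylammonium chloride
```

A record with a single component is returned unchanged, so the function can be applied to a whole data set before the remaining checks.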

4) Normalization

Normalization of a chemical structure is required when the same functional group can be represented by different structural patterns. For instance, a nitro group can be drawn with two double bonds between the nitrogen and the oxygens (neutral form), with a single bond linking the nitrogen to a protonated oxygen, or with a single bond linking a nitrogen and an oxygen that carry opposite charges. Different representations of the same chemical structure can create serious problems, because the molecular descriptors calculated for them may differ significantly. They also affect duplicate analysis when it is based on descriptor values. Transforming all such functional groups to a standard form is therefore essential.

Other possible errors/checks: wiggly bonds, steric clashes, correct stereochemistry, explicit/implicit hydrogens, etc.

For more details about chemical curation, these articles would be helpful: Click here, Click here

Biological Curation

1. Duplicate Analysis

Identifying and removing duplicates is another very important task, especially for large data sets where manual identification is not feasible. The most suitable way to find duplicates is a similarity search based on 2D descriptors. Note that no set of descriptors is universally recognized as best for duplicate recognition; the efficiency of duplicate identification depends mainly on the type and number of descriptors used to represent the compounds. Merely identifying and removing duplicates is not the real goal, however, especially in QSAR studies. The main goal is duplicate analysis, which proceeds as follows:

For a given pair of duplicate structures, if their experimental properties (of interest) are identical, then one compound should be simply deleted.

However, if their experimental properties are numerically different, then one should consider the following scenarios:

The property value is wrong for one compound, for instance as a result of human error. Another possible and frequently observed reason is that the data were collected from different literature sources, and the same compound was tested in two or more laboratories under different experimental conditions, with variations in the protocol, etc., leading to differences in the measured property. In such cases, one might rectify the error as follows:

If the experimental property values are similar: one of the duplicates can be kept, with the arithmetic mean of the values assigned as its property.

If they are considerably different: supplementary investigation is advised where possible. In QSAR, for instance, instead of removing such compounds we may keep them in a true external set (not the test set) and check the property predicted for them by the developed model, which might help in rectifying the error. Otherwise, such compounds should always be eliminated.
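The decision rules above can be sketched in plain Python once duplicates have been identified. In this hypothetical example each record is a pair of a structure key (any identifier shared by duplicate structures) and a property value. The `tolerance` that separates "similar" from "considerably different" is an assumption that must be chosen for the property at hand, and the values shown are made up.

```python
from collections import defaultdict
from statistics import mean

def resolve_duplicates(records, tolerance):
    """Group records by structure key and apply the duplicate-analysis rules.

    Returns (kept, set_aside): `kept` maps each key to a single value,
    `set_aside` maps keys with conflicting values to all of their values.
    """
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)

    kept, set_aside = {}, {}
    for key, values in groups.items():
        if max(values) - min(values) <= tolerance:
            # Identical or similar values: keep one entry with the mean.
            kept[key] = mean(values)
        else:
            # Considerably different: investigate, hold out, or eliminate.
            set_aside[key] = values
    return kept, set_aside

# Made-up values on a log scale, for illustration only.
records = [("cmpd-A", 6.0), ("cmpd-A", 6.0),
           ("cmpd-B", 7.0), ("cmpd-B", 7.5),
           ("cmpd-C", 5.0), ("cmpd-C", 8.0),
           ("cmpd-D", 4.5)]
kept, set_aside = resolve_duplicates(records, tolerance=0.5)
print(kept)
print(set_aside)
```

Compounds returned in `set_aside` are the candidates for supplementary investigation or for a true external set; the sketch does not decide between those options.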

Example of Duplicate Analysis

2. Detection and Verification of Activity Cliff

Activity cliffs are regions where large changes in activity are observed for relatively small changes in structure. Compounds that have high structural similarity but show a large difference in biological property (or any property of interest) are therefore difficult to understand or interpret using cheminformatics techniques such as QSAR, which are based on chemical similarity.
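This definition can be turned into a simple pairwise screen: flag every pair of compounds whose structural similarity is high while their property values differ widely. The sketch below assumes fingerprints are already available as sets of feature identifiers and uses the Tanimoto coefficient as the similarity measure. The function names, both thresholds, and the toy data are illustrative assumptions.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets: |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def find_activity_cliffs(compounds, min_similarity, min_activity_gap):
    """Return ID pairs that are structurally similar but differ in activity.

    `compounds` is a list of (identifier, feature_set, activity) tuples.
    """
    cliffs = []
    for (id1, fp1, act1), (id2, fp2, act2) in combinations(compounds, 2):
        similar = tanimoto(fp1, fp2) >= min_similarity
        large_gap = abs(act1 - act2) >= min_activity_gap
        if similar and large_gap:
            cliffs.append((id1, id2))
    return cliffs

# Toy data: m1 and m2 share most features but differ by 3 log units.
compounds = [("m1", {1, 2, 3, 4, 5}, 5.0),
             ("m2", {1, 2, 3, 4, 6}, 8.0),
             ("m3", {7, 8, 9}, 8.2)]
print(find_activity_cliffs(compounds, min_similarity=0.6, min_activity_gap=2.0))
```

Pairs flagged this way are candidates only; each should be verified against the underlying structures and measurements before it is treated as a genuine cliff.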

Matched molecular pair (MMP) analysis is one of the most useful methods to detect and verify activity cliffs. An MMP can be defined simply as a pair of molecules that differ by only a minor single-point change. Such a single-point change is termed a molecular (or chemical) transformation. For finding activity cliffs, a transformation is considered significant if this minor change leads to a drastic change (increase or decrease) in the property value.

Example of Activity Cliff analysis using MMPs

For more details about biological curation, this article would be helpful: Click here