One of the most commonly used in silico approaches for toxicity assessment of small datasets is Quantitative Structure-Activity Relationship (QSAR) modeling, which generates predictive models for efficiently predicting the responses of query compounds. However, the reliability of predictions from QSAR models derived from small datasets is often questionable from a statistical point of view. This is because the number of descriptors is large relative to the number of training compounds, which reduces the degrees of freedom of the developed model. To reduce the overall prediction error of a QSAR model, we have proposed the computation of the novel Arithmetic Residuals in K-groups Analysis (ARKA) descriptors. We reduce the number of modeling descriptors in a supervised manner by partitioning them into K classes (K = 2 here), assigning each descriptor to the response class for which its mean normalized value is higher, thus preventing the loss of chemical information. A scatter plot of the data points using the values of the two ARKA descriptors (ARKA_2 vs. ARKA_1) can potentially identify activity cliffs, less confident data points, and less modelable data points. We have applied the ARKA framework [1] to classification modeling of five representative environmentally relevant endpoints with graded responses (skin sensitization, earthworm toxicity, milk/plasma partitioning, algal toxicity, and rodent carcinogenicity of hazardous chemicals). When the models generated using conventional QSAR descriptors were compared with those generated using ARKA descriptors, decision criteria derived from multiple graded-data validation metrics showed the prediction quality of the ARKA-based models to be much better than that of the QSAR-descriptor-based models, signifying the potential of ARKA descriptors in ecotoxicological classification modeling of small datasets.
Additionally, the same holds for the Read-Across approach: Read-Across predictions using ARKA descriptors outperform those generated from QSAR descriptors. The approach can be used not only to assess environmental/ecotoxicity endpoints but can also be extended to other fields such as drug discovery.
Reference: [1] Banerjee A, Roy K, ARKA: A framework of dimensionality reduction for machine-learning classification modeling, risk assessment, and data gap-filling of sparse environmental toxicity data. Environ Sci: Process Impacts, 2024, https://doi.org/10.1039/D4EM00173G
Disclaimer: The figures on this page are taken from the above publication.
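The supervised partitioning step described in the abstract above can be illustrated with a minimal Python sketch. It is not the published ARKA implementation. The min-max normalization, the unweighted sum used to combine descriptors within a group, and all function names (`minmax_normalize`, `assign_groups`, `group_scores`) are assumptions made for illustration; the exact ARKA formula is given in the publication [1].

```python
from typing import List, Sequence


def minmax_normalize(X: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each descriptor column to [0, 1]; constant columns become 0."""
    n_cols = len(X[0])
    lo = [min(row[j] for row in X) for j in range(n_cols)]
    hi = [max(row[j] for row in X) for j in range(n_cols)]
    return [
        [(row[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
         for j in range(n_cols)]
        for row in X
    ]


def assign_groups(X_norm: Sequence[Sequence[float]],
                  y: Sequence[int], k: int = 2) -> List[int]:
    """Assign each descriptor to the response class in which its mean
    normalized value is highest (ties go to the lower class index)."""
    groups = []
    for j in range(len(X_norm[0])):
        means = []
        for c in range(k):
            vals = [row[j] for row, label in zip(X_norm, y) if label == c]
            if not vals:
                raise ValueError(f"no training compounds in class {c}")
            means.append(sum(vals) / len(vals))
        groups.append(max(range(k), key=lambda c: means[c]))
    return groups


def group_scores(X_norm: Sequence[Sequence[float]],
                 groups: Sequence[int], k: int = 2) -> List[List[float]]:
    """One score per compound and group.

    ASSUMPTION: a plain unweighted sum of the normalized descriptors in
    each group; the published ARKA descriptors may be weighted differently.
    """
    return [
        [sum(v for v, g in zip(row, groups) if g == c) for c in range(k)]
        for row in X_norm
    ]


# Toy data (hypothetical): 4 compounds, 3 descriptors, binary response.
X = [[1.0, 10.0, 2.0],
     [2.0, 8.0, 1.0],
     [8.0, 2.0, 3.0],
     [9.0, 1.0, 2.0]]
y = [0, 0, 1, 1]

X_norm = minmax_normalize(X)
groups = assign_groups(X_norm, y)    # class index per descriptor
scores = group_scores(X_norm, groups)  # two columns, one per group
```

Under these assumptions, the two columns of `scores` play the role of the two axes in the ARKA_2 vs. ARKA_1 scatter plot: the many original descriptors collapse into K values per compound.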
This tool calculates multiple ARKA descriptors, as required by the user, for developing regression-based QSAR models. It accounts for the different contributions of the relevant features to different response ranges of the training set within a particular regression model.
Download link (uploaded on 19.12.2024; unrestricted from April 03, 2025)
Reference: Banerjee A, Roy K, The multiclass ARKA framework for developing improved q-RASAR models for environmental toxicity endpoints. Environ Sci Process Impacts, 2025, https://doi.org/10.1039/D5EM00068H
To use this tool, please fill in the form at https://forms.gle/1r3TTy7RmZCQvqBt5 and sign the license agreement form.