1. SprintFamily: Algorithms for gap filling in context-specific metabolic networks
Pavan Kumar S
Abstract
For a better understanding of the metabolism of an organism, it is crucial to build detailed mathematical models. The availability of omics data in the past decade has improved our understanding of metabolism through genome-scale metabolic models (GEMs). To capture the reactions that are active in a given condition, transcriptomic data are integrated into GEMs to build context-specific models (CSMs). A context here could refer to any perturbation that can alter gene expression levels. Based on the expression levels of the genes and the Gene-Protein-Reaction rules, the core reactions known to be active in the given context are identified. However, noisy data, improper thresholding, and the lack of genetic evidence for spontaneous and diffusion reactions often result in an incomplete draft CSM that contains only the core reactions. In this study, we developed three distinct algorithms to rapidly build and analyse CSMs from GEMs. The first algorithm, SprintCore, integrates transcriptomics into a GEM to construct a CSM; the second, SprintCC, checks for consistency and reports blocked reactions in a metabolic reaction network; and the third, SprintTag, labels each reversible reaction as truly reversible or pseudo-irreversible. The SprintFamily of algorithms outperforms previously published algorithms for these tasks.
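To make the consistency-checking step concrete, here is a minimal linear-programming sketch of a blocked-reaction test in the spirit of SprintCC (an illustrative formulation, not the authors' implementation): a reaction is blocked if its flux is zero in every steady state allowed by the stoichiometry and bounds.

```python
# Illustrative blocked-reaction test (not the SprintCC implementation):
# reaction j is blocked if max |v_j| = 0 subject to S v = 0, lb <= v <= ub.
# Assumes the zero flux vector is feasible (0 lies within the bounds).
import numpy as np
from scipy.optimize import linprog

def blocked_reactions(S, lb, ub, tol=1e-9):
    m, n = S.shape
    bounds = list(zip(lb, ub))
    blocked = []
    for j in range(n):
        c = np.zeros(n)
        c[j] = 1.0
        # linprog minimises, so maximise v_j by minimising -v_j
        vmax = -linprog(-c, A_eq=S, b_eq=np.zeros(m), bounds=bounds,
                        method="highs").fun
        vmin = linprog(c, A_eq=S, b_eq=np.zeros(m), bounds=bounds,
                       method="highs").fun
        if abs(vmax) < tol and abs(vmin) < tol:
            blocked.append(j)
    return blocked
```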
2. Self-assembly in Peptides using Machine Learning
T Yahwah Nissi
Abstract
Peptides are chains of amino acids that occur naturally within living organisms. They find applications ranging from tissue engineering and surface coatings to catalysis and sensing. One of the major types of secondary structure that peptides display is the β-sheet, in which extended strands of amino acids are linked laterally by hydrogen bonds. However, there is limited information on which amino acid sequences promote β-sheet formation in peptides. In this work, we use machine learning (ML) to discover an analytical scoring function that captures the propensity of peptides to form β-sheet secondary structures. Furthermore, this work is based on direct experimental data, with the performance of the ML model improved iteratively. Our methodology results in the discovery of a relatively simple function, based on easily accessible peptide features such as hydrophobicity and an empirical β-sheet score, that can accurately predict the β-sheet-forming capability of peptides. The ML-derived analytical function not only improves our understanding of β-sheet formation in peptides but could also guide future experimental work in this area.
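As a sketch of what such an analytical scoring function might look like, the following fits a simple interpretable classifier on hypothetical per-peptide features (placeholder data and feature names; not the discovered function):

```python
# Hypothetical sketch: a beta-sheet propensity score as a weighted sum of
# simple peptide features (toy data, illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: [mean hydrophobicity, empirical beta-sheet score]
X = np.array([[0.8, 1.2], [0.1, 0.3], [0.9, 1.0], [0.2, 0.2]])
y = np.array([1, 0, 1, 0])  # 1 = peptide observed to form beta-sheets

clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.intercept_)  # weights of the analytical score
```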
3. Elucidating Gene-Environment Relationships using Deep Learning
Rajeeva Lokshanan
Abstract
One of the major goals of genetics over the past century has been to understand the genotype-phenotype relationship, i.e. how the genotype of an organism affects its fitness in a given chemical environment. Traditionally, this goal has been pursued by experimentally determining the fitness of different genotypes of an organism (such as Saccharomyces cerevisiae) in various growth conditions. Modern machine learning and deep learning tools can help us understand the inner workings of this relationship while avoiding prohibitively expensive experiments. However, current methods require a separate model for each chemical condition, rendering the process cumbersome. Given this, we develop a novel approach to build a chemical-agnostic deep-learning model that elucidates the organism's fitness in any chemical condition. To achieve this, we use an autoencoder to obtain a numerical representation of a chemical molecule. We use this, along with gene-level variant data, in a deep-learning model to predict the organism's fitness. We show that our model predicts the growth rate of yeast strains more accurately than models in the existing literature.
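A minimal PyTorch sketch of the two-stage architecture described (all layer sizes and the fingerprint-style chemical input are illustrative assumptions, not the authors' exact model):

```python
# Illustrative sketch: an autoencoder compresses a chemical representation
# into a latent vector, which is concatenated with gene-level variant
# features to predict fitness.
import torch
import torch.nn as nn

class ChemAutoencoder(nn.Module):
    def __init__(self, in_dim=2048, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.encoder(x)        # latent chemical representation
        return self.decoder(z), z  # reconstruction loss trains the encoder

class FitnessPredictor(nn.Module):
    def __init__(self, latent_dim=64, n_genes=6000):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + n_genes, 256),
                                 nn.ReLU(), nn.Linear(256, 1))

    def forward(self, z_chem, variants):
        return self.net(torch.cat([z_chem, variants], dim=-1))
```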
4. Understanding Population Specific Genetics of Complex Disorders
Ritwiz Kamal
Abstract
Humans across the world have different genetic makeups, with population, ethnicity, and lifestyle among the prominent deciding factors. Consequently, different populations have different genetic predispositions to complex polygenic disorders such as type 2 diabetes, cardiovascular disorders, and neurological disorders. Traditionally, genome-wide association studies (GWAS) have been used to examine associations between genetic factors and traits of interest, albeit with a concerning under-representation of non-Caucasian populations. There is, therefore, a pressing need to devise methodologies that make use of existing large- and small-scale GWAS to better understand the differential effects of genetic factors among populations. In this work, we make use of PRS-CSx [1], a trans-ethnic polygenic risk score model that leverages GWAS summary statistics, external LD reference panels from multiple populations, and a continuous shrinkage prior for improved posterior SNP effect size estimation. We focus on European and South Asian populations to derive per-SNP and per-gene Bias Scores for various disorders, in order to find SNPs/genes that exhibit consistent or different effects on a disease trait across populations. Beyond the application of PRS-CSx, we also explore ways of obtaining better estimates of per-SNP effect sizes for a target population using multiple auxiliary populations, by extending methods [2,3] that address the transfer of effect sizes from one auxiliary population to one target population.
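For background, the polygenic risk score at the centre of such analyses is a weighted sum of allele counts; PRS-CSx's contribution lies in how the posterior effect sizes are estimated across populations (this is the standard definition, not notation specific to this work):

```latex
\mathrm{PRS}_i \;=\; \sum_{j=1}^{M} x_{ij}\,\hat{\beta}_j
```

where x_ij ∈ {0, 1, 2} is the allele count of SNP j in individual i and β̂_j is its estimated (posterior) effect size.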
5. Predicting the effect of perturbing specific regulatory proteins using an integrated GRN+MN (Gene Regulatory Network and Metabolic Network) modeling framework
Nilesh Anantha Subramanian
Abstract
Building a whole-cell model is a holy grail of computational biology. Towards this end, we focus on modeling two main processes in a cell: gene regulation and metabolism. Gene regulation is the process by which genes switch each other on and off in an interconnected fashion; it is modeled via a gene regulatory network (GRN) of transcription factors (TFs) that regulate target genes (TGs). Metabolism is the process in which metabolites take part in reactions inside a cell, catalyzed by enzymes (which are products of genes); it is modeled via a stoichiometric matrix that captures the production and consumption of metabolites in different reactions, together with a set of linear constraints on reaction fluxes, which is solved via linear programming to compute the reaction fluxes. Many studies model these two processes separately. The influence of the GRN on the MN has also been explored, where the TGs affect the reactions in the metabolic model through the enzymes they code for [1]. In comparison, the feedback from the MN to the GRN is less studied. Metabolites can also regulate the activity of TFs and thereby affect downstream processes [2]. In our work, we primarily focus on predicting the effect of metabolites on TFs through perturbations of an integrated GRN+MN model. We plan to model the GRN through a Bayesian approach involving structure learning via a greedy algorithm and parameter learning/inference through causal intervention. Our novel contribution is to use metabolites to regulate the TFs, creating a feedback model that predicts the effect of perturbations. For this purpose, we chose E. coli as the model organism, since its metabolite-TF regulation, GRN, and metabolic network are well studied [2]. The framework will first be executed on a toy model (for both the GRN and the MN) and then scaled to an E. coli model.
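The metabolic half of this integration is the standard flux balance analysis linear program (a textbook formulation, stated here for concreteness; in integrated schemes the GRN typically acts by tightening the bounds of reactions whose enzymes are switched off):

```latex
\max_{v}\; c^{\top} v
\quad \text{subject to} \quad S\,v = 0, \qquad v_{\min} \le v \le v_{\max}
```

where S is the stoichiometric matrix, v the vector of reaction fluxes, and c selects the objective (e.g., biomass production).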
6. Decoding Gene-Disease Connections: A Breakthrough in Causal Relation Extraction of Biomedical Entities from Biomedical Literature
Nency Bansal
Abstract
Causality, the relation between cause and effect, concerns any event, process, or state that leads to the occurrence or creation of another event, process, or state. Causal relationships can hold among genes, between genes and diseases, and so on. Manual extraction of causal relations among various biomedical entities is time-consuming and impractical, so an automated approach to extract and summarize such information is necessary to facilitate knowledge discovery and support the research community. Detecting these causal relationships can help to detect, assess, understand, and prevent the unwanted side effects of a drug or vaccine while treating a disease. Different types of relations exist between entities, such as association, where one entity depends on the other, and causation, where one entity causes the other. Many works address relation extraction among various entities, but in the biological domain, causal knowledge is more important. Some works, such as BERT-GT [1], K-BERT [2], and causality mining of Sjögren's Syndrome-related factors [3], also address causal relation extraction. Still, they are either limited to extracting causality within a single sentence or to causal relations between chemicals and diseases. In comparison, causal relation extraction between genes and diseases, inter-sentence causal relation extraction, and inter-document causal relation extraction are less studied. In our work, we primarily focus on inter-sentence causal relation extraction among various entities, starting with genes causing diseases. We then plan to extend this to inter-document causal relation extraction, through which we plan to provide a causality score between a gene and a disease. For this purpose, we are curating our own dataset by annotating genes causing diseases in the abstracts of published articles. Our method leverages the latest natural language processing techniques, such as BERT models combined with graph transformers and knowledge bases, to accurately recognize and associate genes and diseases across a diverse range of scientific articles. After fine-tuning the model on a small curated dataset, we aim to test it on large-scale data. We also aim to develop a user-friendly automated tool so that everyone can easily use and access it.
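As a sketch of the classification setup, the following scores a candidate gene-disease sentence with a BERT-style encoder (the checkpoint, entity markers, and label set are placeholders, not the authors' model):

```python
# Hypothetical sketch of causal relation classification with a BERT-style
# encoder; the head here is untrained, so fine-tuning on the curated
# gene-disease dataset is required before the scores are meaningful.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0: no causal relation, 1: causal

text = ("Mutations in [GENE] BRCA1 [/GENE] cause "
        "[DISEASE] breast cancer [/DISEASE].")
inputs = tok(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(-1)
print(probs)
```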
7. Metagenome-based metabolic modelling predicts unique microbial interactions in deep-sea hydrothermal plume microbiomes
Dinesh Kumar K B
Abstract
Deep-sea hydrothermal vents are abundant on the ocean floor and play important roles in ocean biogeochemistry. In vent ecosystems such as hydrothermal plumes, microorganisms rely on reduced chemicals and gases in hydrothermal fluids to fuel primary production and form diverse and complex microbial communities. However, the microbial interactions that drive these complex microbiomes remain poorly understood. Here, we use microbiomes from the Guaymas Basin hydrothermal system in the Pacific Ocean to shed more light on the key species in these communities and their interactions. We built metabolic models from metagenome-assembled genomes (MAGs) and inferred possible metabolic interactions within the community. We highlight possible archaea–archaea and archaea–bacteria interactions and their contributions to the robustness of the community. Cellobiose, D-mannose 1-phosphate, O2, CO2, and H2S were among the most exchanged metabolites. These interactions enhanced the metabolic capabilities of the community through the exchange of metabolites that cannot be produced by any other community member. Archaea from the DPANN group stood out as key microbes, benefiting significantly as acceptors in the community. Overall, our study provides key insights into the microbial interactions that drive community structure and organisation in complex hydrothermal plume microbiomes.
8. A Multi-Tissue metabolic model for PCOS
Prashanth S
Abstract
Poly-Cystic Ovary Syndrome (PCOS) is a complex endocrine disorder that affects 7-10% of women of reproductive age worldwide. The condition involves interplay between several metabolically active tissues, and the cross-talk between these tissues has not been studied intensively. Most clinical studies indicate that PCOS eventually leads to metabolic syndrome. Genome-scale metabolic models (GSMs), on the other hand, aid us in better understanding the metabolism of an organism. Using transcriptomics data, a GSM can be reconstructed for a specific context such as a cell type or a disease. Here, transcriptomic data of adipose tissue, skeletal muscle, granulosa cells, and oocytes have been integrated into the human genome-scale model to extract tissue-specific models for PCOS and non-PCOS conditions. A multi-tissue model is built by integrating these individual tissue-specific models for both PCOS and non-PCOS conditions, and the dysregulation of various metabolic pathways in the PCOS condition is inferred. The multi-tissue model helps gain in-depth knowledge about the tissue cross-talk underlying the condition.
9. A Computational Framework for Genetic Circuit Design
Debomita Chakraborty
Abstract
Synthetic biology is the application of engineering principles to biology. Building synthetic biological systems involves in silico aspects such as mathematical modeling, computer simulations, and algorithm development to predict and optimize the behavior of biological systems. In the current work, this is termed ‘design’ as opposed to ‘implementation’, which encompasses the actual wet lab procedures involved in realizing a synthetic biological system. With breakthroughs such as CRISPR-Cas9 genome editing, the implementation of synthetic biological systems has been revolutionized recently. However, such advances in the ‘implementation’ aspects make it imperative to optimize the process of building a synthetic biological system end-to-end by producing designs that act as reliable blueprints for implementation.
This work establishes a computational design framework specifically for synthetic genetic circuits. The framework is built on simulated data for all possible three-node genetic circuits modeled using ordinary differential equations (ODEs). The gene expression time courses observed in the data are clustered to identify the functionalities achievable by three-node genetic circuits under different parametric conditions. Using the functional clusters, the corresponding circuit topologies are mapped to establish a topology-function relationship. Drawing on these insights, the framework aims to unravel design principles for each circuit functionality cluster.
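As an illustration of the simulation step, here is a minimal ODE model of one such three-node circuit (a repressilator-like negative-feedback ring; the Hill-function form and parameter values are generic placeholders, not the framework's parameter sweep):

```python
# One three-node circuit: each gene represses the next around a ring.
import numpy as np
from scipy.integrate import solve_ivp

def circuit(t, x, beta=10.0, n=2.0, gamma=1.0):
    """x[i] is the expression level of gene i."""
    a, b, c = x
    return [beta / (1 + c**n) - gamma * a,
            beta / (1 + a**n) - gamma * b,
            beta / (1 + b**n) - gamma * c]

sol = solve_ivp(circuit, (0, 50), [1.0, 0.5, 0.1], dense_output=True)
# sol.y holds the gene-expression time courses that would be clustered
```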
10. Potential succession of multidrug-resistant Enterobacter bugandensis on the International Space Station
Pratyay Sengupta
Abstract
Recently, the microorganism Enterobacter bugandensis was isolated from surfaces within the International Space Station (ISS) during the Microbial Tracking-1 (MT-1) mission. E. bugandensis, a species within the Enterobacter genus, is a nosocomial opportunistic pathogen with the potential to cause life-threatening infections, particularly in neonates and immunocompromised patients. Predominantly abundant in healthcare environments, Enterobacter can thrive anywhere from dry surfaces and skin to contaminated fluids (blood, swabs, urine). The presence of E. bugandensis in the confined and isolated environment of the ISS raises concerns about its ability to survive and adapt to extreme conditions. Continued surveillance of the adaptation, succession, and gain of virulence of a pathogenic microorganism in space is therefore critical for designing effective therapeutic strategies. We utilised comparative genomics, metagenomics, and metabolic modelling approaches to identify the unique characteristics of E. bugandensis isolated from the ISS. We investigated the community coexisting with E. bugandensis on the ISS and explored the role of metabolic interactions in shaping the community. We also describe specific genes and phenotypes in the ISS isolates that possibly arose from exposure to the space environment.
11. Comparative genome analysis identifies unique biosynthetic potential of Bacillus pumilus SAFR-032 in extraterrestrial environments
Abhay Bhat
Abstract
Restricting forward and backward contamination during life-detection missions is essential so that novel life forms, if found, can be interpreted as being of extraterrestrial origin. Despite constant efforts to eliminate terrestrial microorganisms from spacecraft and landing vehicles, several microorganisms manage to survive and travel to space. Bacillus pumilus SAFR-032 is an endospore-forming bacterium known for its exceptional resilience to bactericidal conditions abundant on spacecraft-associated surfaces, such as ultraviolet radiation and oxidative stress. This unique resistance profile has made SAFR-032 a model strain for investigations into bacterial spore resistance and, more interestingly, the consequences of its loss. Previously, SAFR-032 spores were exposed to various extreme space and Martian conditions aboard the International Space Station (ISS) for 18 months to explore their radiation tolerance. In this study, we compared the genomic features of the exposed SAFR-032 strains with the ground controls and the B. pumilus type strain ATCC 7061. We observed 34 unique mutations present only in the exposed strains. A detailed pan-genome analysis revealed the presence of a gene family, IS1182, in multiple exposed strains, which is associated with DNA binding and transposition. Further, we observed the presence of surfactin synthase subunit 1 (srfAA) in multiple exposed strains; surfactin is popularly known for its antibiotic effect against competitors and has been reported as an anti-aging compound. The potential of these gene clusters can be used as a target for future experimental research that extends across multiple disciplines.
12. Metagenomic map of the Chennai urban microbiome and antimicrobial resistance
Vijaya yuvaram singh
Abstract
One of the major challenges in modern medicine is the evolution of antimicrobial resistance. The COVID-19 outbreak led to the widespread use of antimicrobials to treat secondary infections, which could well invite future complications such as the worsening of antimicrobial resistance. India was among the countries most severely hit during the second wave of COVID-19; hence there is an immediate need for surveillance and real-time tracking of the spread of antimicrobial resistance and COVID-19 prevalence. In our work, we collected metagenomic samples from major metro stations in Chennai city and studied the microbial diversity. We performed taxonomic and functional profiling of the microbial communities and studied the prevalence of antimicrobial resistance genes in Chennai city.
13. Drug Design with Reinforcement Learning
Abhor Gupta
Abstract
De novo drug design is a computational approach to generating new molecular structures. Most research focuses on building molecular graphs atom by atom, a process that tends to produce invalid and/or non-synthesizable molecules. Moreover, even when the molecules are synthesizable, the path of synthesis is not clear from the generation process. We present ReactionRL, a gym-compatible reinforcement learning environment that simulates chemical reactions on molecules. Using ReactionRL, users can build agents that optimize for their desired chemical properties. Not only does RL allow exploration of the massive space of molecules, but the nature of the environment also enforces validity while revealing the path of synthesis during generation.
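A hypothetical skeleton of such a gym-compatible environment is shown below (the actual ReactionRL interface, state encoding, and reward function are not specified here and will differ):

```python
# Placeholder gym environment: state is a molecule (as a fingerprint),
# actions are reaction templates applied to it.
import gym
import numpy as np
from gym import spaces

class ReactionEnv(gym.Env):
    def __init__(self, n_templates=100, fp_bits=1024):
        super().__init__()
        self.action_space = spaces.Discrete(n_templates)
        self.observation_space = spaces.Box(0, 1, shape=(fp_bits,),
                                            dtype=np.float32)

    def reset(self):
        self.state = np.zeros(self.observation_space.shape, dtype=np.float32)
        return self.state

    def step(self, action):
        # apply reaction template `action`, recompute the fingerprint, and
        # score the desired chemical property (placeholder logic below)
        reward, done = 0.0, False
        return self.state, reward, done, {}
```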
14. Deep Learning for Molecule Generation
Prajwal Sahu
Abstract
Deep learning offers a fresh approach to the inverse molecular design problem. Various frameworks can be used for the de novo generation of molecules with desired properties for specific targets. RNNs have been used successfully for auto-regressive language models and can generate the SMILES string representations of molecules. We use an RNN framework with LSTM cells and then apply transfer learning to improve generation for particular targets. An additional FSR (Functional Sub-Structure Representation) vocabulary, consisting of functional groups generated using a sequential pattern mining algorithm, is added to the model. We demonstrate the efficacy of this model by generating molecules fine-tuned on E. coli inhibitors; the generated molecules demonstrated activity against E. coli. The model was also tested on other datasets such as BACE and BBBP. The use of GPT models for molecule generation is currently being explored.
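A minimal sketch of the generative backbone described, i.e. a token-level LSTM over SMILES (sizes and vocabulary handling are illustrative, not the trained model):

```python
import torch
import torch.nn as nn

class SmilesLSTM(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, state=None):
        h, state = self.lstm(self.embed(tokens), state)
        return self.head(h), state  # next-token logits at every position
```

Such a model is trained with cross-entropy on next-token prediction over a large SMILES corpus, then fine-tuned via transfer learning on a small target-specific set such as known E. coli inhibitors.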
15. Chemically Interpretable Molecular Representation for Property Prediction
M S B Roshan
Abstract
Molecular property prediction from a molecule's structure is a crucial step in drug and novel material discovery, as computational screening approaches rely on predicted properties to refine the existing design of molecules. Although the problem has existed for decades, it has recently gained attention due to the advent of big data and deep learning. On average, one FDA-approved drug emerges from 250 compounds entering the preclinical research stage, which in turn requires screening chemical libraries containing more than 20,000 compounds. In silico property prediction approaches using learnable representations increase the pace of development and reduce the cost of discovery. We propose developing molecule representations based on the functional groups of chemistry to address the problem of deciphering the relationship between a molecule's structure and its properties. Functional groups are substructures in a molecule with distinctive chemical properties that influence its chemical characteristics. These substructures are found by (i) curating functional groups annotated by chemists and (ii) mining a large corpus of molecules to extract frequent substructures using a pattern-mining algorithm. We show that the Functional Group Representation (FGR) framework beats state-of-the-art models on several benchmark datasets while providing experimentalists with explainability between the predicted property and the molecular structure.
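A small sketch of building such a functional-group representation with RDKit (the three SMARTS patterns are a tiny illustrative subset, not the FGR vocabulary):

```python
# Binary functional-group fingerprint via SMARTS substructure matching.
from rdkit import Chem

FG_SMARTS = {
    "carboxylic_acid": "C(=O)[OX2H1]",
    "primary_amine": "[NX3;H2][#6]",
    "alcohol": "[OX2H][CX4]",
}

def fg_fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return [int(mol.HasSubstructMatch(Chem.MolFromSmarts(p)))
            for p in FG_SMARTS.values()]

print(fg_fingerprint("CC(=O)O"))  # acetic acid -> [1, 0, 0]
```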
16. Design of star block copolymers via fusion of molecular dynamics simulation and machine learning
Vijith P
Abstract
Star block copolymers (SBCs) have potential applications as novel surfactants or amphiphiles for chemical transformations and separations. SBCs are macromolecules comprising chains of both hydrophilic and hydrophobic block copolymers, covalently tethered via the hydrophobic blocks to a common node point, giving them an appearance similar to a "starfish". Various parameters of these macromolecules must be tuned to obtain the desired surface properties, including the number of arms, the composition of the arms, and the degree of polymerization of the blocks (i.e., the arm length). In this work, we use molecular dynamics (MD) simulations coupled with machine learning techniques to identify the SBC architecture that minimizes the interfacial tension between polar and non-polar solvents. To overcome the intractable search space arising from the large number of plausible SBC architectures, we used machine learning algorithms such as linear regression, polynomial regression, Gaussian process regression (GPR), Monte Carlo tree search (MCTS), and random forest (RF). The models were validated on a medium-sized design space, and the best model was chosen to identify sequences of large SBCs with low interfacial energies. Overall, this work provides an efficient approach to solving design problems using machine learning and lays important groundwork for future experimental investigation of star copolymer sequences that could lower the interfacial tension between polar and non-polar solvents.
17. Deep Learning for Predicting Chemical Reaction Outcome
Rishabh Shah
Abstract
Chemical reaction outcome prediction is a critical step in designing and optimizing novel chemical pathways. Recent advancements in deep learning have enabled the development of neural network models that are effective in understanding chemical reactions. Two existing approaches are molecular graph-based and natural language-based formulations. This work explores combining these approaches to improve performance. In addition, models like GPT and T5 have shown exceptional results on text generation tasks. The study utilizes the T5 architecture and transfer learning to generate product SMILES from reactants, treated as a seq-to-seq task. The models are trained on the publicly available MIT USPTO dataset. Currently, a constraint-based learning approach is also being formulated to make the model learn underlying constraints such as mass conservation in any reaction.
18. Self-Supervised Pretraining of Transformers for Molecular Property Prediction
M Bharathi Mozhian
Abstract
The prediction of molecular properties continues to be a challenging task with numerous potential applications, notably in drug discovery. The emergence of deep learning, coupled with the availability of large-scale data, has led to powerful tools for constructing predictive models. Transfer learning has had a significant impact in natural language processing and computer vision, indicating its potential for molecular property prediction. In this study, transformers are employed for molecular representation learning, with the aim of capitalizing on their robust downstream task transfer capabilities, as an alternative to the currently prevalent approaches of chemical fingerprints and graph neural networks. The work presents a pre-training procedure for molecular representation learning that utilizes the publicly available PubChem SMILES data. The resulting pre-trained model, ChemBERTa, is subsequently fine-tuned and assessed on various classification tasks from MoleculeNet pertaining to medicinal chemistry applications. The results imply that the model exhibits competitive downstream performance on MoleculeNet and scales well with pretraining dataset size.
19. Automatic detection of Renal Calculi
Abjasree S
Abstract
Kidney stones are a commonly overlooked condition that, if left untreated, can progress to chronic kidney disease, which causes a buildup of toxic fluids and leads to various other complications. This poses a significant public health issue affecting millions of individuals worldwide. To assist doctors in addressing this issue, the RBCDSAI team, in collaboration with GKMC and KGVK Diagnostics, is working towards automating the detection of renal calculi by applying artificial intelligence to CT scan images, aiding doctors in making quick decisions. However, our team faces challenges due to the limited number of cases and significant data imbalance, both class-wise and pixel-wise. To mitigate these imbalances, we have implemented several pipelines. Furthermore, we propose a pipeline that selects slices containing kidneys, identifies the region of interest using HU thresholding, and detects whether the slices contain kidney stones, achieving a high recall.
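To illustrate the HU-thresholding step, a minimal NumPy sketch is given below (the rescale parameters and the ~130 HU cutoff are typical values for calcified material, assumed here rather than taken from the team's pipeline):

```python
import numpy as np

def hounsfield(raw, slope=1.0, intercept=-1024.0):
    """Convert raw CT pixel values to Hounsfield units (DICOM rescale)."""
    return raw * slope + intercept

def stone_candidates(hu_slice, kidney_mask, lo=130.0):
    """Stones are strongly attenuating: flag voxels above the cutoff
    inside the kidney region of interest."""
    return (hu_slice > lo) & kidney_mask
```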
20. Eye Gaze Tracking Using Lensless Images
Akash Patil
Abstract
Eye gaze tracking is an emerging technology with multiple applications in psychology, marketing, human-computer interaction, and more. Traditional near-eye tracking systems use bulky lenses to focus and capture images of the eye, and deep learning networks are used to predict eye-gaze vectors from these images. Recently, there have been developments in reconstruction from lensless images using phase masks instead of lenses. This project aims to reduce the form factor of eye-tracking systems; we use a lensless camera with phase masks instead of bulky lenses for this purpose. We first tested a CNN-based network trained on simulated lensless eye-tracking datasets. We tested different strategies, including passing multiple images to the network, using segmentation, and using polynomial regression for predicting the gaze vector. We achieved errors of the order of 1e-3 in the coordinates predicted by the gaze vector at a distance of 1 m from the origin. The future scope of this project includes accelerating prediction by incorporating rolling shutter concepts with lensless imaging.
21. Edge-based Artificial Intelligence for Real-time Ophthalmic Image Analysis and Diagnosis
Vyshnav P
Abstract
The detection of eye diseases is critical for timely treatment. For example, diabetic retinopathy is a leading cause of blindness worldwide, affecting millions of people with diabetes. Early detection and timely intervention can prevent vision loss and improve patient outcomes. Fundus images have become a valuable tool in this regard, as they can provide insights into the underlying conditions. However, the analysis of these images can be time-consuming and resource-intensive because it requires trained personnel, and the lack of instruments and trained personnel leads to blindness in rural and remote areas. Recently, portable and mobile devices for capturing fundus images have been developed. Hence, developing edge-deployable deep-learning approaches that automatically grade eye conditions from fundus images is useful for faster screening in remote areas. In this work, we present a deep learning-based approach for the automated detection of diabetic retinopathy from fundus images. We examine the different types of models used for this task, including custom models, ResNet34, ResNet50, AlexNet, and ResNet101. These models are initialized with pre-trained weights, trained on the Kaggle diabetic retinopathy dataset and the Kaggle APTOS dataset, and further tested on the Messidor dataset. We also extended the model to other common eye conditions (cataracts, glaucoma, etc.) by training it on a cataract dataset and the JSIEC dataset (1000 fundus images with 39 categories). In addition to the high-parameter models, we tried models that can run on smartphones, such as MobilenetV3_small and MobilenetV3_large. We performed an interpretability analysis on the diabetic retinopathy models to find the important features the models use to predict the output. The best model achieved an accuracy of 93% and a sensitivity of 96.63% on the APTOS dataset, and an accuracy of 89% on the combined dataset. The area under the ROC curve is 0.978 for the APTOS dataset and 0.913 for the combined dataset.
22. Edge-based Deep Learning Approach for Automatic Malaria Diagnostic from Slide Images using Mobile Devices
Shinde Shubham Sunil
Abstract
In remote and rural areas, treatment of malaria is often delayed due to poor and slow diagnosis. Although the malaria diagnostic test is simple, traditional microscopic examination of blood smears is time-consuming and requires trained personnel. Deep learning (DL) can assist in automatically detecting malaria by analysing slide images, but computational requirements can make implementing such algorithms difficult in resource-constrained environments. The main objective of our research is to develop a DL approach for automatically detecting malaria in slide images obtained from smartphone-based microscopes, running on edge devices (Edge AI) such as smartphones. For this, we used the YOLOv7 object detection model on two different datasets of microscopic RBC images and applied image augmentation techniques to improve model performance during training. Our findings show that the fine-tuned YOLOv7 model accurately classifies the type of RBC, with a mean average precision (mAP) over 0.84 on both datasets.
23. India Data Commons: A unified knowledge DB for all Indian data
Senthamizhan V
Abstract
At a time when data informs our understanding of so many issues, from public health and education to the evolving workforce and more, access to data has never been more important. Yet our ability to use data to understand our world is frequently hampered by the difficulties of working with data: the effort of finding, cleaning, and joining datasets effectively limits who gets to work with data.
India Data Commons (DC) addresses this challenge head-on, performing the tedious tasks of curating, joining, and cleaning datasets at scale so that data users don't have to. The result is a set of large-scale, cloud-accessible APIs serving cleaned and normalized data originating from some of the most widely used datasets, including those from the Population Census and Indian ministries. By bringing all Indian data under one schema, we facilitate the comparison of statistical variables across geographical regions.
India Data Commons is an initiative by IIT Madras. Our knowledge database can be accessed through datacommons.iitm.ac.in.
24. Learning Dynamics of Soft and Hard Attention
Rahul Vashisht
Abstract
Attention models are typically learned by optimizing one of three standard loss functions: soft attention, hard attention, and latent variable marginal likelihood (LVML) attention. All three paradigms are motivated by the same goal of finding two models: a 'focus' model that 'selects' the right segment of the input and a 'classification' model that processes the selected segment into the target label. We observe a unique signature of models learned using these paradigms and explain it as a consequence of the evolution of the classification model under gradient descent when the focus model is fixed.
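One common way to write two of these objectives, for a focus model α(s|x) over input segments s and a classification model p(y|s), is (background notation, not necessarily the paper's):

```latex
\mathcal{L}_{\text{soft}} = \ell\!\Big(f\Big(\sum_{s} \alpha(s\mid x)\, s\Big),\, y\Big),
\qquad
\mathcal{L}_{\text{LVML}} = -\log \sum_{s} \alpha(s\mid x)\, p(y\mid s)
```

Hard attention instead samples s ~ α(·|x) and optimizes the expected classification loss with a score-function or straight-through gradient estimator.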
25. Fast Kernel Methods
Ritesh Khan
Abstract
Kernel methods and estimation are very popular tools in statistical learning. Many kernel method algorithms (e.g., the Support Vector Machine) require matrix-vector products to be performed multiple times. The naive matrix-vector product costs O(N^2), so the naive implementation of these algorithms is very costly, and the cost grows with the dataset size N and the underlying dimension (large datasets with many features are very common). We propose a fast matrix-vector product algorithm (HODLRdD) to overcome this problem. HODLRdD performs matrix-vector products in almost linear time; to be precise, the cost is O(N log^α(N)), where α > 0. There are many existing fast matrix-vector product algorithms, the most famous being the Fast Multipole Method (FMM) and Treecode. However, for higher-dimensional problems, our HODLRdD algorithm can be a better option, since it does not suffer from the curse of dimensionality like FMM or Treecode. HODLRdD also performs very well in parallel environments. We apply our HODLRdD algorithm to accelerate various kernel methods.
26. Parametric optimisation using a data-driven model to improve the energy efficiency of vanadium redox flow batteries
Ram Kishore S
Abstract
Renewable energy utilisation has become indispensable for attaining carbon-zero goals in the near future. Given the intermittent nature of renewable energy sources, it is vital to have an energy-efficient and steadfast storage medium to maintain a stable energy supply. Vanadium redox flow batteries (VRFBs) are promising energy storage devices that can be utilised effectively in grid storage applications. Despite their advantages, VRFBs still struggle to compete with lithium-ion batteries in round-trip efficiency. The round-trip efficiency of a VRFB can be maximised if all the physical and chemical parameters are optimised simultaneously, which would normally require several experimental trials to determine the individual and collective effects of the parameters. Hence, in this work, we built a data-driven model for VRFBs using literature data and refined it until it assessed the round-trip (energy) efficiency with 90% accuracy within an error range of ±2%. Finally, we used the data-driven model to optimise the parameters, improving the energy efficiency to around 97-98% for a 4 cm² active-area cell at a current density of 50 mA cm⁻². Researchers can utilise this approach for any battery that needs performance improvement with minimal experimental effort.
27. Combinatorial Optimization using GNNs
Aaradhy Sirothia
Abstract
Graph neural networks (GNNs) are being utilised to solve canonical NP-hard problems such as graph partitioning and the minimum vertex cover problem. The graph partitioning problem is solved in a multi-stage approach to determine optimal locations for offline flow measurements, minimizing the associated measurement cost for leak detection. The minimum vertex cover problem is solved to identify optimal sensor placement locations within a water distribution network. To solve these NP-hard problems, a relaxation strategy is applied to the problem Hamiltonian, resulting in a differentiable loss function that is used to train a GNN; the obtained results are then projected back to integer values. The GNN-based optimizer outperforms existing solvers in time efficiency for large numbers of variables while producing comparable results. The same framework is also being explored to solve the set cover problem for sensor selection.
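For the minimum vertex cover case, the relaxed Hamiltonian used as the loss can be sketched as follows (the penalty weight P is a hypothetical choice; the node probabilities p come from a sigmoid over the GNN outputs):

```python
# Relaxed minimum-vertex-cover Hamiltonian as a differentiable loss:
# minimise the (soft) cover size plus a penalty for edges whose two
# endpoints are both left out of the cover.
import torch

def mvc_loss(p, edges, P=2.0):
    """p: (n,) relaxed node probabilities in [0, 1];
    edges: (m, 2) long tensor of endpoint indices."""
    cover_size = p.sum()
    uncovered = ((1 - p[edges[:, 0]]) * (1 - p[edges[:, 1]])).sum()
    return cover_size + P * uncovered
```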
28. A Software Tool for Identifying Kinetic Models from Concentration Data
Vignesh Kumar S
Abstract
This project introduces PyKineMod, a toolbox for determining kinetic model structures and the corresponding parameters from concentration data for homogeneous reaction systems with inlet and outlet streams. The software implements the incremental identification approach and thereby overcomes the combinatorial complexity of model structure discrimination for multiple reactions. The tool solves the underlying optimization problems using incremental approaches, supporting both rate-based and extent-based parameter estimation techniques. The package includes support for data preprocessing and for determining parameter confidence levels using the two approaches mentioned. It is implemented in Python and utilizes the standard NumPy, SciPy, and pandas packages, allowing for a simple, Pythonic way to determine kinetic parameters. The optimization problem is solved using SciPy's optimization solvers, and confidence intervals are obtained using the bootstrapping technique.
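As a toy example of the kind of estimation the package automates, the following fits a first-order rate constant from simulated concentration data (PyKineMod itself handles multiple reactions, inlet/outlet terms, and bootstrapped confidence intervals):

```python
# Fit k for A -> B from noisy concentration measurements of A.
import numpy as np
from scipy.optimize import curve_fit

t = np.linspace(0, 10, 25)
cA_true = 2.0 * np.exp(-0.5 * t)                       # k = 0.5, cA0 = 2.0
cA_meas = cA_true + np.random.normal(0, 0.02, t.size)  # noisy measurements

model = lambda t, cA0, k: cA0 * np.exp(-k * t)
(cA0_hat, k_hat), _ = curve_fit(model, t, cA_meas, p0=[1.0, 0.1])
print(cA0_hat, k_hat)
```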
29. Data extraction for road safety using deep learning
Shubha Bamney
Abstract
Road inventory data is collected manually to identify the road environment near crash locations. In this study, we automatically detect road medians and the presence of intersections from aerial images. Data were collected, both manually and automatically, from two cities (Chennai and Trichy) in the state of Tamil Nadu, India, using the Google Application Programming Interface (API). An image recognition model is built using convolutional neural networks (CNNs). Multiple models were built using the ResNet architecture, with training data drawn either from the same city as the test set or from a mixed dataset of both cities, to test the model's generalizability. F1 scores are used to rate model performance. The results reveal that the model's F1 scores increase when the training data comprises images from both cities. This work makes two contributions: first, it describes how CNNs can be utilized for road safety research; second, the proposed dataset can be used in future to build models for other cities so that manual data collection can be minimized.
30. A Software tool for Identifying Kinetic Models from Spectral Data
Dumala Varshith Kumar
Abstract
KineParEs is a Python package that is used to estimate the kinetic parameters of a chemical reaction model from spectral reaction data. In this project, kinetic parameters are estimated directly from spectral data without using the pure component spectra. The optimization problem is solved using various optimizers with random initializations to obtain the minimum error. This tool can select the best reaction model from a set of given models and can estimate the kinetic parameters in that model. The software package additionally comes with a few data preprocessing methods and visualization plots for analyzing the data. It is entirely implemented in Python and uses standard NumPy, SciPy, and scikit-learn libraries.
31. Error-in-Variables Algorithm for Parameter Estimation in Water Distribution Networks
Akshaya Venkataramanan
Abstract
Estimation of the friction loss coefficient and minor losses is of crucial importance in the operation and control of Water Distribution Networks (WDNs). Inaccurate estimates of these parameters can cause the pressure loss across a pipeline to be overestimated or underestimated, which can lead to the fluid not flowing across the pipeline or to economic infeasibility. Moreover, solving network problems such as leak detection, blockage identification, and scheduling for equitable distribution requires the control of on-off valves and pumps. Before applying any simulation model, it is of utmost importance that the model represents the actual characteristics of the real WDN under consideration. This is done by calibrating the model using available flow and pressure measurements, which requires the minor and major loss coefficients. Conventional algorithms that use Ordinary Least Squares (OLS) to compute the parameter estimates can give inaccurate results because they account for errors only in the dependent variables. Since the demand flow rates measured by the sensors are more susceptible to measurement errors, we minimize the sum of squares of this error vector subject to constraints representing the model equations, given by the Hazen-Williams equations. This optimization problem is decomposed into several sub-problems, one per control valve state, by fixing the parameter vector and updating it until the measurement errors are minimized. The model equations are framed under the assumption that all pipelines and exits in the considered topology have the same roughness and minor loss coefficients. The algorithm is implemented on a branched network with a source node and eight demand points on the Reconfigurable Test-bed for the Control and Operation of Water Distribution Networks (RTCOP-WDN) facility.
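For reference, one common SI form of the Hazen-Williams head-loss relation that enters the model constraints is (a standard formula, with C the roughness coefficient):

```latex
h_f \;=\; \frac{10.67\, L\, Q^{1.852}}{C^{1.852}\, D^{4.87}}
```

where h_f is the head loss (m), L the pipe length (m), Q the flow rate (m³/s), and D the pipe diameter (m).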
32. Active Learning with Human Heuristics: An Algorithm Robust to Labelling Bias
Sriram R
Abstract
Active learning enables prediction algorithms to achieve better performance with fewer data points by adaptively querying an oracle for output labels. In many instances, the oracle is a human, and according to the behavioural sciences, humans provide labels by employing judgmental heuristics. How would modelling the oracle with such heuristics affect the performance of active learning (AL) algorithms? We investigate three human heuristics (a simplified linear model, fast-and-frugal trees, and tallying) combined with four active learning algorithms (entropy-based, multi-view learning, density-based, and a novel density-based algorithm) and apply them to five datasets from domains such as health, wealth, and sustainability. We replicate findings from the ecological-rationality literature on the match between dataset characteristics and heuristic performance. A first novel finding is that if a heuristic leads to significant labelling bias, the performance of active learning algorithms drops significantly, sometimes below random sampling; it is therefore key to design active learning algorithms robust to labelling bias. Our second contribution is a novel density-based algorithm that achieves an overall median improvement of 31% over current algorithms when the oracle has a significant labelling bias. In sum, the design and benchmarking of active learning algorithms should incorporate the modelling of human judgmental heuristics.
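As a sketch of one of the strategies studied, the entropy-based query rule can be written in a few lines (generic form; the paper's implementation details may differ):

```python
import numpy as np

def entropy_query(proba):
    """proba: (n_unlabelled, n_classes) predicted class probabilities.
    Return the index of the most uncertain point to send to the oracle."""
    H = -np.sum(proba * np.log(proba + 1e-12), axis=1)
    return int(np.argmax(H))
```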
33. A LIDAR-based Single Sensor Collision Warning Framework for Indian Urban Traffic
Prajwal Shettigar J
Abstract
Driverless vehicle implementation in Indian traffic settings is complex due to their non-homogeneous, laneless nature and non-uniform road conditions. Additionally, autonomous vehicles are a new technology that is unaffordable for most of the population. Therefore, this research proposes an affordable, single-sensor collision warning framework for Indian urban traffic that provides warnings and suggestions to the user to avoid potential collisions. The framework uses 3D light detection and ranging (LIDAR) to sense the surroundings of the subject vehicle. The LIDAR point cloud data is processed through a combination of machine learning algorithms to detect obstacles and provide early warnings to the driver. The framework is validated using Indian urban traffic data collected with a test vehicle carrying onboard LIDAR and camera modules. The system tracks the state of surrounding obstacles and calculates the time to collision (TTC) from these estimates. Warnings are assessed based on the TTC and the subject vehicle's condition, estimated using the normal distribution transform (NDT) algorithm. The trajectories of nearby vehicles and pedestrians are predicted using motion models to provide ahead-of-time warnings even when obstacles and the subject vehicle are not on the same course. To avoid overly frequent warnings, the system uses the mode of the instantaneous warnings, making it more user-friendly. The proposed system has the potential to enhance road safety and reduce accidents caused by human error.
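The time-to-collision underlying the warnings takes the standard form, for an obstacle at range r closing at relative speed v_rel > 0 (a warning is raised when TTC falls below a chosen threshold):

```latex
\mathrm{TTC} \;=\; \frac{r}{v_{\mathrm{rel}}}
```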
34. Portable and Non-Invasive Quality Assessment of Fruits using Pre-calibrated Spectral Sensor
Krithika Padmanabhan
Abstract
Fruit quality is assessed based on various internal and external parameters. °Brix is a commonly measured parameter that indicates the total soluble solids (TSS) present in fruits. The total soluble solids in a fruit comprise not only sugars such as fructose, sucrose, and glucose but also organic acids and other compounds like minerals, fats, and amino acids. The measured °Brix is thus a direct indication of the sweetness as well as the taste and quality of the fruit. To predict the °Brix of fruits, we have developed an ML-based Brixmeter using a pre-calibrated spectral sensor chipset and an Arduino controller. The spectral sensor used is the SparkFun Triad Spectroscopy sensor (built around the AMS AS7265x sensors), which measures the interaction of 18 specific wavelengths (ranging from 400 nm to 940 nm) with the sample; the measurement principle is reflectance-mode spectroscopy. A machine learning model (partial least squares) relating the spectra collected with the prototype to the °Brix values measured with a standard refractometer has presently been developed for apples. The device can be extended to predict °Brix for other types of fruits, as well as their physiological maturity.
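A minimal sketch of the calibration step, mapping the 18-channel reflectance spectra to reference refractometer readings with partial least squares (toy data shown here; the deployed model is trained on spectra from the prototype):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

X = np.random.rand(40, 18)  # 18 spectral channels per sample (placeholder)
y = 10 + 5 * X[:, 0] + np.random.normal(0, 0.1, 40)  # reference Brix values

pls = PLSRegression(n_components=4).fit(X, y)
print(pls.predict(X[:3]))
```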
35. Control of Bioprocesses using NIR Spectroscopy and Recursive Calibration Updates
Keerthana. C
Abstract
Accomplishing Quality by Design (QbD) through Process Analytical Technology (PAT) is critical to ensure product quality and safety in bioprocess industries. However, realizing an optimal control combined with PAT is still challenging due to the inherent complexity of bioprocesses. Near-infrared (NIR) spectroscopy has been widely used for process monitoring, whereas its application in feedback control is scarce. In this work, in-situ NIR spectroscopy is used to rapidly measure spectra for performing feedback control of glucose concentration in a bioreactor for L. lactis NZ9000 fermentation. The concentration predictions are obtained from the calibration model built using partial least-squares (PLS) regression. A systematic approach to updating NIR calibration models using historical data has been proposed to address the batch-to-batch variations, a characteristic of biological processes. Subsequent updates in the calibration model have significantly improved the prediction of the metabolites and biomass concentration. L. lactis NZ9000 is used to demonstrate the proposed approaches in experimental studies. With the model-based controller design and recursive update of the calibration model, a tight control (within 2 % of the set-point) of glucose concentration during the fermentation is achieved. This can be further extended for performing advanced process control using NIR spectroscopy.
36. Proxy Parameter Correlation in Data Driven Monitoring – A Sensing Paradigm for Flow Measurements in WDNs
Rohit Raphael
Abstract
In this study, we demonstrate the efficiency and feasibility of sensing by proxy for estimating the flow of water in pipelines. We propose a system that relates proxy measurements to the unknown flow rates inside pipelines in a water distribution network (WDN). The flow rate inside a pipeline is inferred from readings given by sensors for other crucial parameters, such as level and vibration. This alternate approach results in a parsimonious and accurate method; economic feasibility and the non-invasive nature of installation and maintenance are added advantages, and it requires minimal infrastructure and labour expenses. Here we consider two proxies, pipe vibration and wave turbulence on the water surface, for identifying the flow of water in different parts of the network. By sensing the vibration of the pipelines carrying water, we can identify and quantify the flow of water in the pipelines. This is achieved using gyroscope sensors, which measure the angular velocity and orientation of an object and thereby translate the small movements caused by water flowing inside the pipelines into useful data. The ultrasonic level sensors used to measure the water level inside tanks give noisy data due to surface turbulence while the tank is being filled; if this component is isolated from the level data, we can estimate the filling time. Combining this information with the rated flow of the pumps and uniform filling rates, we can estimate the flow rates and volumes of water being transferred in the WDN.
37. Optimal Sensor Placement for Monitoring and Leak Detection in Water Distribution Networks
Abhirami Elizbeth Prathap
Abstract
Water distribution systems consist of a sparsely instrumented, very complicated network of pipes and junctions. To be managed and operated, any such system must effectively measure and regulate each process variable; these process variables are measured by sensors. Modern enterprises wishing to utilise technology to measure significant process variables for a number of applications face the issue of sensor placement: when performing maintenance and monitoring tasks, such as measuring flow rates, temperature, or pressure, the location of the sensors is crucial. Leaks, caused by ageing and poor maintenance, can also lead to significant water loss. Since real water distribution networks are extremely complex, effective and reliable methodologies are required to support the design of their control systems. This work focuses on developing an efficient approach for leak detection and isolation in water distribution networks. The optimal placement of pressure sensors for the complete detection of any possible leak in the network is determined, and a mathematical proof is provided. The structural model framework approach is used for sensor placement, which scales easily to large networks. The method developed is also combined with existing algorithms for complete observability of the water network in order to meet user requirements. Focus is also given to developing an end-user-friendly web application for sensor placement in water networks.
38. ML for Onsite Food Quality Detection and Authentication using Spectroscopy
Aniruddha R Gandhewar
Abstract
Non-destructive techniques to evaluate food quality include ultrasound, visible and near-infrared (VIS-NIR) spectroscopy, hyperspectral imaging, and many more. VIS-NIR spectroscopy coupled with chemometrics (a data-driven, machine-learning approach) is the most widely used technique and has been effective in evaluating food quality; it is suitable for practical use in settings that require cost-effective and portable systems. The high-dimensional nature of datasets in this field poses a significant challenge for machine learning, often referred to as the curse of dimensionality, but techniques for effectively analysing such high-dimensional data exist. In this study, we examine the two primary machine learning tasks, classification and regression, on spectroscopic datasets. We investigate both traditional machine learning and contemporary deep learning methods: our traditional machine learning methodology relies on denoising and dimensionality reduction techniques, whereas our deep learning approach adopts the few-shot learning paradigm.
39. Reconfigurable Test bed for the control and operation of Water Distribution Network (WDN)
SRI HARI PRASATH R
Abstract
With global stress on the paucity of water resources and their myriad uses, the optimal operation of the public infrastructure responsible for transporting potable water to consumers, often operated as intermittent Water Distribution Networks (WDNs), becomes crucial. This extends to ensuring an equitable supply of water to the end nodes under different network operational conditions and to predicting consumption patterns. The test bed allows different control strategies and operational scenarios to be evaluated in a controlled environment. The physical test bed consists of a network of pipes, valves, and different types of sensors, such as flow, level, and pressure sensors. The LabVIEW software platform is used to gather data and simulate a real-time WDN on a scaled-down test bed for different network topologies and operational scenarios; it was found to be flexible and extensible, allowing the integration of new models and control strategies. The test bed is intended for evaluating the performance of different control strategies and operational scenarios and for developing new approaches to the management of water distribution networks. It will also allow new technologies and algorithms to be tested before implementation in real-world networks.
40. LOW COST NON-INVASIVE CAPACITANCE BASED LEVEL SENSING SOLUTIONS
Subhashree B
Abstract
Capacitive sensors are an appealing alternative to the conventional level detection techniques employed in liquid tanks and containers. A capacitive sensing system uses flexible, affordable sensors that can be attached to the outside of containers. Because they make no contact with the liquid or solid being measured, the capacitive sensors are unaffected by turbulence or disturbance of the water inside the tank. The sensor works on the fringing effect: in a parallel-finger topology, the effective dielectric constant between the capacitor plates varies as the liquid level increases. By actively shielding the capacitor plates, the impact of parasitic capacitance and external interference can be reduced. The out-of-phase technique used in this system makes it independent of environmental factors such as temperature, humidity, and the type of liquid used, while maximising the signal-to-noise ratio and the system's overall robustness. A breakout board designed around the FDC1004 IC converts capacitance to digital values, which are acquired over I2C communication. The capacitive sensor was tested and studied with different sensor sheet materials, such as copper and aluminium, and with different wire configurations. The sensitivity of the capacitive sensors was studied by adjusting the width of and spacing between the capacitance strips. This work is being further extended by implementing the capacitive sensors in a scaled-down version of a water distribution network for water level detection.
41. Real time optimization of continuous flow reactors by using low-field NMR spectroscopy
Hemalaxmi
Abstract
Continuous flow chemistry has become one of the most important techniques in synthetic chemistry through the use of flow micro-reactors. We propose NMR-integrated continuous-flow micro-reactors for the optimization of chemical reactions. A cost-efficient, in-house syringe pump system was developed from scratch, with 3D-printed hardware components, an electronic control circuit, and LabVIEW automation, and was tested for precise reagent delivery to the flow reactor. The bench-top NMR spectrometer was converted into an online monitoring tool operated via a remote-control API built on LabVIEW for automated performance. The automated syringe pump and online NMR spectroscopy, coupled with microflow reactors, enable real-time optimization of chemical reactions [1,2]. This technology increases the safety, time efficiency, and accuracy of the reaction process and thereby reduces the number of experiments, the time, and the cost involved in optimization operations.
42. Heuristic Algorithm for Informative State Discovery for Scheduling in Water Distribution Networks
Rajasundaram Mathiazhagan
Abstract
In many localities with limitations on the availability of water, Water Distribution Networks (WDNs) supply water only for a few hours in a day. In such intermittent water distribution networks, inefficient operational policies can lead to inequitable supply of water. Scheduling is a complex problem that involves determining the optimal time and sequence of water supply, subject to constraints in the operation of the network, to ensure equitable supply of the available water. The scheduling problem makes use of the flow measurements from the real network rather than a hydraulic model of the WDN. This reduces the effort required for model development and the errors arising out of it. However, the number of measurements to be taken rises exponentially with the number of demand nodes in the network. To reduce the number of measurements required, a heuristic algorithm is developed for choosing measurements that are informative. In each step of this iterative procedure, a schedule is generated using the available measurements and based on this, a new system state that has to be measured is identified. The iteration is completed when acceptable performance is achieved. The effectiveness of this algorithm is demonstrated by implementing it on the Reconfigurable Test-bed for the Control and Operation of Water Distribution Networks (RTCOP-WDN) facility.
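The iterative procedure can be summarised in a few lines; in this sketch the four callables are placeholders for the network-specific pieces (the test-bed measurement, the scheduling optimiser, the equity score, and the heuristic selection rule itself), none of which are specified here.

```python
def informative_state_discovery(measure, generate_schedule, evaluate,
                                most_informative_state, initial_states, tol):
    """Grow the measurement set until the schedule performs acceptably.

    All four callables are placeholders: `measure` queries the test bed,
    `generate_schedule` solves the scheduling problem from the data gathered
    so far, `evaluate` scores supply equity, and `most_informative_state`
    is the heuristic selection rule.
    """
    measured = {s: measure(s) for s in initial_states}  # initial flow measurements
    while True:
        schedule = generate_schedule(measured)          # schedule from data so far
        if evaluate(schedule) >= tol:                   # acceptable performance?
            return schedule, measured
        s_new = most_informative_state(schedule, measured)  # heuristic pick
        measured[s_new] = measure(s_new)                # take one more measurement
```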
43. Detecting Vehicles on the Edge: Knowledge Distillation to Improve Performance in Heterogeneous Road Traffic
Manoj Bharadhwaj
Abstract
The drastic growth in the number of vehicles in the last few decades has necessitated significantly better traffic management and planning. To manage traffic efficiently, traffic volume is an essential parameter. Most methods solve the vehicle counting problem under the assumption of state-of-the-art computation power. With the recent growth in cost-effective Internet of Things (IoT) devices and edge computing, several machine learning models are being tailored for such devices. Solving the traffic count problem on these devices will enable us to create a real-time dashboard of network-wide live traffic analytics. We propose a Detect-Track-Count (DTC) framework to count vehicles efficiently on edge devices. The proposed solution aims at improving the performance of tiny vehicle detection models using an ensemble knowledge distillation technique. Experimental results on multiple datasets show that the custom knowledge distillation setup helps better generalize a tiny object detector.
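A minimal form of the ensemble distillation objective, sketched in PyTorch under the usual temperature-softened formulation; the detection-specific terms (box regression, objectness) used in the full DTC setup are omitted.

```python
# Minimal ensemble-distillation loss: the student matches the averaged,
# temperature-softened predictions of several teachers.
import torch
import torch.nn.functional as F

def ensemble_kd_loss(student_logits, teacher_logits_list, labels,
                     T=4.0, alpha=0.7):
    # Soft target: average of the teachers' tempered distributions
    soft_target = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list]).mean(dim=0)
    # KL between the student's tempered log-probs and the ensemble target
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  soft_target, reduction="batchmean") * T * T
    # Usual hard-label cross-entropy on the ground truth
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```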
44. Are Models Trained on Indian Legal Data Fair?
Sahil Girhepuje
Abstract
Recent advances and applications of language technology and artificial intelligence have enabled much success across multiple domains such as law, medicine, and mental health. AI-based language models for tasks such as judgement prediction have recently been proposed for the legal sector. However, these models are rife with encoded social biases picked up from the training data. While bias and fairness have been studied across NLP, most studies locate themselves within a Western context. In this work, we present an initial investigation of fairness from the Indian perspective in the legal domain. We highlight the propagation of learnt algorithmic biases in the bail prediction task for models trained on Hindi legal documents. We evaluate the fairness gap using demographic parity and show that a decision tree model trained for the bail prediction task has an overall fairness disparity of 0.237 between input features associated with Hindus and Muslims. Additionally, we highlight the need for further research in the avenues of fairness and bias in applying AI in the legal sector, with a specific focus on the Indian context.
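The reported fairness gap can be understood through demographic parity: the difference in positive-outcome rates between groups. A toy computation, with illustrative column names, might look like this:

```python
# Demographic-parity gap: absolute difference in positive-outcome (bail
# granted) rates between the two groups. Column names are illustrative.
import pandas as pd

def demographic_parity_gap(df, group_col="religion", pred_col="bail_granted"):
    rates = df.groupby(group_col)[pred_col].mean()  # P(y_hat = 1 | group)
    return abs(rates["Hindu"] - rates["Muslim"])

df = pd.DataFrame({"religion": ["Hindu", "Hindu", "Muslim", "Muslim"],
                   "bail_granted": [1, 1, 1, 0]})
print(demographic_parity_gap(df))  # 0.5 on this toy data
```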
45. BioBigBird: Leveraging Entire Articles for Biomedical Language Understanding
Vasudev Gupta
Abstract
Advances in pretraining large neural language models, such as BERT, have led to significant strides in natural language processing. As a result, there is growing interest in applying pretraining in the biomedical field. Existing approaches, such as BioBERT and PubMedBERT, typically focus on shorter sequences, such as article abstracts, and fail to capture dependencies and knowledge across sections. Recent studies in the general language domain have shown that longer sequences provide context and enhance a model's learning capabilities. However, providing complete articles as input can introduce noise and hinder the model's ability to discern valuable information from irrelevant data. To address these limitations, we present BioBigBird, a domain-specific bidirectional language model pretrained on extensive biomedical literature and clinical data. Unlike previous approaches, BioBigBird leverages entire articles of up to 4096 tokens, thanks to the BigBird model's efficient handling of longer sequences. Moreover, we propose a multi-stage pretraining method to overcome the noise introduced by complete articles. Our comprehensive evaluation on the BLURB benchmark reveals that BioBigBird delivers highly competitive results compared to state-of-the-art models such as PubMedBERT, demonstrating its efficacy in processing biomedical text data. Our research contributes to the growing body of knowledge on developing effective pretraining methods for domain-specific language models, with potential applications not only in biomedicine but also in other domains that require processing large volumes of textual data.
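The long-sequence ingredient can be sketched with the public BigBird checkpoint on Hugging Face; this is only the skeleton (in real pretraining, tokens would be masked by a data collator, and the biomedical corpus and multi-stage schedule are not reproduced here):

```python
# Skeleton of long-sequence masked-language-model training with BigBird.
from transformers import BigBirdForMaskedLM, BigBirdTokenizerFast

tok = BigBirdTokenizerFast.from_pretrained("google/bigbird-roberta-base")
model = BigBirdForMaskedLM.from_pretrained("google/bigbird-roberta-base",
                                           attention_type="block_sparse")

article = "Full-text biomedical article goes here ..."
batch = tok(article, truncation=True, max_length=4096, return_tensors="pt")
# In actual pretraining a data collator would mask ~15% of tokens; passing the
# inputs as labels here just demonstrates the loss computation end to end.
out = model(**batch, labels=batch["input_ids"])
print(out.loss)
```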
46. Generative Pseudo Labelling for Biological Information Retrieval
Reneeth Krishna MG
Abstract
Retrieving pertinent biomedical literature from PubMed is a crucial task for healthcare professionals and researchers. However, the effectiveness of retrieval systems can be limited by the lack of labeled data and the variation of data across topics or domains. Previous studies found that existing BERT-based dense retrievers suffer a significant performance drop under domain shift. To address this challenge, we propose leveraging the Generative Pseudo Labeling (GPL) technique to enhance the retrieval performance of dense retrievers by enabling domain adaptation without labeled data. Our approach uses a generative model to produce pseudo labels for unlabeled data, after which a cross-encoder ensures the quality of the generated labels. Our experimental findings indicate that the proposed GPL-based method offers notable improvements in retrieval performance, particularly for biological datasets with limited labeled data or significant domain differences. Our study's primary contribution is labeling the vast unlabelled PubMed abstract corpus, which can potentially pave the way for domain adaptation of numerous dense retrievers.
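One pass of a GPL-style pipeline, sketched with off-the-shelf checkpoints from the sentence-transformers ecosystem; the model names are illustrative and the hard-negative mining step is stubbed out:

```python
# GPL-style pseudo labeling: generate a query for a passage, then score
# (query, passage) pairs with a cross-encoder to obtain a margin label.
from transformers import T5ForConditionalGeneration, T5Tokenizer
from sentence_transformers import CrossEncoder

qgen_name = "BeIR/query-gen-msmarco-t5-base-v1"
qgen_tok = T5Tokenizer.from_pretrained(qgen_name)
qgen = T5ForConditionalGeneration.from_pretrained(qgen_name)
cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

passage = "Metformin is a first-line agent for type 2 diabetes ..."
ids = qgen_tok(passage, return_tensors="pt", truncation=True).input_ids
query = qgen_tok.decode(qgen.generate(ids, max_length=32, do_sample=True)[0],
                        skip_special_tokens=True)              # pseudo query

negative = "Aspirin irreversibly inhibits cyclooxygenase ..."  # mined in practice
pos_score, neg_score = cross.predict([(query, passage), (query, negative)])
margin = pos_score - neg_score  # pseudo label for MarginMSE bi-encoder training
```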
47. Towards Better Evaluation of Natural Language Generation
Ananya Sai
Abstract
Recent advances in machine learning research have increased interest in a wide range of automatic text-generation tasks. Their applications span translation systems such as Google Translate and Bing Translator, which have existed since the early 2000s, to the more recent chatbots and search-engine assistants such as ChatGPT and Bard, which are making technology headlines for their ability to summarize documents and write poems, op-ed articles, and code, among other things. However, these systems are not always reliable and can raise concerns. Diligent evaluation is also necessary to measure scientific progress and to decide which system is better. We not only have several text generation systems from industry and academic labs for various tasks, but also multiple versions of each system, trained with hundreds of combinations of settings, if not more. Ideally, we would want expert humans to evaluate the outputs of these systems in multiple scenarios to decide their ranking; in practice, however, human evaluation cannot keep up with the rapid rate of development in this area. The research community has hence largely relied on automatic evaluation metrics rather than human evaluations. Automatic metrics, while quick and cost-effective, are not always accurate. The search for more reliable metrics is ongoing, with more than 50 metrics proposed for different text generation tasks in the last six years. Our research focuses on these automatic evaluation metrics for generated texts:
● What are the available metrics? Which ones are apt for a given scenario?
We propose a taxonomy to categorize the existing metrics and help choose relevant metrics based on a specific task and end-goal.
● How robust and reliable are these metrics?
We propose techniques to analyze metrics, showing their shortcomings and the potential of an adversarial system gaming the metric to get undeserved high scores.
● How can we enhance automatic evaluation metrics?
We propose techniques and richer datasets towards developing better metrics, addressing the lack of multiple references and adversarial examples in existing approaches.
48. Safe Human-in-Loop Reinforcement Learning
Swathi
Abstract
In traditional reinforcement learning (RL) problems, agents typically learn by being rewarded for desirable actions and penalized for undesirable ones. Traditional RL approaches have been applied successfully in different applications [1]. However, in some cases it is preferable that the agent not have to "make mistakes" during training before learning that an action is undesirable. For example, while training a robot for autonomous driving, it is preferable that the robot does not have to take an unsafe action (such as falling off a cliff) before learning that it is undesirable. This is particularly relevant in human-in-the-loop applications where the cost of an unsafe action is very high, even during training. In such cases, the agent can be given prior information about the safety of the various states and actions by human experts, and it then learns an optimal policy under safety constraints. In this work, this constrained RL with human intervention on unsafe actions is called the safe human-in-loop reinforcement learning problem. The objective of this work is to formulate the safe human-in-loop reinforcement learning problem under the RL framework with constraints and to investigate the effects of this kind of training on the performance of the agent. Previous approaches to safe and human-in-the-loop RL include estimating safe sets using Gaussian processes [2], [4], defining surrogate Markov Decision Processes with an intervention policy [3], and using a model-based approach to predict the near future during action selection [5]. In this paper, we propose two different approaches to incorporate the notion of safety based on policy iteration. Based on these two approaches, we propose the Safe Q-learning and Partially Safe Q-learning algorithms. The proposed algorithms are applied to grid environments, and simulation results are reported across various settings.
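A tabular sketch of the masking idea behind such algorithms, where expert-flagged unsafe state-action pairs are excluded from both action selection and the bootstrap target; the authors' two variants may differ in detail, and env.step here is a stand-in returning (next_state, reward, done):

```python
# Illustrative tabular "safe Q-learning" step with an expert-provided mask.
import numpy as np

def safe_q_learning_step(Q, safe, s, eps, env, alpha=0.1, gamma=0.99,
                         rng=np.random.default_rng()):
    """Q: (n_states, n_actions) table; safe: boolean mask of allowed pairs."""
    allowed = np.flatnonzero(safe[s])          # expert-provided safety mask
    if rng.random() < eps:                     # epsilon-greedy over the safe set
        a = rng.choice(allowed)
    else:
        a = allowed[np.argmax(Q[s, allowed])]
    s2, r, done = env.step(s, a)               # placeholder environment interface
    target = r if done else r + gamma * np.max(Q[s2, np.flatnonzero(safe[s2])])
    Q[s, a] += alpha * (target - Q[s, a])      # TD update with a safe argmax
    return s2, done
```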
49. Physics-Informed Model-Based Reinforcement Learning
Adithya Ramesh
Abstract
We apply reinforcement learning (RL) to physical systems, in particular, robotic systems undergoing rigid body motion without contacts, such as N-pendulums and Cart-N-poles. One of the drawbacks of traditional RL algorithms has been their poor sample efficiency. One approach to improve the sample efficiency is model-based RL. In our model-based RL algorithm, we learn a model of the environment, essentially its transition dynamics and reward function, use it to generate imaginary trajectories and backpropagate through them to update the policy, exploiting the differentiability of the model. Intuitively, learning more accurate models should lead to better model-based RL performance. Recently, there has been growing interest in developing better deep neural network based dynamics models for physical systems, by utilizing the structure of the underlying physics. We compare two versions of our model-based RL algorithm, one which uses a standard deep neural network based dynamics model and the other which uses a much more accurate, physics-informed neural network based dynamics model. We show that, in model-based RL, model accuracy mainly matters in environments that are sensitive to initial conditions, measured using the Lyapunov exponent. In these environments, the physics-informed version of our algorithm achieves significantly better average-return and sample efficiency. In environments that are not sensitive to initial conditions, both versions of our algorithm achieve similar average-return, while the physics-informed version achieves better sample efficiency. We also show that, in challenging environments, where we need a lot of samples to learn, physics-informed model-based RL can achieve better average-return than state-of-the-art model-free RL algorithms such as Soft Actor-Critic, by generating accurate imaginary data. In the next phase of our work, we will be focusing on robotic systems undergoing rigid body motion with contacts, such as robotic manipulators and legged robots.
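The core policy update described here can be condensed to a few lines of PyTorch: roll the learned, differentiable dynamics and reward models forward from real states and ascend the imagined return by backpropagating through the rollout.

```python
# Policy loss via imagination with a learned, differentiable model.
import torch

def policy_loss_via_imagination(policy, dynamics, reward_fn, start_states,
                                horizon=15, gamma=0.99):
    s, total = start_states, 0.0
    for t in range(horizon):
        a = policy(s)                    # differentiable action
        r = reward_fn(s, a)              # learned reward model
        s = dynamics(s, a)               # learned transition model
        total = total + (gamma ** t) * r.mean()
    return -total                        # minimise the negative imagined return

# optimiser.zero_grad(); policy_loss_via_imagination(...).backward(); optimiser.step()
```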
50. CS-VIPER: Cost-Sensitive Tree Policies for Reinforcement Learning
Siddharth Nishtala
Abstract
Trees have emerged as the most popular choice of intrinsically interpretable models to represent reinforcement learning policies. Due to the challenges associated with learning a tree policy directly, various approaches have leveraged neural network policies to generate datasets that can be used to train tree-based models in a supervised manner. However, existing approaches have largely ignored that when learning through such methods, the tree policy fails to account for how good or bad a given action is. In this work, we show that accounting for action values and incorporating them as misclassification costs during the tree training process can generate tree policies with better performance. We propose CS-VIPER, a cost-sensitive variant of the VIPER algorithm and empirically demonstrate that our method outperforms VIPER across environments.
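One way to realise cost-sensitive tree extraction in this spirit is to weight each state by its Q-value gap when fitting the tree to the oracle's actions; this follows the VIPER-style weighting, and the exact CS-VIPER costs may differ:

```python
# Cost-sensitive tree extraction: states where a wrong action is expensive
# (large Q-value gap) carry more weight during supervised tree training.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_cost_sensitive_tree(states, q_values, max_depth=8):
    """states: (N, d) array; q_values: (N, n_actions) from the oracle critic."""
    oracle_actions = q_values.argmax(axis=1)             # teacher labels
    cost = q_values.max(axis=1) - q_values.min(axis=1)   # price of a bad action
    tree = DecisionTreeClassifier(max_depth=max_depth)
    tree.fit(states, oracle_actions, sample_weight=cost) # costly states matter more
    return tree
```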
51. Continuous Tactical Optimism and Pessimism
Kartik
Abstract
In the field of reinforcement learning for continuous control, deep off-policy actor-critic algorithms have become a popular approach due to their ability to address function approximation errors through the use of pessimistic value updates. However, this pessimism can reduce exploration, which is typically seen as beneficial for learning in uncertain environments. Tactical Optimism and Pessimism (TOP) proposed an actor-critic framework that dynamically adjusts the degree of optimism used in value learning based on the task and learning stage. However, their fixed bandit framework introduces hyperparameters for each task: the number of arms and the arm values. To simplify this problem, we consider learning the degree of optimism 𝛽 while training the agent in the environment. We demonstrate that this approach outperforms methods that use a fixed level of optimism in a series of continuous control tasks in the Walker2d-v2 and HalfCheetah-v2 environments, and that it can be easily implemented in various off-policy algorithms. We call our algorithm cTOP, or continuous TOP.
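A toy rendering of the learned-optimism idea: the critic target mixes the mean and spread of a critic pair, and 𝛽 is a learnable scalar updated alongside the agent (by its own optimizer and meta-objective, not shown) rather than selected by a fixed bandit:

```python
# Learned degree of optimism over a pair of critics.
import torch

beta = torch.nn.Parameter(torch.zeros(1))   # learned degree of optimism

def optimism_adjusted_target(q1, q2, beta):
    mean = 0.5 * (q1 + q2)                  # belief mean from the critic pair
    std = 0.5 * (q1 - q2).abs()             # belief spread (epistemic proxy)
    return mean + beta * std                # beta > 0 optimistic, < 0 pessimistic

# beta would get its own optimizer, e.g. torch.optim.Adam([beta], lr=3e-4),
# and be updated alongside the agent instead of chosen from a fixed set of arms.
```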
52. Optimization of Sample Size for Go/No-go Experimental Settings
Anusha Kumar
Abstract
Experimentation techniques have increasingly been shown to improve the performance of many real-world systems. Combining such techniques with data-driven strategies can help in understanding the importance of exploring a system before choosing to exploit the findings. The inherent uncertainty in the contextual environment has also raised the need to differentiate between decision strategies used for specific strata of the experimental population. This research addresses these concerns for sample-size recommendations in an online experimental setting with go/no-go decisions. We study problems that require a one-shot experimental phase followed by a targeted form of exploitation over a heterogeneous population. We assume a Bayesian framework, with factor effects having distributional priors and related to a response variable through a General Linear Model. We first formulate the expected benefit from knowing the true distribution of the effects. We then extensively consider the effect of imperfect information on the expected benefit for each category of the population and suggest some key ideas for practitioners. We propose an algorithm to compute the recommended sample size by optimizing the expected benefit based on prior beliefs.
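A hedged sketch of how such a recommendation could be computed by Monte Carlo over the prior; simulate and benefit stand in for the paper's GLM machinery and are not specified here:

```python
# Monte Carlo search over candidate sample sizes: draw effects from the prior,
# simulate an experiment of size n, apply the go/no-go rule, score the benefit.
import numpy as np

def recommended_sample_size(candidate_ns, simulate, benefit, n_mc=1000,
                            rng=np.random.default_rng(0)):
    best_n, best_val = None, -np.inf
    for n in candidate_ns:
        vals = [benefit(simulate(n, rng)) for _ in range(n_mc)]
        expected = np.mean(vals)       # expected benefit net of sampling cost
        if expected > best_val:
            best_n, best_val = n, expected
    return best_n
```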
53. Algorithmic Trading with Artificially Intelligent Agents: Reinforcement Learning in Action
Anshuman Senapati
Abstract
The increased use of intelligent algorithms by firms to price their goods and services has recently raised concerns among regulatory authorities about the emergence of a certain class of undesirable market behavior. To this end, it has been empirically shown that even the most basic reinforcement learning algorithms, acting as players in simple economic settings, can exhibit collusive behavior despite the absence of explicit communication channels. We document similar behavior in the securities market, where algorithmic trading dominates trading volume and has important implications for price discovery and market efficiency. We also move beyond the simple Q-learning used in the literature on goods and services pricing and use sophisticated state-of-the-art policy gradient agents operating in the more challenging, continuous spaces that come with realistic settings of the securities market. For this, we let multiple Soft Actor-Critic (SAC) agents act as speculators, with and without private information, in the Kyle (1989) market model extended to incorporate multiple periods of trading. We provide empirical results showing that such interaction leads to a marked deviation from the analytical single-period behavior of the original work. We further show that this behavior correlates with the uncertainty in the market, and that a progression towards deterministic and less challenging environments aggravates the situation.
54. Steps in Object Detection and Image Segmentation
Ravindra kumar
Abstract
In this report, we state the steps needed to perform object detection and image segmentation tasks. We suggest model selection steps and popular techniques, review many models and their evolution, and compare them using mean Average Precision (mAP), mIoU, FPS, and the number of parameters. We present the observations and results from the experiments performed, followed by an overall discussion. Finally, we conclude and outline possible future work.
55. A Survey on Churn Prediction Techniques and Time Series Forecasting
Karthikeyan S
Abstract
As businesses have become more competitive, companies must strive for continued development and growth to survive in their industry, and a solid customer base has become critical. Acquiring new customers and retaining existing ones are both essential, but in many domains customer acquisition has saturated, with nearly every person already a customer of some company, making retention crucial. Identifying and handling potential churn in the customer base has therefore become increasingly significant. Similarly, businesses need to make informed decisions to achieve their goals. In today's fast-paced business environment, organizations must constantly adapt to market trends and changes in consumer behavior. Time series forecasting enables organizations to analyze historical data and predict future trends, allowing informed decisions about resource allocation, product development, and marketing strategies. Data-driven business decisions require a variety of algorithms and techniques, and substantial prerequisite knowledge is needed to pick through the sophisticated options and match them to a business need. This work develops reference guideline flowcharts through an extensive technical survey of the two tasks mentioned above, Churn Prediction and Time Series Forecasting, including a comprehensive study of the available research and techniques on customer churn across business domains. The flowcharts are designed as a ready reference for anyone with a business need to quickly sort through the algorithms and models suited to their demands as companies and enterprises grow and expand.
56. A Survey on Customer Segmentation and Natural Language Tasks
N Kausik
Abstract
As companies and businesses grow and expand, the data they obtain and generate also increases. To fuel further development and growth, companies try to make data-driven business decisions that require the use of various algorithms, machine learning, and AI techniques. Over the years, numerous algorithms and studies have been published on these problems; however, it is difficult for a non-expert to navigate the many complex algorithms to solve a given business task effectively. Through this project, we address this issue by developing a set of rules and decisions for selecting the best algorithm or model in two areas: Customer Segmentation and Natural Language Tasks (8 tasks). The rules are condensed into easy-to-use flowcharts so that anyone with a business need can easily find the references, algorithms, and models best suited to it.
57. A Survey on Recommender System, Click Through Prediction, Search and Ranking
Kankan Jana
Abstract
With the advent of internet technologies, we face the challenge of information overload. In such scenarios, information filtering systems have become crucial to every commercial application. Recommender systems, click-through rate (CTR) prediction, and search and ranking are subclasses of filtering systems that are critically important in various industries, particularly those that rely heavily on an online presence and digital marketing. Recommender systems predict user preferences and make personalized recommendations, while click-through prediction estimates the likelihood of a user clicking on a specific item. Search engines play a vital role in information retrieval, and ranking techniques determine the order of results presented to the user. Numerous research papers have been published in the last decade, and many new algorithms and approaches have been proposed on these topics, making it challenging to choose the right method from the many available options. Through this survey, we reviewed multiple research papers, APIs, evaluation metrics, and experimental frameworks. The ultimate goal is to devise a concrete to-do list for framing the problem and a recipe for solving it.
58. Hybrid Modeling Approach for Data Augmentation of Multivariate Time Series
Shubham Kashyapi
Abstract
Time series analysis has been used to model systems and make predictions in various domains. Although modern machine learning (ML) and deep learning (DL) methods can successfully model time series, they require substantial data for training. However, obtaining vast amounts of data from engineering systems such as aircraft may not be feasible. Thus, it is important to develop data augmentation techniques that generate synthetic time series from very little training data. The data augmentation problem is even more challenging when the multivariate time series contains both continuous and discrete variables. Conventional augmentation techniques based on random transformations do not model the underlying dependencies between the variables. Deep learning techniques, in contrast, learn the distribution of the data and then generate synthetic data from this distribution. In this work, we develop a novel deep generative model combined with domain knowledge to model the system's dynamics. We demonstrate that the generated synthetic data effectively captures the physical dynamics and the control loops in aircraft time series.
59. Optimizing Traffic Control with Model-Based Learning: A Pessimistic Approach to Data-Efficient Policy Inference
C Siddarth
Abstract
Traffic signal control is an important problem in urban mobility with significant potential for economic and environmental impact. While there is growing interest in Reinforcement Learning (RL) for traffic signal control, work so far has focused on learning through simulations, which can introduce inaccuracies due to simplifying assumptions. Instead, real traffic experience data is available and could be exploited at minimal cost; recent progress in offline, or batch, RL has enabled just that. Model-based offline RL methods, in particular, have been shown to generalize from experience data much better than others. We build a model-based learning framework that infers a Markov Decision Process (MDP) from a dataset collected under a cyclic traffic signal control policy, which is both commonplace and easy to gather. The MDP is built with pessimistic costs to manage out-of-distribution scenarios, using an adaptive shaping of rewards that is shown to provide better regularization than prior related work, in addition to being PAC-optimal. Our model is evaluated on a complex signalized roundabout, showing that it is possible to build highly performant traffic control policies in a data-efficient manner.
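A schematic of the pessimistic shaping idea, using ensemble disagreement as the uncertainty proxy (one common choice in model-based offline RL; the adaptive shaping proposed here may differ):

```python
# Pessimistic reward for the learned MDP: penalise transitions where an
# ensemble of dynamics models disagrees, i.e. likely out-of-distribution.
import numpy as np

def pessimistic_reward(models, s, a, r_hat, lam=1.0):
    """models: ensemble of learned dynamics models; r_hat: predicted reward."""
    preds = np.stack([m.predict(s, a) for m in models])  # next-state predictions
    uncertainty = preds.std(axis=0).max()                # disagreement = OOD signal
    return r_hat - lam * uncertainty                     # shaped, pessimistic reward
```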