Publications

Conferences

Radha Chitta, Alexander Hudek, "A Reliable and Accurate Multiple Choice Question Answering System for Due Diligence", International Conference on Artificial Intelligence and Law.

Fenglong Ma, Radha Chitta, Saurabh Kataria, Jing Zhou, Palghat Ramesh, Tong Sun, Jing Gao, "Long-Term Memory Networks for Question Answering", IJCAI Workshop on Semantic Machine Learning, Melbourne, Australia, August 19-25, 2017. [pdf]

Abstract - Question answering is an important and difficult task in the natural language processing domain, because many basic natural language processing tasks can be cast as question answering tasks. Several deep neural network architectures have been developed recently, which employ memory and inference components to memorize and reason over text information, and generate answers to questions. However, a major drawback of many such models is that they are capable of only generating single-word answers. In addition, they require a large amount of training data to generate accurate answers. In this paper, we introduce the Long-Term Memory Network (LTMN), which incorporates both an external memory module and a Long Short-Term Memory (LSTM) module to comprehend the input data and generate multi-word answers. The LTMN model can be trained end-to-end using back-propagation and requires minimal supervision. We test our model on two synthetic data sets (based on Facebook’s bAbI data set) and the real-world Stanford question answering data set, and show that it can achieve state-of-the-art performance.
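As an illustration of the soft attention read over external memory that such architectures rely on, here is a minimal NumPy sketch; the sentence and question encodings are random stand-ins, not the paper's learned embeddings:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_read(memory, question):
    """Soft attention read: score every stored sentence encoding against
    the question encoding, then return the attention-weighted sum."""
    scores = memory @ question        # one relevance score per memory slot
    weights = softmax(scores)         # attention distribution over slots
    return weights @ memory, weights

# Toy usage: 6 stored sentences, 8-dimensional encodings
rng = np.random.default_rng(0)
memory = rng.normal(size=(6, 8))
question = rng.normal(size=8)
context, weights = memory_read(memory, question)
```

In the full model, the resulting context vector would feed an LSTM decoder that emits a multi-word answer rather than a single token.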

Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, Jing Gao, "Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks", ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Halifax, Nova Scotia, August 13-17, 2017. [pdf]

Abstract - Predicting the future health information of patients from historical Electronic Health Records (EHR) is a core research task in the development of personalized healthcare. Patient EHR data consist of sequences of visits over time, where each visit contains multiple medical codes, including diagnosis, medication, and procedure codes. The most important challenges for this task are to model the temporality and high dimensionality of sequential EHR data and to interpret the prediction results. Existing work addresses this problem by employing recurrent neural networks (RNNs) to model EHR data and utilizing a simple attention mechanism to interpret the results. However, RNN-based approaches suffer from degraded performance when the sequences are long, and the relationships between subsequent visits are ignored by current RNN-based approaches. To address these issues, we propose Dipole, an end-to-end, simple and robust model for predicting patients’ future health information. Dipole employs bidirectional recurrent neural networks to remember all the information of both past and future visits, and it introduces three attention mechanisms to measure the relationships of different visits for the prediction. With the attention mechanisms, Dipole can interpret the prediction results efficiently. Dipole also allows us to interpret the learned medical code representations, which have been positively confirmed by medical experts. Experimental results on two real-world EHR datasets show that the proposed Dipole can significantly improve the prediction accuracy compared with state-of-the-art diagnosis prediction approaches and provide clinically meaningful interpretation.
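A minimal NumPy sketch of one attention variant of the kind described (a bilinear scoring form over visit hidden states); the hidden states are random stand-ins for the bidirectional RNN outputs, and the weight matrix is untrained:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def general_attention(H, W_a):
    """Score each visit's hidden state against the latest visit's hidden
    state through a bilinear map W_a, then form the attention-weighted
    context vector used for prediction."""
    scores = H @ W_a @ H[-1]      # relevance of each visit to the latest one
    alpha = softmax(scores)       # attention weights over visits
    return alpha @ H, alpha

# Toy usage: 5 visits with 4-dimensional hidden states
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
W_a = rng.normal(size=(4, 4))
context, alpha = general_attention(H, W_a)
```

The attention weights alpha are what make the prediction interpretable: each weight says how much a past visit contributed to the context vector.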

Xitong Yang, Palghat Ramesh, Radha Chitta, Sriganesh Madhvanath, Edgar A. Bernal, Jiebo Luo, "Deep Multimodal Representation Learning From Temporal Data", IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, July 22-25, 2017. [pdf]

Abstract - In recent years, Deep Learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications. When the available modalities consist of time series data such as video, audio and sensor signals, it becomes imperative to consider their temporal structure during the fusion process. In this paper, we propose the Correlational Recurrent Neural Network (CorrRNN), a novel temporal fusion model for fusing multiple input modalities that are inherently temporal in nature. Key features of our proposed model include: (i) simultaneous learning of the joint representation and temporal dependencies between modalities, (ii) use of multiple loss terms in the objective function, including a maximum correlation loss term to enhance learning of cross-modal information, and (iii) the use of an attention model to dynamically adjust the contribution of different input modalities to the joint representation. We validate our model via experimentation on two different tasks: video- and sensor-based activity classification, and audiovisual speech recognition. We empirically analyze the contributions of different components of the proposed CorrRNN model, and demonstrate its robustness, effectiveness and state-of-the-art performance on multiple datasets.
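The maximum correlation loss term can be sketched as a per-dimension Pearson correlation between the two modalities' batch encodings (a simplification of the full objective; the encodings here are random placeholders rather than learned representations):

```python
import numpy as np

def correlation_loss(Zx, Zy, eps=1e-8):
    """Negative per-dimension Pearson correlation between two modalities'
    batch representations, summed over dimensions; minimizing this term
    pushes the two encodings to be maximally correlated."""
    Zx = Zx - Zx.mean(axis=0)
    Zy = Zy - Zy.mean(axis=0)
    num = (Zx * Zy).sum(axis=0)
    den = np.sqrt((Zx ** 2).sum(axis=0) * (Zy ** 2).sum(axis=0)) + eps
    return -(num / den).sum()

# Identical encodings are perfectly correlated; independent random ones are not
rng = np.random.default_rng(0)
Z = rng.normal(size=(32, 8))          # batch of 32, 8-dim encodings
loss_same = correlation_loss(Z, Z)    # approaches -8 (one per dimension)
loss_rand = correlation_loss(Z, rng.normal(size=(32, 8)))
```

In training, this term would be combined with the reconstruction losses so the joint representation both preserves each modality and captures cross-modal structure.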

Radha Chitta, Palghat Ramesh, Jing Zhou, Saurabh Kataria, Tong Sun, "Modeling Disease Progression Using Recurrent Neural Networks", Workshop on Knowledge Discovery in Healthcare Data held in conjunction with IJCAI, New York City, New York, July 10th, 2016. [pdf] [poster]

Abstract - Disease progression models can be used to predict the incidence of and changes in disease status over time, leading to strategies for preventing and slowing the advancement of chronic diseases. The state-of-the-art disease progression models are not sufficiently generic and scalable, cannot model multiple diseases, and require data specific to the disease being modeled. Our objective is to build a generic, scalable latent patient model which can be employed to answer several questions about a patient's current and future health status. We use a Long Short-Term Memory recurrent neural network to analyze medical claims data, and develop a latent model to represent patients' health status and transitions. Using this model, we present our preliminary work on predicting the incidence of complications in diabetic patients and the rate of disease progression.

Yuqing Xing, Jing Zhou, Radha Chitta, Palghat Ramesh, Tong Sun, "Patient Profiling with Latent Medical Concepts", Workshop on Knowledge Discovery in Healthcare Data held in conjunction with IJCAI, New York City, New York, July 10th, 2016. [pdf]

Abstract - The current healthcare system is not sustainable due to the rapidly rising cost of care, particularly the growing burden of chronic conditions. Analytics built on healthcare data have shown great promise towards providing better quality of care at lower cost. While many existing analytics achieve good accuracy, they are usually restricted to a specific end application and offer limited actionable insight for healthcare professionals to improve outcomes. In this paper, we present our preliminary work on latent representation of patients using a restricted Boltzmann machine (RBM). The RBM is an unsupervised generative model that can learn the intrinsic complex structure of raw data without human annotation. The learned latent representation is less sensitive to noise and missing data points. We discovered that many elements in the latent representation are associated with clinically meaningful concepts, which is very attractive to healthcare professionals since they can interpret them and take action. We demonstrated that the RBM inferred the probabilities of intentionally removed diagnoses, procedures, and drugs based on the learned joint probability distribution. Finally, we showed that the latent representation is better than raw features for hospitalization prediction in terms of both interpretability and accuracy.
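The inference of removed codes described above can be sketched as a mean-field up-down pass through a trained RBM; the weights below are random placeholders rather than learned parameters, and the sizes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_up_down(v, W, b_hid, b_vis):
    """Infer the latent representation from a (possibly incomplete) binary
    medical-code vector, then reconstruct a probability for every visible
    unit, including codes that were removed from the input."""
    h = sigmoid(v @ W + b_hid)          # latent patient representation
    v_prob = sigmoid(h @ W.T + b_vis)   # reconstruction probabilities
    return h, v_prob

# Toy usage: 20 medical codes, 6 hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(20, 6))
b_hid, b_vis = np.zeros(6), np.zeros(20)
v = (rng.random(20) < 0.3).astype(float)   # sparse code-presence vector
h, v_prob = rbm_up_down(v, W, b_hid, b_vis)
```

With trained weights, the reconstruction probabilities for zeroed-out codes are exactly the quantities used to recover intentionally removed diagnoses, procedures, and drugs.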

Radha Chitta, Rong Jin, and Anil Jain, "Stream Clustering: Efficient Kernel-based Approximation using Importance Sampling", Data Science and Big Data Analytics Workshop held in conjunction with ICDM, Atlantic City, New Jersey, November 14-17, 2015. [pdf] [slides]

Abstract - Stream clustering methods, which group continuous, temporally ordered dynamic data instances, have been used in a number of applications such as stock market analysis, network analysis, and cosmological analysis. Most of the popular stream clustering algorithms are linear in nature, i.e., they assume that the data is linearly separable in the input space and use measures such as the Euclidean distance to define the inter-point similarity. Though these linear clustering algorithms are efficient, they do not achieve acceptable clustering accuracy on real-world data. Kernel-based clustering algorithms, which use non-linear similarity measures, yield better clustering quality, but are unsuitable for clustering data streams due to their high running time and memory complexity. We propose an efficient kernel-based clustering algorithm, called the Approximate Stream Kernel k-means, which uses importance sampling to sample a subset of the data stream, and clusters the entire stream based on each data point’s similarity to the sampled data points in real-time. Every time a data point is sampled, the kernel matrix representing the similarity between the sampled points is updated, and projected to a low dimensional space spanned by its top eigenvectors. The data points are clustered in this low-dimensional space using the k-means algorithm. Thus, the Approximate Stream Kernel k-means algorithm performs clustering in linear time using kernel-based similarity. We show that only a small number of points need to be sampled from the data stream, and the resulting approximation error is well-bounded. Using several large benchmark data sets to simulate data streams, we demonstrate that the proposed algorithm achieves a significant speedup over other kernel-based clustering algorithms with minimal loss in clustering accuracy. We also demonstrate the practical applicability of the proposed algorithm by using it to efficiently find trending topics in the Twitter stream data.
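One concrete way to realize the importance sampling step is via statistical leverage scores with respect to the kernel matrix's top eigenspace; the sketch below shows the batch computation of such sampling probabilities, not the paper's exact streaming update:

```python
import numpy as np

def leverage_sampling_probs(K, r):
    """Leverage scores of a kernel matrix w.r.t. its top-r eigenspace,
    normalized into sampling probabilities: points that matter most for
    the low-rank structure get higher probability of being sampled."""
    evals, evecs = np.linalg.eigh(K)
    U = evecs[:, np.argsort(evals)[::-1][:r]]   # top-r eigenvectors
    scores = (U ** 2).sum(axis=1)               # leverage scores (sum to r)
    return scores / r                            # probabilities (sum to 1)

# Toy usage: RBF kernel over 30 random points
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)
p = leverage_sampling_probs(K, r=5)
```

Sampling by these probabilities concentrates the sample on points that carry the most information about the top eigenvectors, which is what keeps the low-dimensional projection accurate.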

Radha Chitta, Anil Jain, and Rong Jin, "Sparse Kernel Clustering of Massive High-Dimensional Data sets with Large Number of Clusters", PhD Workshop at CIKM, Melbourne, Australia, October 19-23, 2015. [pdf] [Tech Report] [slides]

Abstract - In clustering applications involving documents and images, in addition to the large number of data points (N) and their high dimensionality (d), the number of clusters (C) into which the data need to be partitioned is also large. Kernel-based clustering algorithms, which have been shown to perform better than linear clustering algorithms, have high running time complexity in terms of N, d and C. We propose an efficient sparse kernel k-means clustering algorithm, which incrementally samples the most informative points from the data set using importance sampling, and constructs a sparse kernel matrix using these sampled points. Each row in this matrix corresponds to a data point’s similarity with its p-nearest neighbors among the sampled points (p ≪ N). This sparse kernel matrix is used to perform clustering and obtain the cluster labels. This combination of sampling and sparsity reduces both the running time and memory complexity of kernel clustering. In order to further enhance its efficiency, the proposed algorithm projects the data onto the top C eigenvectors of the sparse kernel matrix and clusters these eigenvectors using a modified k-means algorithm. The running time of the proposed sparse kernel k-means algorithm is linear in N and d, and logarithmic in C. We show analytically that only a small number of points need to be sampled from the data set, and the resulting approximation error is well-bounded. We demonstrate, using several large high-dimensional text and image data sets, that the proposed algorithm is significantly faster than classical kernel-based clustering algorithms, while maintaining clustering quality.
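The sparse kernel construction described above can be sketched as follows: each data point keeps its RBF similarity to only the p most similar sampled points (function and parameter names here are illustrative):

```python
import numpy as np

def sparse_kernel_rows(X, S, p, gamma=1.0):
    """Build an n x m matrix where each row keeps a point's RBF similarity
    to only its p nearest neighbors among the m sampled points S,
    zeroing all other entries."""
    sq = ((X[:, None] - S[None]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                      # dense n x m kernel block
    keep = np.argsort(-K, axis=1)[:, :p]         # p most similar samples per row
    sparse = np.zeros_like(K)
    rows = np.arange(len(X))[:, None]
    sparse[rows, keep] = K[rows, keep]
    return sparse

# Toy usage: 50 points, 10 sampled points, 3 neighbors kept per row
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
S = X[rng.choice(50, size=10, replace=False)]
M = sparse_kernel_rows(X, S, p=3)
```

A production implementation would store the result in a compressed sparse format rather than a dense array of zeros; that sparsity is where the memory savings come from.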

Radha Chitta, Rong Jin, and Anil Jain, "Efficient Kernel Clustering Using Random Fourier Features", ICDM, Brussels, Belgium, December 10-13, 2012. [pdf] [slides]

Abstract - Kernel clustering algorithms have the ability to capture the non-linear structure inherent in many real world data sets and thereby, achieve better clustering performance than Euclidean distance based clustering algorithms. However, their quadratic computational complexity renders them non-scalable to large data sets. In this paper, we employ random Fourier maps, originally proposed for large scale classification, to accelerate kernel clustering. The key idea behind the use of random Fourier maps for clustering is to project the data into a low-dimensional space where the inner product of the transformed data points approximates the kernel similarity between them. An efficient linear clustering algorithm can then be applied to the points in the transformed space. We also propose an improved scheme which uses the top singular vectors of the transformed data matrix to perform clustering, and yields a better approximation of kernel clustering under appropriate conditions. Our empirical studies demonstrate that the proposed schemes can be efficiently applied to large data sets containing millions of data points, while achieving accuracy similar to that achieved by state-of-the-art kernel clustering algorithms.
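The core of the approach, a random Fourier map whose inner products approximate the RBF kernel, fits in a few lines of NumPy (a minimal sketch; the values of D and gamma below are arbitrary illustrative choices):

```python
import numpy as np

def random_fourier_features(X, D, gamma, seed=0):
    """Map X (n x d) into a D-dimensional space where inner products of
    the mapped points approximate the RBF kernel exp(-gamma*||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the kernel's Fourier transform, plus random phases
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Inner products in the transformed space track the exact kernel values
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
Z = random_fourier_features(X, D=5000, gamma=0.5, seed=2)
approx = Z @ Z.T
exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
```

An ordinary linear k-means can then be run on the rows of Z, which is what makes the overall scheme scale to millions of points.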

Radha Chitta, Rong Jin, Timothy C. Havens, and Anil K. Jain, "Approximate Kernel k-means: solution to Large Scale Kernel Clustering", KDD, San Diego, CA, August 21-24, 2011. [pdf]

Abstract - Digital data explosion mandates the development of scalable tools to organize the data in a meaningful and easily accessible form. Clustering is a commonly used tool for data organization. However, many clustering algorithms designed to handle large data sets assume linear separability of data and hence do not perform well on real world data sets. While kernel-based clustering algorithms can capture the non-linear structure in data, they do not scale well in terms of speed and memory requirements when the number of objects to be clustered exceeds tens of thousands. We propose an approximation scheme for kernel k-means, termed approximate kernel k-means, that reduces both the computational complexity and the memory requirements by employing a randomized approach. We show both analytically and empirically that the performance of approximate kernel k-means is similar to that of the kernel k-means algorithm, but with dramatically reduced run-time complexity and memory requirements.
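A compact sketch of the randomized approach: cluster centers are restricted to the span of m sampled points, so only an n x m kernel block is ever computed. The Nystrom-style embedding and the initialization below are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def rbf(A, B, gamma):
    sq = ((A[:, None] - B[None]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def approx_kernel_kmeans(X, k, m, gamma, iters=25, seed=0):
    """Kernel k-means restricted to m sampled points: build the n x m
    kernel block, embed the data so inner products approximate the full
    kernel, and run Lloyd's algorithm in the embedded space."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    K_nm = rbf(X, X[idx], gamma)                 # n x m kernel block
    K_mm = K_nm[idx]                             # m x m block among samples
    evals, evecs = np.linalg.eigh(K_mm)
    inv_sqrt = np.where(evals > 1e-6, 1.0 / np.sqrt(np.clip(evals, 1e-6, None)), 0.0)
    T = K_nm @ evecs @ np.diag(inv_sqrt) @ evecs.T   # T @ T.T ~ full kernel
    # Deterministic farthest-point initialization, then Lloyd iterations
    centers = [T[0]]
    for _ in range(k - 1):
        d = np.min([((T - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(T[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((T[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = T[labels == j].mean(axis=0)
    return labels

# Toy usage: two well-separated blobs recovered from a 40 x 10 kernel block
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(8.0, 0.3, size=(20, 2))])
labels = approx_kernel_kmeans(X, k=2, m=10, gamma=0.1)
```

Only the n x m block and the m x m sample block are ever materialized, which is the source of the reduced memory footprint relative to the full n x n kernel matrix.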

Timothy C. Havens, Radha Chitta, Anil K. Jain, and Rong Jin, "Approximation of Fuzzy and Possibilistic Kernel c-Means for Efficient Clustering of Large-Scale Datasets", FUZZ-IEEE, Taipei, Taiwan, June 27-30, 2011. [pdf]

Abstract - The ubiquity of personal computing technology has produced an abundance of staggeringly large data sets—the Library of Congress has stored over 160 terabytes of web data and it is estimated that Facebook alone logs over 25 terabytes of data per day. There is a great need for systems by which one can elucidate the similarity and dissimilarity among and between groups in these data sets. Clustering is one way to find these groups. In this paper, we propose an approximation method for the fuzzy and possibilistic kernel c-means clustering algorithms. Our approximation constrains the cluster centers to be linear combinations of a size m randomly selected subset of the n input objects, where m ≪ n. The proposed algorithm only requires an m × n rectangular portion of the full n × n kernel matrix and the n diagonal values, resulting in significant memory savings. Furthermore, the computational complexity of the c-means algorithm is substantially reduced. We demonstrate that up to 3 orders of magnitude of speedup are possible while achieving almost the same performance as the original kernel c-means algorithm.

Bhaskar Mitra, Radha Chitta, and Bharati Raghavan, "Automatic identification of dynamic virtual communities of practice based on browsing patterns", presented at the First National Conference on Unstructured Data Management, India, 2006. [pdf]

Abstract - Highly interactive computer-mediated communication has facilitated the formation of various forms of virtual learning communities on the Web, including communities of practice and Internet discussion groups engaged in interactive sharing and learning processes. However, the process of community formation or registration based on a user's interest structure is still fundamentally a manual process. In this paper, we present a novel approach to identifying virtual communities of practice that exist on the Web by examining the browsing patterns of online users using a neural network model.

Journals and Book Chapters

Fenglong Ma, Yaqing Wang, Houping Xiao, Ye Yuan, Radha Chitta, Jing Zhou, Jing Gao, "Incorporating medical code descriptions for diagnosis prediction in healthcare", BMC Medical Informatics and Decision Making, Volume 19, Article number: 267, 2019. [pdf]

Abstract - Diagnosis prediction aims to predict the future health status of patients according to their historical electronic health records (EHR), which is an important yet challenging task in healthcare informatics. Existing diagnosis prediction approaches mainly employ recurrent neural networks (RNNs) with attention mechanisms to make predictions. However, these approaches ignore the importance of code descriptions, i.e., the medical definitions of diagnosis codes. We believe that taking diagnosis code descriptions into account can help state-of-the-art models not only to learn meaningful code representations, but also to improve predictive performance, especially when the EHR data are insufficient.

Anil Jain, Rong Jin, and Radha Chitta, "Semi-supervised Clustering", in Handbook of Cluster Analysis, Chapman & Hall, 2014. [CRC Press] [Amazon]

Abstract - Clustering is an unsupervised learning problem whose objective is to find a partition of the given data. However, a major challenge in clustering is to define an appropriate objective function in order to find an optimal partition that is useful to the user. To facilitate data clustering, it has been suggested that the user provide some supplementary information about the data (e.g., pairwise relationships between a few data points), which, when incorporated in the clustering process, could lead to a better data partition. Semi-supervised clustering algorithms attempt to improve clustering performance by utilizing this supplementary information. In this chapter, we present an overview of semi-supervised clustering techniques and describe some prominent algorithms in the literature. We also present several applications of semi-supervised clustering.

Radha Chitta and M. Narasimha Murty, "Two-level k-means clustering algorithm for k–t relationship establishment and linear-time classification", Pattern Recognition, Volume 43, Issue 3, pages 796-804, 2010. [pdf]

Abstract - Partitional clustering algorithms, which partition the dataset into a pre-defined number of clusters, can be broadly classified into two types: algorithms which explicitly take the number of clusters as input and algorithms that take the expected size of a cluster as input. In this paper, we propose a variant of the k-means algorithm and prove that it is more efficient than standard k-means algorithms. An important contribution of this paper is the establishment of a relation between the number of clusters and the size of the clusters in a dataset through the analysis of our algorithm. We also demonstrate that the integration of this algorithm as a pre-processing step in classification algorithms reduces their running-time complexity.

Radha Chitta and M. Narasimha Murty, "Pattern Synthesis for SVM-Based Classifiers", book chapter in Encyclopedia of Data Mining and Data Warehousing (Editor: John Wang), 2009. [pdf]

Abstract - An important problem in pattern recognition is that of pattern classification. The objective of classification is to determine a discriminant function which is consistent with the given training examples and performs reasonably well on an unlabeled test set of examples. The degree of performance of the classifier on the test examples, known as its generalization performance, is an important issue in the design of the classifier. It has been established that good generalization performance can be achieved by providing the learner with a sufficiently large number of discriminative training examples. However, in many domains, it is infeasible or expensive to obtain a sufficiently large training set. Various mechanisms have been proposed in the literature to combat this problem. Active Learning techniques (Angluin, 1998; Seung, Opper, & Sompolinsky, 1992) reduce the number of training examples required by carefully choosing discriminative training examples. Bootstrapping (Efron, 1979; Hamamoto, Uchimura & Tomita, 1997) and other pattern synthesis techniques generate a synthetic training set from the given training set. We present some of these techniques and propose some general mechanisms for pattern synthesis.

Patents

  • Fenglong Ma, Radha Chitta, Jing Zhou, Palghat S Ramesh, Tong Sun, Saurabh Singh Kataria, "Long-term memory networks for knowledge extraction from text and publications", US Patent 2021. [pdf]

  • Radha Chitta and Alexander Hudek, "Text extraction, in particular table extraction from electronic documents", US Patent App 2021. [pdf]

  • Radha Chitta and Alexander Hudek, "System and method for applying artificial intelligence techniques to respond to multiple choice questions", US Patent App 2020. [pdf]