Data Mining Algorithms for CGH Data
Numerical and structural chromosomal imbalances are among the most prominent and pathogenetically relevant features of neoplastic cells. One method for measuring genomic aberrations is Comparative Genomic Hybridization (CGH), an analysis method for detecting regions with genomic imbalances (gains or losses of DNA segments). I am developing novel data-mining-based algorithms to aid cancer detection and classification using CGH. The goal is to develop techniques that find the key genetic intervals that can predict the type of cancer. Our preliminary results show that we can significantly improve on existing techniques.
Learning Correlations in High-dimensional Data using Mixture Modeling
Using a mixture of random variables to model data is a tried-and-true method common in data mining, machine learning, and statistics. With mixture modeling it is often possible to accurately model even complex, multi-modal data using very simple components. However, a significant but under-appreciated problem with mixture modeling is that, as the dimensionality of the data increases, interpreting the meaning of the underlying mixture components becomes more and more difficult. We have developed a fundamentally different alternative, the Mixture-Of-Subsets (MOS) model, with the aim of making the model more informative and easier to understand, as well as allowing it to approximate the underlying reality more closely. We do this by making two fundamental changes to the classical mixture model. First, we allow a data point to be generated by a set of components rather than just a single component. Second, we limit the number of data attributes that each component can influence. We have also developed an EM framework to learn the MOS model from a dataset, and experimentally evaluated it on real, high-dimensional datasets.
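The two changes to the classical mixture model described above can be illustrated with a toy generative sketch. This is not the published MOS formulation: the independent on/off activation of components, the uniform choice among competing components, the unit-variance Gaussians, the default means, and the name `sample_mos_point` are all assumptions made only for illustration.

```python
import random


def sample_mos_point(components, default, rng):
    """Draw one data point from a toy mixture-of-subsets process.

    components: list of dicts with keys
        "p"     - probability that the component is active for a point
        "means" - {attribute_index: mean} for the few attributes it influences
    default: per-attribute means used when no active component influences
        an attribute (an assumption of this sketch).
    """
    # Each component switches on independently, so a point can be
    # generated by a set of components rather than exactly one.
    active = [c for c in components if rng.random() < c["p"]]
    point = []
    for j, fallback in enumerate(default):
        # Only active components that influence attribute j may generate it.
        owners = [c for c in active if j in c["means"]]
        mean = rng.choice(owners)["means"][j] if owners else fallback
        # Unit-variance Gaussian noise is an illustrative assumption.
        point.append(rng.gauss(mean, 1.0))
    return point, active


rng = random.Random(0)
components = [
    {"p": 0.5, "means": {0: 5.0, 1: 5.0}},  # influences attributes 0 and 1 only
    {"p": 0.3, "means": {2: -4.0}},         # influences attribute 2 only
]
default = [0.0, 0.0, 0.0, 0.0]
point, active = sample_mos_point(components, default, rng)
```

Because each component touches only a small subset of attributes, it can be read on its own as "these attributes tend to move together", which is the interpretability goal stated above. Learning such a model from data would require an EM procedure, which this sketch does not attempt.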
Our results show that the MOS model learned from the data is easy to understand and interpret, and represents the underlying nature of the data accurately.
Data Mining for Microbial Resistance
The current arsenal of antimicrobial or antibiotic drugs for treating bacterial infections is one of the most important public health tools available, but it is not an inexhaustible resource. The more haphazardly antimicrobial drugs are used, the more the targeted pathogens develop resistance. Once a pathogen develops resistance to all of the available drugs, treating an infected patient may become difficult or impossible. This project is a collaboration between computer scientists and health scientists aimed at developing data mining tools for discovering when and why antimicrobial resistance appears in nosocomial (hospital-acquired) infections.
Publications
Jun Liu, Nirmalya Bandyopadhyay, Sanjay Ranka, Michael Baudis, Tamer Kahveci. Inferring Progression Models for CGH data. Journal of Bioinformatics, to appear.
Jun Liu, Jaaved Mohammed, James Carter, Sanjay Ranka, Tamer Kahveci, and Michael Baudis, Distance-based clustering of CGH data. Bioinformatics, 22(16):1971–1978, 2006.
Jun Liu, Sanjay Ranka, and Tamer Kahveci. Markers improve clustering of CGH data. Bioinformatics, 23(4):450–457, 2007.
Christopher M. Jermaine, Subramanian Arumugam, Abhijit Pol, Alin Dobra. "Scalable approximate query processing with the DBO engine," Proceedings of the ACM SIGMOD International Conference on Management of Data, 2007, p. 725.
John Gums, Sanjay Ranka, Chris Jermaine. "Heterogeneity in Resistance Trends Greatest in Large Hospitals: Results of the Antimicrobial Resistance Management Program," 47th Annual Interscience Conference on Antimicrobial Agents and Chemotherapy (ICAAC), v.47, 2007.
John Gums, Sanjay Ranka, Christopher Jermaine. "Significant Heterogeneity Found in Resistance Trends Between Hospitals: Results of the Antimicrobial Resistance Management Program," 47th Annual Meeting of the Infectious Diseases Society of America (IDSA 2007), v.47, 2007.
Manas Somaiya, Christopher M. Jermaine, Sanjay Ranka. "Learning correlations using the mixture-of-subsets model," ACM Transactions on Knowledge Discovery from Data, v.1, 2008.
Mingxi Wu, Christopher M. Jermaine. "A Bayesian Method for Guessing the Extreme Values in a Data Set," Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB 2008), v.33, 2008, p. 471.
S.M. Smith, J.G. Gums, C. Jermaine, S. Ranka. "The Implications of Phenotypic Clustering of Antimicrobial Resistance Patterns on Predicting Future Trends," 48th Annual ICAAC/46th Annual IDSA Meeting, Washington DC, 2008.
X. Song, Chris Jermaine, Sanjay Ranka, John Gums. "A Bayesian Mixture Model with Linear Regression Mixing Proportions," Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, v.14, 2008.
Xiuyao Song, Mingxi Wu, Christopher M. Jermaine, Sanjay Ranka. "Conditional Anomaly Detection," IEEE Trans. Knowl. Data Eng, v.19 (5), 2007, p. 631.
Xiuyao Song, Mingxi Wu, Christopher M. Jermaine, Sanjay Ranka. "Statistical Change Detection for Multi-Dimensional Data," Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2007), v.13, 2007, p. 667.