KDD2015 Workshop on Learning from Small Sample Sizes

Rationale and Objectives

Rationale

There has been much work in the fields of biostatistics and epidemiology on the analysis of longitudinal clinical data derived from clinical trials. The advent of electronic medical records allows researchers to access data sets much larger, but also much noisier, than curated data from clinical trials. As the field of machine learning matures, the time is ripe for researchers to investigate new models applied to clinical data.

The fast-growing field of information technology is changing our clinical systems now more than ever. A large amount of clinical data is now digitized, and clinical decisions are made more accurately and more efficiently thanks to the Electronic Medical Record (EMR), which is becoming universally available. In the United States, for instance, EMR adoption is growing rapidly, driven in part by recent regulatory mandates and government funding, especially the HITECH Act in the American Recovery and Reinvestment Act (ARRA). The motivation is that we will be able to extract key, actionable information from electronic data more robustly and use it meaningfully (i.e., to meet the Meaningful Use criteria), hence improving clinical, financial, and operational outcomes.

Despite the increasing emphasis on collecting key information in structured fields of EMRs, much of the key information needed for measuring and driving process efficiencies still resides in unstructured (free) text, and often needs to be mined and extracted into structured form.

This is primarily because it is impossible to anticipate and precisely identify and define all the relevant information that would be useful for clinical, operational, and financial needs that may arise in the future. Given the constantly changing nature of medical knowledge in the form of evidence-based treatment guidelines, the definitions of key information elements also change rapidly over time. Thus, there is increasing interest in learning approaches that can easily be adapted to these changes.

Purpose of the Workshop

Last year, we organized a workshop at ICML on the topic of learning from unstructured clinical texts. There were 70 registrants. The participants reported very high satisfaction with the format and content of the workshop. They also showed much interest in a workshop that goes beyond learning from clinical text and includes research on learning from clinical data in general, e.g. combining unstructured and structured data, or using outcomes, diagnoses, IVD, image data, etc.

The purpose of this cross-discipline workshop is to bring together machine learning, computational linguistics, and medical informatics researchers interested in problems and applications of learning from both structured and unstructured clinical data. The goal of the workshop will be to bridge the gap between the theory of machine learning and natural language processing and the applications and needs of the healthcare community. There will be an exchange of ideas, identification of important and challenging applications, and discovery of possible synergies. Ideally this will spur discussion and collaboration between the various disciplines and result in collaborative grant submissions. Thus far, research on clinical text has intersected very little with research on other clinical data, such as time series. We expect this workshop to open new avenues for discussion among researchers working with structured clinical data and those working with clinical text. The emphasis will be on the mathematical and engineering aspects of learning and how they relate to practical medical problems.

Clinical data sets present some daunting and unique challenges. In clinical text, vocabularies tend to be very large and, to make things worse, are used with little attention to standards and normalization. Time series of clinical measurements are often non-linear and very sparse for any given patient. Both types of data are noisy, and values are missing (often not at random). Due to variations in patient states over time, and the lack of data sharing across hospitals and research institutions, data is often inconsistent across data sets. Thus, inferences need to be made over time by accounting for variations, inconsistencies, and missing information. Another challenge for machine learning methods is adapting, often incrementally, to changing guidelines and clinical literature without requiring complete retraining and relabeling of data.

We plan to address many of these topics through both invited and contributed talks. The workshop program will consist of presentations by invited speakers from both the machine learning and medical informatics fields and by authors of extended abstracts submitted to the workshop. In addition, there will be a slot for a panel discussion to identify important problems, applications, and synergies between the two scientific disciplines.
