BigCHat: KDD 2014 Workshop on Connected Health at Big Data Era

The availability of big data and the emergence of network science as an area of inquiry are changing how we understand our lives, our social interactions, and our day-to-day activities. This well-connected world imposes new requirements on healthcare: transforming it from reactive and hospital-centered to preventive, proactive, evidence-based, person-centered, and focused on well-being rather than ailment recovery. Various types of data are involved in this broader context of healthcare:

  • Clinical data: patient records from clinical institutions, such as medical images, electronic health records, and clinical trial data.
  • Genotype data: the genetic makeup of individuals, such as DNA and protein sequences.
  • Social media data: information that individuals post on online social platforms such as Facebook, Twitter, and PatientsLikeMe.
  • Environmental sensory data: information sampled from the environment individuals live in, such as air pollution and humidity readings.
  • Behavioral and sentiment data: for example, activity data recorded by wearable devices.
  • Mobile data: data sampled from individuals' mobile devices.

Integrating all these kinds of information to make people healthier holds huge potential, but it is also a problem of vital importance that requires effort from many parties, among whom data miners play a major role. This makes the workshop highly relevant to KDD, the premier conference on data mining. Moreover, last year the U.S. National Science Foundation launched a new program on Smart and Connected Health, which makes this workshop a timely venue for sharing opinions and experiences on the topic.


Aug. 24, 2014, 2:00pm-6:00pm at Bloomberg. See the KDD conference program for the full schedule.


 Time Type Presentation
 2:00-2:45 Invited Talk Shahram Ebadollahi: Enabling the healthcare eco-system with knowledge- and data-driven evidence and insights: Health IT in the era of Big Data
 2:45-3:30 Invited Talk GQ Zhang: Quality Assurance of Biomedical Ontologies: A Big Data Approach
 3:30-3:40 Paper Presentation Prediction of Biomedical Events via Time Intervals Mining (paper)
 3:40-3:50 Paper Presentation The Current Status of the Health Network in China: A Real World Case Study with 106,021 Hospitals (paper)
 3:50-4:00 Break 
 4:00-4:45 Invited Talk Joydeep Ghosh: Towards High-throughput Phenotype Generation via Joint Tensor Factorization
 4:45-5:30 Invited Talk Henry Kautz: Mining Public Health Information from Social Media
 5:30-5:40 Paper Presentation CAESAR-ALE: An Active Learning Enhancement for Conditions Severity Classification (paper)
 5:40-5:50 Paper Presentation Can Your Smartphone Reveal You Are Depressed? (paper)
 5:50-6:00 Paper Presentation Semantic Considerations in Time-Interval Mining (paper)

Topics of Interests

The topics of this workshop include, but are not limited to, the following:

  • Integration and matching of different data sources
  • Quality assessment and improvement of different data
  • Disease modeling and early intervention
  • Data-driven methods for personalized medicine
  • Care coordination and pathway analysis
  • Behavioral modeling and sentiment analysis
  • Mobile health
  • Social media and public health
  • Comprehensive risk prediction
  • Community based elder care
  • Large scale and longitudinal analysis of multi-faceted information
  • Visual analytics and interactive computation

Program Committee Chairs

  • Fei Wang. Research Staff Member. IBM T. J. Watson Research Center.
  • Hanghang Tong. Assistant Professor. Department of Computer Science. City College. City University of New York.
  • Munmun De Choudhury. Assistant Professor. School of Interactive Computing. Georgia Institute of Technology.
  • Zoran Obradovic. Laura H. Carnell Professor. Computer and Information Sciences Department, Temple University.

Publicity Chair

  • Xiang Wang. Research Scientist. IBM T. J. Watson Research Center.

Invited Speakers


Shahram Ebadollahi: Dr. Shahram Ebadollahi is the Vice President, Health Informatics Research and the Chief Science Officer, IBM Healthcare. In his capacity as the head of Health Informatics Research he has global responsibility for the direction of IBM Research in the area of healthcare research and oversees a multi-disciplinary team of scientists across IBM worldwide research laboratories. He and his team have conducted research in the broad area of health informatics with specific focus on Computational Healthcare, which aims at applying data-driven analytics and big data approaches to the domain of healthcare.
In his capacity as IBM’s Chief Science Officer, he has responsibility across IBM brands and offerings for the definition and setting of technical strategy and execution of IBM healthcare offerings and innovation in support of those offerings.
Dr. Ebadollahi’s work has enabled software and service offerings by IBM in the areas of Smarter Care and Real World Evidence with applications to healthcare payers, providers and pharmaceutical companies.
Dr. Ebadollahi received his PhD and MS degrees in Electrical Engineering from Columbia University before joining IBM Research. He has published extensively in the areas of analytics in general and healthcare analytics in particular. His work has been reported in various articles in the media. He has also served as adjunct faculty at Columbia University in New York and has advised and overseen the research of doctoral students in the areas of multimedia and medical imaging.

Title: Enabling the healthcare eco-system with knowledge- and data-driven evidence and insights: Health IT in the era of Big Data

Abstract: Improving outcomes, reducing costs, and enhancing patient experience are the three aims of the healthcare system in the US. Information technology has an important role to play in enabling organizations to achieve these goals by providing timely and personalized evidence and insights. Such evidence and insights can be derived from large repositories of published scientific research, as well as by mining and learning from the observational data that are becoming increasingly available through the adoption of electronic health records, enhanced by patient-generated data.
I will share with the audience examples of knowledge-driven and data-driven technologies and their applications to payer, provider, and pharmaceutical organizations. These technologies span advanced Q&A, predictive modeling, cohort identification, risk stratification, care planning and visual analytics for decision support.


GQ Zhang: Dr. Zhang is Division Chief of Medical Informatics, Co-Director of Biomedical Research Information Management of the NCATS-funded Case Western CTSA, and PI of the NHLBI-funded National Sleep Research Resource Center. His research spans large-scale, multi-center data integration, ontological engineering, query interface design, and information retrieval. Dr. Zhang has led a group of faculty, students, and developers that has deployed half a dozen tools for data capture, data management, and data integration, effectively bringing cutting-edge computer science and informatics methodology to bear on these data challenges.

Title: Quality Assurance of Biomedical Ontologies: A Big Data Approach

Abstract: Ontologies are shared conceptualizations of a domain represented in a formal language. They represent not only the concepts used in scientific work but, just as importantly, the relationships between the concepts. Ontologies have become the "brain" of many data management systems in biomedicine. They have been used, for example, to handle terminological heterogeneity, facilitate system interoperability and integration, enable knowledge discovery, and manage biomedical big data.

However, ontological systems are often incomplete, under-specified, and non-static, for reasons such as the evolving state of knowledge in a domain, the involvement of manual curation work, and the progressive nature of knowledge engineering itself. New applications call for new ontologies, or for the expansion and enhancement of existing ones. Many additional factors, such as merging or reusing existing ontologies and porting them to a common representation framework, may introduce inconsistencies and unintended artifacts. Thus Ontology Quality Assurance (OQA) is an indispensable part of the knowledge engineering lifecycle.

This talk presents an exemplar "Big Data" approach to OQA, using MapReduce to extract non-lattice fragments in ontological hierarchies. We are interested in non-lattice fragments because they are often indicative of structural anomalies in ontological systems and, as such, represent possible areas of focus for subsequent quality assurance work. However, extracting the non-lattice fragments in large ontological systems is computationally expensive, if not prohibitive, using a traditional sequential approach. In this talk I will present a general MapReduce pipeline, called MaPLE (MapReduce Pipeline for Lattice-based Evaluation), for extracting non-lattice fragments in large partially ordered sets and demonstrate its applicability to ontology quality assurance. Using MaPLE on a 30-node local Hadoop cloud, all non-lattice fragments in 8 SNOMED CT versions from 2009 to 2014 have been extracted, with an average total computing time of less than 3 hours per version (instead of 3 months using standard desktop machines). With dramatically reduced time, MaPLE makes it feasible not only to perform exhaustive structural analysis of large ontological hierarchies, but also to systematically track structural changes between versions. Preliminary change analysis shows that the average change rates on the non-lattice pairs are up to 38.6 times higher than the change rates of the background structure (concept nodes). This demonstrates that fragments around non-lattice pairs exhibit significantly higher rates of change in the process of ontological evolution.
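As a concrete illustration of the property MaPLE looks for, the sketch below checks concept pairs in a tiny toy is-a hierarchy: a pair is a non-lattice pair when its common ancestors have more than one minimal element, i.e. the pair has no least upper bound. The hierarchy, helper names, and in-memory loop are invented for this example; the actual pipeline distributes the pairwise check over Hadoop.

```python
from itertools import combinations

# Hypothetical toy hierarchy (not the MaPLE code).
# parents[c] = direct parents (more general concepts) of concept c
parents = {
    "root": set(),
    "A": {"root"}, "B": {"root"},
    "C": {"A", "B"}, "D": {"A", "B"},  # C and D share two minimal ancestors
    "E": {"C"},
}

def ancestors(c):
    """All concepts reachable upward from c, including c itself."""
    seen, stack = set(), [c]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen

def minimal(elems):
    """Members of elems with no strict descendant inside elems."""
    return {x for x in elems
            if not any(x in ancestors(y) and x != y for y in elems)}

def non_lattice_pairs(concepts):
    # a pair is non-lattice if its common ancestors have >1 minimal element
    return [(a, b) for a, b in combinations(concepts, 2)
            if len(minimal(ancestors(a) & ancestors(b))) > 1]

print(non_lattice_pairs(list(parents)))  # flags the (C, D) and (D, E) pairs
```

On SNOMED CT-scale hierarchies this quadratic pairwise check is exactly what becomes prohibitive sequentially, which motivates the MapReduce formulation.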

Joydeep Ghosh: Dr. Ghosh is currently the Schlumberger Centennial Chair Professor of Electrical and Computer Engineering at the University of Texas at Austin. He joined the UT-Austin faculty in 1988, after receiving his B.Tech ('83) and his Ph.D. from the University of Southern California ('88). He is the founder-director of IDEAL (Intelligent Data Exploration and Analysis Lab) and a Fellow of the IEEE. Dr. Ghosh has taught graduate courses on data mining and web analytics every year, to both UT students and industry, for over a decade. He was voted "Best Professor" in the Software Engineering Executive Education Program at UT. Dr. Ghosh's research interests lie primarily in data mining and web mining, predictive modeling/predictive analytics, machine learning approaches such as adaptive multi-learner systems, and their applications to a wide variety of complex real-world problems. He has published more than 300 refereed papers and 50 book chapters, and co-edited over 20 books.
He has received 14 Best Paper Awards over the years. Dr. Ghosh has also served as a co-founder, consultant, or advisor to successful startups (Stadia Marketing, Neonyoyo, and Knowledge Discovery One) and as a consultant to large corporations such as IBM, Motorola, and Vinson & Elkins.

Title: Towards High-throughput Phenotype Generation via Joint Tensor Factorization

Abstract: The rapidly increasing availability of electronic health records (EHRs) from multiple heterogeneous sources has spearheaded the adoption of data-driven approaches for improved clinical research, decision making, prognosis, and patient management. Unfortunately, EHR data do not always directly and reliably map to medical concepts such as “phenotypes” that clinical researchers are familiar with and can use. Current approaches to determining and vetting phenotypes require labor intensive supervision from experienced clinical professionals. Furthermore, existing approaches are often disease-centric and specialized to the idiosyncrasies of the information technology and/or business practices of a single healthcare organization.

In this talk, I will describe how certain types of collective, sparse, nonnegative tensor factorization methods can simultaneously yield multiple phenotype candidates with virtually no human supervision. Phenotype candidates correspond to tensor factors that automatically reveal patient clusters on specific diagnoses and medications. I will highlight the promise of data mining based high-throughput phenotyping, as well as key challenges in integrating such approaches with existing domain knowledge to produce even more readily acceptable and actionable results.
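To make the idea concrete, below is a minimal numpy sketch of nonnegative CP (PARAFAC) factorization of a patients x diagnoses x medications count tensor via multiplicative updates; each rank-1 component couples a patient group with co-occurring diagnoses and medications, i.e. a phenotype candidate. The tensor is synthetic and the update rule is a generic NMF-style scheme, not the speaker's actual joint factorization method.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(20, 6, 5)).astype(float)  # toy patients x dx x meds counts
R = 3                                                # number of phenotype candidates

def unfold(T, mode):
    # matricize: rows index `mode`, columns the remaining axes (C order)
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    # column-wise Kronecker product, shape (I*J, R)
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

# random nonnegative initialization of the three factor matrices
factors = [rng.random((n, R)) + 0.1 for n in X.shape]

for _ in range(200):
    for mode in range(3):
        m1, m2 = [m for m in range(3) if m != mode]
        KR = khatri_rao(factors[m1], factors[m2])   # matches C-order unfolding
        Xm = unfold(X, mode)
        num = Xm @ KR
        den = factors[mode] @ (KR.T @ KR) + 1e-9
        factors[mode] *= num / den                  # update keeps entries >= 0

approx = factors[0] @ khatri_rao(factors[1], factors[2]).T
rel_err = np.linalg.norm(unfold(X, 0) - approx) / np.linalg.norm(X)
top_dx = np.argsort(factors[1][:, 0])[::-1][:3]     # top diagnoses of phenotype 0
print(rel_err, top_dx)
```

Reading a candidate off the fit means listing, for each component, the highest-weighted diagnoses and medications (as `top_dx` does for component 0) and asking a clinician whether that combination is meaningful; sparsity constraints, which this sketch omits, keep those lists short.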

Henry Kautz: Dr. Kautz is Chair of the Department of Computer Science and Director of the Institute for Data Science at the University of Rochester. He performs research in social media, machine learning, pervasive computing, search algorithms, and assistive technology. His academic degrees are an A.B. in mathematics from Cornell University, an M.A. in Creative Writing from the Johns Hopkins University, an M.Sc. in Computer Science from the University of Toronto, and a Ph.D. in computer science from the University of Rochester. He was a researcher and department head at Bell Labs and AT&T Laboratories until becoming a Professor in the Department of Computer Science and Engineering of the University of Washington in 2000. He joined the University of Rochester in 2006. He was President (2010-2012) of the Association for the Advancement of Artificial Intelligence (AAAI), and is a Fellow of AAAI, a Fellow of the American Association for the Advancement of Science (AAAS), a Fellow of the Association for Computing Machinery (ACM), and a recipient of the IJCAI Computers and Thought Award.


Title: Mining Public Health Information from Social Media

Abstract: Many companies are data mining social media for marketing purposes. We have begun data mining Twitter and similar media for public health purposes. People posting from their cell phones are, in effect, a mobile organic sensor network. The "data exhaust" users generate from their posts can be used for purposes such as tracking influenza and determining the impact of environmental factors on public health. Unlike traditional methods for gaining public health information, social media data mining is real-time and can be used to make predictions about particular individuals. We demonstrate how statistical language models can be used to identify rare tweets about disease symptoms with surprisingly high precision and recall.
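As a sketch of the last point, even a simple statistical language model over word counts can separate symptom reports from background chatter. The miniature multinomial Naive Bayes below is purely illustrative (made-up tweets, invented helper names) and stands in for the far larger models the talk describes; in practice the hard part is that symptom tweets are rare, so precision at high recall is what matters.

```python
import math
from collections import Counter

# Made-up training tweets: 1 = mentions flu-like symptoms, 0 = background chatter
train = [
    ("feeling feverish and achy all day", 1),
    ("worst sore throat and chills ever", 1),
    ("coughing so much i cant sleep", 1),
    ("great game last night go team", 0),
    ("new phone arrived today so happy", 0),
    ("traffic on the bridge is terrible", 0),
]

def fit(data):
    # per-class unigram counts, class priors, and shared vocabulary
    counts = {0: Counter(), 1: Counter()}
    priors = Counter(label for _, label in data)
    for text, label in data:
        counts[label].update(text.split())
    vocab = set(counts[0]) | set(counts[1])
    return counts, priors, vocab

def predict(text, counts, priors, vocab):
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log(priors[label] / sum(priors.values()))
        for w in text.split():
            # Laplace smoothing handles words unseen in training
            score += math.log((counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict("i have chills and a fever", *model))
```

A production system would swap the toy counts for millions of labeled tweets and a richer feature set, but the scoring logic (class prior plus smoothed per-word log-likelihoods) is the same.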

Key Dates

  • Paper Submission Deadline: June 23rd, Monday, 11:59PM PST
  • Acceptance Notification: July 8th
  • Camera Ready Copy Date: July 31st
  • Workshop Date: August 24th


Manuscripts should be in English, in PDF format, and must not exceed 8 pages in the standard ACM format. Papers should be submitted through the workshop submission website.

Program Committee 

  • Zhengxing Huang. Associate Professor. Zhejiang University. China.
  • Bo Jin. Associate Professor. Dalian University of Technology. China.
  • Jiming Liu. Chair Professor. Hong Kong Baptist University. Hong Kong, China.
  • Robert Moskovitch. Postdoctoral Research Scientist. Columbia University. USA.
  • Kenney Ng. Research Staff Member. IBM T. J. Watson Research Center. USA.
  • Chandan Reddy. Associate Professor. Wayne State University. USA.
  • Stein Olav Skrøvseth. Senior Researcher. Norwegian Centre for Integrated Care and Telemedicine. Norway.
  • Gregor Stiglic. Associate Professor. University of Maribor. Slovenia.
  • Vincent S. Tseng. Distinguished Professor. National Cheng Kung University. Taiwan.
  • Jiayu Zhou. Senior Research Scientist. Samsung Research North America. USA.
