Data Science

What is Data Science?

A definition from Wikipedia [1]:

“Data science, also known as data-driven science, is an interdisciplinary field about scientific processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, machine learning, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD)”.

Unlike the direct path between humans and data in the statistics domain, the indirect path through computer scientists has given us the modern, emerging domain of data science. Data science aims to provide natural human-data interfaces where people can interact naturally with information, building on the concepts of open data (e.g., Drupal/DKAN), open knowledge (e.g., the Open Knowledge Network), open systems (e.g., The Open Group), and open-source software (e.g., deep learning packages such as TensorFlow and PyCASP) and platforms (e.g., CDH, the Cloudera Hadoop distribution).

Data science is an interdisciplinary field because it adopts techniques and theories from a broad spectrum of fields in mathematics, statistics, operations research, information science, and computer science, including signal processing, probability models, machine learning, statistical learning, data mining, databases, data engineering, pattern recognition and learning, visualization, predictive analytics, uncertainty modeling, data warehousing, data compression, computer programming, artificial intelligence, and high-performance computing. Data science also applies to wider domains, including finance, the social sciences, and the humanities.

Data Science Challenges

Finding efficient methods that scale to Big Data is a major challenge, as the explosion of the Internet of Things (IoT) and sensor deployments has put silos of data in our hands to work on. To make the data usable, we need to identify it, curate it, and finally make it accessible or visualize it. Collecting data is therefore no longer the burden in a time when a sea of data is available; the overhead has shifted to storage and to distilling that sea into usable form.

Data harnessing rests on theoretical foundations (mathematics, statistics), systems foundations (data-centric algorithms and systems), and data research. The challenge is for 50-year-old computer science to adapt to data science and foster collaboration with Google, IBM Watson, and other data-centric organizations.

Ownership, privacy, and security are another challenge when working with Big Data. The open data movement, which promotes data that are freely available without restriction, may be able to address this issue.

The transfer and sharing of Big Data among researchers pose further challenges. Globus is trying to provide an efficient environment in which researchers can transfer, share, and publish their research data.

There is also a need for technology more generic than existing applications such as Siri, Cortana, and IBM Watson for solving real-world problems, so that people in academia can embrace it and work together.

The Federal Big Data Research and Development Strategic Plan

Government agency research and public-private partnerships [2], together with the education and training of future data scientists, will enable applications that directly benefit society and the economy of the Nation. To derive the greatest benefits from the many, rich sources of Big Data, the Administration announced a “Big Data Research and Development Initiative” on March 29, 2012. Dr. John P. Holdren, Assistant to the President for Science and Technology and Director of the Office of Science and Technology Policy, stated that the initiative “promises to transform our ability to use Big Data for scientific discovery, environmental and biomedical research, education, and national security.”

The Federal Big Data Research and Development Strategic Plan [2] builds upon the promise and excitement of the myriad applications enabled by Big Data with the objective of guiding Federal agencies as they develop and expand their individual mission-driven programs and investments related to Big Data. The Plan is based on inputs from a series of Federal agency and public activities, and a shared vision:

They envision a Big Data innovation ecosystem in which the ability to analyze, extract information from, and make decisions and discoveries based upon large, diverse, and real-time datasets enables new capabilities for Federal agencies and the Nation at large; accelerates the process of scientific discovery and innovation; leads to new fields of research and new areas of inquiry that would otherwise be impossible; educates the next generation of 21st century scientists and engineers; and promotes new economic growth.

The Plan is built around seven strategies that represent key areas of importance for Big Data research and development (R&D). Priorities listed within each strategy highlight the intended outcomes that can be addressed by the missions and research funding of NITRD agencies. These include advancing human understanding in all branches of science, medicine, and security; ensuring the Nation’s continued leadership in research and development; and enhancing the Nation’s ability to address pressing societal and environmental issues facing the Nation and the world through research and development.

There is also a growing recognition among data scientists that access to relevant data is essential for building on previous results. Robust measures are needed to quantify uncertainty and capture context to ensure reproducibility of results; this will give decision makers the ability to validate the trustworthiness of the data and the products of analyses. Decision makers will require tools for parsing the relevant knowledge applicable to decisions, converting knowledge into possible action, and understanding the implications and impact of those actions.

There has been a groundswell of new Data Science programs offered at institutions that teach the necessary skills for dealing with Big Data. Many of these programs are at the Master’s level, but the number of programs at the undergraduate and Ph.D. levels is increasing. A core data science curriculum includes course material from computer science, statistics, ethics, social science, and policy. In addition to Data Science programs at the undergraduate and graduate levels, programs are emerging to train students from other disciplines in the basics of data science. These disciplines span the full range of science, engineering, biomedicine, clinical medicine, business, social science, humanities, law, and the arts. Consensus about program content is beginning to emerge.

Big Data SSG agencies can play a significant role in helping define the needs and requirements for these programs. In conjunction with the National Research Council’s Committee on Applied and Theoretical Statistics (CATS), NSF brought together the community at a workshop entitled “Training Students to Extract Value from Big Data” to discuss how to educate and train students in order to increase the cadre of data scientists.

Efforts are needed to determine the core educational requirements of data scientists, and investments are needed to support the next generation of data scientists and increase the number of data-science faculty and researchers. As scientific research becomes richer in data, domain scientists need access to opportunities to further their data-science skills, including projects that foster collaborations with data scientists, data-science short courses, and initiatives to supplement training through seed grants, professional-development stipends, and fellowships. In addition, employees and managers in all sectors need access to training “boot camps,” professional-development workshops, and certificate programs to learn the relevance of Big Data to their organizations. More university courses on foundational topics and other short-term training modules are also necessary to help transform the broader workforce into data-enabled citizens. Data-science training should extend to all people through online courses, citizen-science projects, and K-12 education. Research in data-science education should explore the notion of data literacy, curricular models for providing data literacy, and the data-science skills to be taught at various grade levels.

In particular, funding that allows more graduate students to engage in data science research at academic institutions will enable both the research needed to advance the field and the training needed to grow a cohort of core data scientists. NSF recently announced an NSF Research Traineeship (NRT) program designed to encourage the development and implementation of transformative and scalable models for STEM graduate education training. The NRT program includes a Traineeship Track dedicated to effective training of STEM graduate students in high-priority interdisciplinary research areas, through the use of a comprehensive traineeship model that is innovative, evidence-based, scalable, and aligned with changing workforce and research needs. Additionally, NSF is catalyzing the growth of data science infrastructure and data scientists by leveraging existing programs to incorporate data science training into their solicitations. The NIH is implementing strategies to train domain researchers to engage with Big Data. In May 2015, NIH announced the first round of Big Data to Knowledge (BD2K) Institutional Training Grant awards, which provide undergraduate and graduate students with integrated training in computer science, informatics, statistics, mathematics, and biomedical science.

CWRU’s Data Science Movement

Through the collaboration of its business and higher education members, the Business-Higher Education Forum (BHEF) [3] launched the National Higher Education and Workforce Initiative (HEWI) to create new undergraduate pathways in high-skill, high-demand fields such as data science and analytics. Applied Data Science (ADS) concepts are a fundamental need of today’s global business community, applying to verticals such as health, smart and connected cities, the environment, natural hazards, energy, and materials and advanced manufacturing, as well as cross-cutting areas such as cyberinfrastructure, privacy, and security.

“With support from BHEF and our industry partners, our faculty members are preparing students with the data science skills that the business community requires and that will launch them into successful careers. Our minor in applied data science is providing a national model to help all of us in higher education better meet the workforce needs of the 21st century.” - Barbara R. Snyder, BHEF Chair, President, CWRU

In July 2013, Case Western Reserve and BHEF initiated a data science workforce strategy, and creation of the new undergraduate program began. Case Western Reserve’s forward-thinking data science curriculum will help produce data scientists who can create efficient models that extrapolate and accurately predict product performance, greatly shortening the time to commercialize a product.

Launched in fall 2014, the ADS minor is available to all undergraduate students attending any Case Western Reserve school, including the Case School of Engineering (CSE), College of Arts and Sciences (CAS), Frances Payne Bolton School of Nursing (FPB), School of Medicine (SOM), and Weatherhead School of Management (Weatherhead). Students learn the essential elements of ADS, which include data management, distributed computing, informatics, ontology, query, and statistical analytics. They also learn how to conduct data analysis, from defining ADS questions to creating reproducible research. The current domain areas for minor concentration are business (including finance, marketing, and economics), engineering and physical sciences (including energy, manufacturing, and astronomy), and health (including translational and clinical). The curriculum is composed of five three-credit courses that advance the student through five levels, from data science programming through exploratory ADS and undergraduate ADS research to modeling and prognostics.

Students benefit from real-world application of their learning through large, industry-provided data sets and problems that markedly differ from the small and predictable datasets normally used for teaching. The university’s Solar Durability and Lifetime Extension (SDLE) Center, a world-class research center dedicated to lifetime and degradation science, also is a key resource for student research projects.

Case Western Reserve plans to develop a post-baccalaureate certificate program in data science for industry personnel interested in retraining. In addition, Case Western Reserve already provides a variety of data science opportunities at the graduate level, including a master’s degree track in health informatics at CSE, an MS or PhD in systems biology and bioinformatics at SOM, and master’s level nursing informatics courses at FPB.

SDLE Center at CWRU

The Solar Durability and Lifetime Extension (SDLE) Center [4] aims to extend the durability and performance of the materials and technologies built into solar panels. “What we are seeing is that energy is a big topic,” said Roger French, a professor of materials science and director of the center. “They (materials and products) are a large investment. And if you can’t be comfortable with how long they will last, you will have a major hesitation about whether you should buy it.”

The purpose of the Energy Common Research Analytics and Data Lifecycle Environment (Energy CRADLE) is to create, for engineering and in particular lifetime science, the tools and protocols necessary to transform Big Data into information that informs scientific knowledge and guides further analysis. Energy CRADLE is tightly focused on serving the needs of handling and sharing data among the SunFarm network researchers. Raw data collected from the SunFarms go through pre-processing and semantic annotation and are stored in a NoSQL Hadoop system. With domain knowledge, Energy CRADLE can manage the organization and orchestration of the data, making queries over the data more efficient. The Energy CRADLE data integration environment has two features. First, it can push all the raw data collected from the SunFarms onto a Hadoop Distributed File System (HDFS) and further map them into HBase, a distributed database. Second, through Thrift and REST servers, users can interact with the data stored in HBase through a visual front end.
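
To illustrate the second feature, the sketch below uses happybase, a Thrift-based Python client for HBase, to write and read one annotated SunFarm record. The host, table, row-key layout, and column names are hypothetical; the actual Energy CRADLE schema may differ.

    import happybase  # Thrift-based HBase client

    # Hypothetical Thrift server and table names for illustration only.
    connection = happybase.Connection("hbase-thrift.example.edu", port=9090)
    table = connection.table("sunfarm_timeseries")

    # Store one pre-processed, semantically annotated measurement.
    table.put(
        b"panel42#2017-05-01T12:00:00Z",
        {
            b"meas:power_w": b"212.7",
            b"meas:temp_c": b"41.3",
            b"meta:site": b"SunFarm",
        },
    )

    # Scan back every row for panel42 via a row-key prefix.
    for key, data in table.scan(row_prefix=b"panel42#"):
        print(key, data)

    connection.close()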

RCCI’s Offerings for the Data Science Curriculum

The Research Computing and Cyberinfrastructure (RCCI) division in ITS has already set up the infrastructure for Big Data analytics. It provides a Cloudera-based Hadoop cluster for performing end-to-end Big Data workflows. Methods that scale to Big Data, such as data analysis and visualization using Cloudera Hadoop and decision making through deep learning, are of particular interest in data science. The CWRU High Performance Computing (HPC) cluster comprises widely used open-source analytical software, including deep learning packages such as Torch, Caffe, Neuron, and NumPy, with both multiprocessor and GPU capabilities. The HPC guides to this software and the documentation on deep learning will help researchers get started with neural networks and machine learning for advanced research. Scientific visualization toolkits available through RCCI, such as VisIt and ParaView, allow researchers to visualize high-resolution images and videos in different data formats.
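
As a quick sanity check on such an installation, a researcher might run a minimal script like the following TensorFlow 1.x-style sketch (assuming a TensorFlow GPU build is loaded on the cluster), which logs whether its one matrix multiplication lands on a GPU or falls back to a CPU:

    import tensorflow as tf  # assumes a TensorFlow 1.x GPU build is loaded

    # A tiny matrix multiplication used purely as a device-placement test.
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]], name="a")
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]], name="b")
    c = tf.matmul(a, b, name="c")

    # log_device_placement prints whether each op ran on /gpu:0 or /cpu:0.
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        print(sess.run(c))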

The inexpensive, expandable RCCI storage and archival solution (high-performance Panasas storage, high-volume FluidFS, and research archival based on Spectra Logic BlackPearl and an LTO-7 tape library) currently helps researchers not only store Big Data but also use and retrieve it easily for analysis at a later date. Cloud-based solutions such as Amazon AWS have also been investigated in case demand grows beyond RCCI’s capacity.

On top of the general Hadoop cluster, RCCI has created a separate private Hadoop cluster for the SDLE Center, using Dell R720 and Dell R720xd servers, that can meet the needs of Professor Roger French’s data science class. The SDLE Center is making good use of the open-source statistical tool R, with OpenMPI installed, on the HPC.
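
The class’s R-over-OpenMPI workflow follows the standard MPI scatter/reduce pattern. To keep all examples in this note in one language, here is the same pattern sketched in Python with mpi4py rather than R; the dataset is synthetic and the script name is hypothetical.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # The root rank splits a synthetic dataset into one chunk per rank.
    if rank == 0:
        data = np.arange(1000000, dtype="d")
        chunks = np.array_split(data, size)
    else:
        chunks = None

    # Each rank receives its chunk and computes a partial sum and count.
    chunk = comm.scatter(chunks, root=0)
    partial = np.array([chunk.sum(), float(chunk.size)])

    # Element-wise sum of the partials lands back on the root rank.
    totals = np.zeros(2, dtype="d") if rank == 0 else None
    comm.Reduce(partial, totals, op=MPI.SUM, root=0)
    if rank == 0:
        print("global mean:", totals[0] / totals[1])

Launched with, e.g., mpirun -np 4 python global_mean.py, each rank works on its own slice and only the root prints the combined mean.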

Globus, for efficient file transfer and sharing, has already been established in the RCCI facility through the Science DMZ project, with a 100 Gb network link to OARnet/Internet2 using a data transfer node (dtn1). Researchers are already taking advantage of this infrastructure for their huge genome data.
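
For researchers who want to script such transfers, the sketch below uses the globus-sdk Python package. The client ID and endpoint UUIDs are placeholders, the paths are hypothetical, and the exact login flow can vary across SDK versions.

    import globus_sdk

    # Placeholders: register a native app at developers.globus.org and look
    # up the real endpoint UUIDs (e.g., dtn1 and a collaborator's endpoint).
    CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"
    SRC_ENDPOINT = "source-endpoint-uuid"
    DST_ENDPOINT = "destination-endpoint-uuid"

    # Native-app login: print a URL, then paste back the authorization code.
    auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
    auth_client.oauth2_start_flow()
    print("Log in at:", auth_client.oauth2_get_authorize_url())
    tokens = auth_client.oauth2_exchange_code_for_tokens(input("Auth code: "))
    token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(token)
    )

    # Queue an asynchronous directory transfer between the two endpoints.
    tdata = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT,
                                    label="genome data")
    tdata.add_item("/data/genomes/", "/archive/genomes/", recursive=True)
    task = tc.submit_transfer(tdata)
    print("Submitted transfer, task ID:", task["task_id"])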

Beyond promoting the open data movement, RCCI also offers the Secure Research Environment (SRE), a FISMA-based security program for sensitive content, so that such data can be stored and analyzed inside the environment.

We have completed our investigation of the container solution Singularity and installed it on the HPC. Singularity allows researchers to install an image of the software of their choice, especially deep learning packages (e.g., TensorFlow) and other signal processing packages that have compatibility issues across platforms, without worrying about the HPC computing platform. For example, researchers can bootstrap a TensorFlow image created on a certain version of Ubuntu and run it in RCCI’s RHEL environment.
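
A minimal sketch of that workflow follows, driven from Python to stay in the same language as the other examples (the image tag and training script are hypothetical; the same two singularity commands could be run directly from a shell):

    import subprocess

    # Pull a Docker-built TensorFlow image (Ubuntu userland) into a local
    # Singularity image file; the tag shown here is only an example.
    subprocess.run(
        ["singularity", "pull", "--name", "tensorflow.simg",
         "docker://tensorflow/tensorflow:latest-gpu"],
        check=True,
    )

    # Run a training script inside the container; --nv maps the host's
    # NVIDIA driver and GPUs into the Ubuntu userland on the RHEL host.
    subprocess.run(
        ["singularity", "exec", "--nv", "tensorflow.simg",
         "python", "train.py"],
        check=True,
    )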

Per the requirements the SDLE lab set out during our meeting with the team, TensorFlow-GPU has been made available to them, and a student is currently submitting a journal paper. Roger French, director of the SDLE Center, requested OS environments other than the RHEL available on the HPC, and Singularity was suggested to him.

RCCI will also be reaching out to the CWRU data science team to understand its current and future needs in research and academic computing and services.

Conclusion

In view of the growing demand for a data science curriculum, RCCI has already put itself in a strong position to meet the needs of faculty, students, and researchers. By supporting faculty who are active in the data science curriculum, we have demonstrated our ability to sustain it. We have also evaluated newer technologies such as Singularity and will evaluate more, such as Drupal. We remain vigilant about researchers’ future needs, and we have positioned ourselves ahead of demand through continual investigation of new technologies in our test computing environment. We are now putting into practice the data science knowledge gained from the GPU Technology Conference (GTC 2017).

References

[1] Wikipedia, “Data science.”

[2] The Federal Big Data Research and Development Strategic Plan, Networking and Information Technology Research and Development (NITRD) Program, 2016.

[3] Business-Higher Education Forum (BHEF).

[4] CWRU Solar Durability and Lifetime Extension (SDLE) Center.