--- 2016. Presented at the National Science Foundation (NSF) in late January for the NSF Data Science Seminar Series.
In a concise 25-minute presentation, Jordan made his case for thinking of data science as the combination of computational and inferential thinking, noting that the most appealing challenge in Big Data for him is the potential for personalization. He began by observing that the core theories of computer science and statistics were developed separately, leaving an oil-and-water problem to be surmounted. As an example, he noted that core statistical theory has no place for runtime and other computational resources, while core computational theory has no place for statistical risk. …
The ASA Curriculum Guidelines for Undergraduate Programs in Statistical Science (PDF) states, “Institutions need to ensure students entering the work force or heading to graduate school have the appropriate capacity to ‘think with data’ and to pose and answer statistical questions.” The guidelines also note the increasing importance of data science. While the guidelines were explicitly silent about the first course, they do state the following: …
--- 2016. At Stanford's first Women in Data Science Conference, engineers from industry and academia discuss personalized medicine, entertainment, marketing, cybersecurity and more.
Almost anywhere we turn, evidence of a data revolution abounds. That realization suffused the inaugural Women in Data Science Conference at Stanford. Sharing the opening stage with Persis Drell, dean of the Stanford School of Engineering, was conference organizer Margot Gerritsen, associate professor of energy resources engineering at Stanford and director of the Institute for Computational and Mathematical Engineering. Gerritsen amplified Drell's remarks, saying: “Data science is a very rapidly growing field of increasing importance. So much research and business decisions are based on data. If we want to ask all of the right questions and analyze all aspects of a problem, we need diversity and multidisciplinary thinking.” Here are several insights that emerged from this daylong exploration of our unprecedented ability to harness the power of data. …
Heterogeneity is unwanted variation when analyzing aggregated datasets from multiple sources. Though different methods have been proposed for heterogeneity adjustment, no systematic theory exists to justify them. In this work, we propose a generic framework named ALPHA (short for Adaptive Low-rank Principal Heterogeneity Adjustment) to model, estimate, and adjust for heterogeneity in the original data. Once the heterogeneity is adjusted for, we are able to remove the biases induced by batch effects and to enhance inferential power by aggregating the homogeneous residuals from multiple sources. Under the pervasive assumption that the latent heterogeneity factors simultaneously affect a large fraction of the observed variables, we provide a rigorous theory to justify the proposed framework. Our framework also allows the incorporation of informative covariates and appeals to the "Blessing of Dimensionality". As an illustrative application of this generic framework, we consider the problem of estimating a high-dimensional precision matrix for graphical model inference based on multiple datasets. We also provide thorough numerical studies on both synthetic datasets and a brain imaging dataset to demonstrate the efficacy of the developed theory and methods.
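To make the general recipe concrete, here is a rough Python sketch of low-rank heterogeneity adjustment: a rank-k principal-component (factor) contribution is removed from each source, the residuals are pooled, and a sparse precision matrix is fit to the pooled residuals. This is only an illustration of the idea, not the ALPHA estimator itself; the number of factors k, the plain PCA factor estimate, and the graphical-lasso step are assumptions made for the example.

```python
# Illustrative sketch only: estimate a rank-k "heterogeneity" component in
# each source with plain PCA, remove it, pool the residuals, and fit a sparse
# precision matrix. ALPHA itself uses adaptive factor estimation with
# theoretical guarantees; k and the GraphicalLasso penalty are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.covariance import GraphicalLasso

def remove_low_rank(X, k):
    """Return residuals of X after subtracting its rank-k PCA approximation."""
    Xc = X - X.mean(axis=0)
    pca = PCA(n_components=k).fit(Xc)
    low_rank = pca.inverse_transform(pca.transform(Xc))
    return Xc - low_rank  # approximately "homogeneous" residuals

def pooled_precision(datasets, k=2, penalty=0.05):
    """datasets: list of (n_i x p) arrays from different sources."""
    residuals = np.vstack([remove_low_rank(X, k) for X in datasets])
    return GraphicalLasso(alpha=penalty).fit(residuals).precision_
```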
Drawing on work by statisticians John Tukey, John Chambers, Bill Cleveland and Leo Breiman, the author presents a vision of data science based on the activities of people who are ‘learning from data’, and describes an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s Data Science Initiatives, while being able to accommodate the same short-term goals.
This paper develops a framework for testing for associations in a possibly high-dimensional linear model where the number of features/variables may far exceed the number of observational units. In this framework, the observations are split into two groups, where the first group is used to screen for a set of potentially relevant variables, whereas the second is used for inference over this reduced set of variables; the authors also develop strategies for leveraging information from the first part of the data at the inference step for greater accuracy.
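As a rough illustration of the split-sample recipe (not the authors' exact procedure, which also carries information from the screening stage into the inference stage), one can screen with the lasso on one half of the data and run classical least-squares inference on the selected variables with the other half:

```python
# A minimal sketch of the split-sample idea, assuming a lasso screen on the
# first half and ordinary least-squares inference on the second half. The
# function name and tuning choices here are illustrative, not the paper's.
import numpy as np
from sklearn.linear_model import LassoCV
import statsmodels.api as sm

def split_screen_infer(X, y, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    first, second = idx[: len(y) // 2], idx[len(y) // 2 :]

    # Stage 1: screen for potentially relevant variables on the first half.
    lasso = LassoCV(cv=5).fit(X[first], y[first])
    selected = np.flatnonzero(lasso.coef_)
    if selected.size == 0:
        return selected, None

    # Stage 2: classical inference on the reduced set using the second half.
    design = sm.add_constant(X[second][:, selected])
    fit = sm.OLS(y[second], design).fit()
    return selected, fit.pvalues[1:]  # p-values for the selected variables
```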
They study randomized sketching methods for approximately solving least-squares problems with a general convex constraint. The quality of a least-squares approximation can be assessed in different ways: either in terms of the value of the quadratic objective function (cost approximation), or in terms of some distance measure between the approximate minimizer and the true minimizer (solution approximation). Focusing on the latter criterion, their first main result provides a general lower bound on any randomized method that sketches both the data matrix and vector in a least-squares problem; as a surprising consequence, the most widely used least-squares sketch is sub-optimal for solution approximation. They then present a new method known as the iterative Hessian sketch, and show that it can be used to obtain approximations to the original least-squares problem using a projection dimension proportional to the statistical complexity of the least-squares minimizer, and a logarithmic number of iterations.
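A minimal sketch of the iterative Hessian sketch for the unconstrained case, assuming Gaussian sketching matrices (the sketch size m, the iteration count, and the zero initialization are illustrative choices, not the paper's tuning):

```python
# A minimal sketch of iterative Hessian sketching for unconstrained least
# squares, assuming Gaussian sketching matrices. The sketch size m must be
# large enough (at least d) for the sketched Hessian to be invertible.
import numpy as np

def iterative_hessian_sketch(A, y, m, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_iter):
        S = rng.standard_normal((m, n))   # fresh sketch at every iteration
        SA = S @ A
        H = SA.T @ SA / m                 # sketched approximation of A^T A
        grad = A.T @ (y - A @ x)          # exact (negative) gradient of the LS objective
        x = x + np.linalg.solve(H, grad)  # Newton-like correction step
    return x
```

Only the Hessian term is sketched, so the per-iteration cost of forming and factoring H scales with the sketch size m rather than with n, and the distance to the exact least-squares solution contracts geometrically once m is on the order of the problem's statistical dimension.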