The Promise of Big Data: Influence Maximization, Diffusion of
Innovation, and Information Trends in Social Networks
Divy Agrawal
University of California at Santa Barbara

With hundreds of millions of users worldwide, online social networks provide incredible opportunities for social connection, learning, political and social change, and individual entertainment and enhancement in a wide variety of forms. In light of these notable outcomes, understanding influence maximization, diffusion of innovation, and information trends over online social networks are critical research problems. Because many social interactions currently take place in online networks, we now have access to unprecedented amounts of information about social interactions. Prior to the advent of such online networks, investigations about social behavior required resource-intensive activities such as random trials, surveys, and manual data collection to gather even small data sets. Now, massive amounts of information about social networks and social interactions are recorded. This wealth of data (Big Data) can allow us to study social interactions on a scale and at a level of detail that has never before been possible.

We present an integrated approach to conduct data-driven modeling and analysis in online social networks focusing on three key problems: (1) Querying and analysis of online social network datasets; (2) Modeling and analysis of social networks; and (3) Analysis of social media and social interactions in the contemporary media environment. The overarching goals are to generate a greater understanding of social interactions in online networks through data analysis, to develop reliable and scalable models that can predict outcomes of these social processes, and ultimately to create applications that can shape the outcome of these processes.

We start by developing and refining models of information diffusion based on real-world data sets. We next address the problem of finding influential users in this data-driven framework. It is equally important to identify techniques that can slow or prevent the spread of misinformation, and hence algorithms are explored to address this question. A third interest is the process by which a social group forms opinions about an idea or product, and we therefore describe preliminary approaches to create models that accurately capture the opinion formation (i.e., influence) process in online social networks. While questions relating to the propagation of a single news item or idea are important, these information campaigns do not exist in isolation. Therefore, our proposed approach also addresses the interplay of the many information diffusion processes that take place simultaneously in a network and the relative importance of different topics or trends over multiple spatial and temporal resolutions. 
I/O-efficient Algorithms for Processing Big Terrain Data
Lars Arge

In the last decade new mapping technologies such as LiDAR have not only resulted in dramatical improvements in terrain data quality, but also in a spectacular increase in the amount of terrain data being acquired. Unfortunately, the increased dataset sizes has also revealed scalability problems in many applications. Often these problems are a result of the underlying algorithms not taking the hierarchical structure of memory systems into account (and especially the large difference in the access time of main memory and disk). In this talk we will discuss work on algorithms that minimize the number of accesses to disk - so-called I/O-efficient algorithms - and give examples of big terrain data applications (especially flood risk modeling applications) where the use of I/O-efficient algorithms has led to large practical runtime improvements.

Big Data Analysis with Statistical Methods
Lawrence Carin
Duke University

In this talk I will provide an overview of recent developments in statistical methods for big data. I will review relatively mature methods from the frequentist literature, and explain how closely related methods are now being developed in the Bayesian literature. I will show an example problem of Bayesian and non-Bayesian factorization of a 1M x 1M matrix, with computations performed efficiently on a PC.

Small Summaries for Big Data
Graham Cormode
AT&T Research

In dealing with big data, we often need to look at a small summary to get the big picture. Over recent years, many new techniques have been developed which allow important properties of large distributions to be extracted from compact and easy-to-build summaries. In this talk, I'll highlight some examples of different types of summarization: sampling, sketching, and special-purpose. Then I'll outline the road ahead for further development of such summaries.

Enabling Declarative Analytics over Big, Dynamic Information Networks
Amol Deshpande
University of Maryland, College Park

Over the last decade, information networks have become ubiquitous and widespread. These include social networks, communication networks, financial transaction networks, citation networks, disease transmission networks, and many more. Social contact graphs are expected to be available for analysis in near future, and can potentially be used to gain insights into various social phenomena as well as in disease outbreak and prevention. There is thus a growing need for data management systems that can support real-time ingest, storage, and querying over information networks, and complex analysis over them. In this talk, I will discuss some of our early work on building a distributed graph database system to support declarative analytics and continuous queries over very large, dynamic information networks.

The TokuFS Streaming File System
Martin Farach-Colton
Rutgers University & Tokutek

The TokuFS Streaming file system outperforms write-optimized file systems by an order of magnitude on microdata write workloads, and outperforms read-optimized file systems by an order of magnitude on read workloads. Microdata write workloads include creating and destroying many small files, performing small unaligned writes within large files, and updating metadata. TokuFS is implemented using fractal tree indexes, which are primarily used in databases. TokuFS employs block-level compression to reduce its disk usage.

Joint work with Michael Bender, John Esmet and Bradley Kuszmaul.

Making Sense at Scale with Algorithms, Machines and People
Michael Franklin
UC Berkeley
The AMPLab is a collaboration of computer scientists and domain experts focused on Big Data analytics. We are developing a new open source data analysis software stack integrating machine learning and data analytics at scale (Algorithms), cloud and cluster computing (Machines) and crowdsourcing (People) to make sense of massive data. Among our current efforts are: a new generation of scalable machine learning algorithms, data management tools for large-scale and heterogeneous datasets, datacenter-friendly programming models, hybrid human/machine query answering systems, and an improved computational infrastructure. We work with application partners to test out these ideas with real-world workloads and data sets. Current application partners focus on cancer genomics, real-time traffic sensing and prediction, urban planning, and Internet security.  AMPLab is a five year effort sponsored in part by an NSF CISE Expeditions in Computing award and by 18 leading technology companies including founding sponsors Amazon Web Services, Google and SAP.  In this talk I'll give an overview of the AMPLab structure, goals and some initial results.

Big Data Social
Johannes Gehrke

There are many Big Data Applications that require users to coordinate and communicate. Friends want to coordinate travel plans, students want to jointly enroll in the same set of courses, and busy professionals want to coordinate their schedules. These tasks are difficult to program using existing abstractions provided by database systems since they all require some type of coordination between users. However, this type of information flow is fundamentally incompatible with classical isolation in database transactions. I will argue that it is time to look beyond isolation towards principled and elegant abstractions for Big Data Social, and I will also describe a large-scale graph processing infrastructure for implementing graph algorithms for social and many other Big Data applications.
This is joint work with Gabriel Bender, Alan Demers, Nitin Gupta, Christoph Koch, Lucja Kot, Milos Nikolic, and Sudip Roy, Wenlei Xie, and Guozhang Wang.

A Holistic Look at Massive-Data Machine Learning: Seven Cross-cutting Techniques
Alexander Gray
Georgia Institute of Technology

We'll begin by looking at seven basic types of machine learning problems, such as classification and clustering, and the methods used to solve them, which include support vector machines, nearest-neighbors, principal component analysis, hierarchical clustering, kernel density estimation, Nadaraya-Watson regression, kernel conditional density estimation, Gaussian process regression, manifold learning, n-point correlation functions, minimum spanning trees, nonparametric Bayes classifiers, and others. I will then describe how we can scale these methods to work on massive datasets, despite their often quadratic or cubic scaling with the number of data, via seven different types of computational techniques. Finally, I'll describe the recently-announced first professional-grade machine learning system that incorporates the techniques described, called Skytree Server.

Understanding Data Sets Jointly 
Leonidas J. Guibas
Stanford University

The information contained across many data sets is often highly correlated. Such connections and correlations can arise because the data captured comes from the same or similar objects, or because of particular symmetries or other self-relations that the data source satisfies. We argue that in extracting knowledge from data in a given data set, we can do much better when we exploit the wider context provided by all the relationships between this data set and a "society" or "social network" or related data sets. We discuss mathematical and algorithmic issues on how to represent and compute relationships or mappings between data and how to analyze and leverage networks, small and large, of inter-related data sets. By analyzing data sets jointly we are able to benefit from the "wisdom of the collection" to deal with issues of missing or uncertain data, outliers, etc.

Faster Algorithms for Sparse Fourier Transform
Piotr Indyk

The Fast Fourier Transform (FFT) is one of the most fundamental numerical algorithms. It computes the Discrete Fourier Transform (DFT) of an n-dimensional signal in O(n log n) time. The algorithm plays a key role in many areas. In many applications (e.g., audio, image or video compression), most of the Fourier coefficients of a signal are ?small? or equal to zero, i.e., the output of the transform is (approximately) sparse. In this case, there are algorithms that enable computing the non-zero coefficients faster than the FFT. However, in practice, the exponents in the runtime of these algorithms and their complex structure have limited their applicability to only very sparse signals.

In this talk, I will describe a new set of algorithms for sparse Fourier Transform. Their key feature is simplicity, which leads to efficient running time with low overhead, both in theory and in practice. In particular, one of the algorithms achieves a runtime of
O(k log n), where k is the number of non-zero Fourier coefficients of the signal. This improves over the runtime of the FFT for any k = o(n).

Joint work with Haitham Hassanieh, Dina Katabi and Eric Price.

LSH-Preserving Transformations
Ravi Kumar

In this talk we present some recent results on locality sensitive hashing (LSH), a key algorithmic tool used in large data applications. Given a similarity function, an LSH is a hash function family such that the similarity of two elements can be captured by the probability with which their respective hash values concur when the hash function is picked from the family. We study the class of transformations under which the set of similarities that admit an LSH is closed; we obtain a tight characterization of this class. We then apply these results in the context of similarities between two sets (such as the widely-used Jaccard measure).

Understanding Climate Change: Opportunities and Challenges for Data Driven Research
Vipin Kumar
University of Minnesota

Climate change is the defining environmental challenge facing our planet. Multiple lines of evidence indicate that the planet has been warming over the past several decades, and if green house gas emissions from fossil fuel combustion and deforestation are not reduced, consequences can be dire. This has created an urgent societal need to consider options for climate change mitigation (e.g., by reduction in green house gas emissions) and adaptation (e.g., strategies and measures to reduce the adverse impacts of climate change such as floods, droughts, etc). Since the societal costs of mitigation and adaptation decisions are very large, it is critical that these decisions are adequately informed by science. Unfortunately, the predictive potential of physics-based models of the earth system is rather limited, and thus there remains considerable uncertainty regarding the social and environmental impacts of climate change.

This talk will discuss opportunities and challenges for computer scientists in addressing this great societal challenge of our times. A special focus is on new data driven approaches that take advantage of the wealth of climate and ecosystem data now available from satellite and ground-based sensors, the observational record for atmospheric, oceanic, and terrestrial processes, and physics-based climate model simulations. These information-rich datasets offer huge potential for monitoring, understanding, and predicting the behavior of the Earth's ecosystem and for advancing the science of climate change. This talk will discuss some of the challenges in analyzing such data sets and our early research results.

Geometry and Approximation for Data in High-dimensions
Mauro Maggioni
Duke University

It has been observed that in many situations data in high-dimensions is intrinsically low-dimensional. We describe various approaches that take advantage of this low-intrinsic dimension to generate efficient representations of data, estimate the distribution of seen and unseen data, and discuss applications to anomaly detection, classification, compressive sampling, and modeling of dynamic data sets.

GPU-based Parallel Algorithms for High Dimensional Nearest Neighbor Computation
Dinesh Manocha
University of North Carolina at Chapel Hill 
Nearest Neighbor computations in huge databases is an important problem in many applications, including databases, multimedia, content-based image and video retrieval (CBIR), and machine learning. With recent explosion of web data, including documents, images, and videos, there is considerable need to develop efficient (sub-linear) techniques. Besides the scalability issues, we also need to deal with the curse of dimensionality since the visual descriptors usually have hundreds or even thousands of dimensions. In this talk, I will give an overview of some recent developments in hashing based methods, including locality sensitive hashing. Furthermore, I would present techniques that can exploit the computational power of commodity many-core GPUs for computation of hash tables and demonstrate their application to image-search and robot motion planning. In the end, I will discuss issues in developing fast many-core algorithms for spectral hashing and supervised methods.

Reverse Data Analytics: Answering Hard Questions on Big Data
Alexandra Meliou
University of Washington

Current trends have seen data grow larger, more intertwined, and more diverse, as an ever increasing number of users contributes to and uses it. This trend has given rise to the need to support richer data analysis tasks. Such tasks involve determining the causes of observations, finding and correcting the sources of error in query results, as well as modifying the data in order to make it conform to complex desirable properties. These tasks share a common premise: they essentially reverse data transformations, in order to achieve a desired target instance or target properties. This sets them apart from more traditional analytics tasks (e.g. pattern discovery, statistics, inference) that focus on forward-moving data flows. Reverse analytics tasks are harder to define and implement, because of the simple fact that the inverse of a function is not always a function.

In this talk, I will discuss three challenges that fall under this reverse paradigm: (a) implementing causal reasoning on top of data ("Why did sales drop last month?"), (b) tracing and correcting errors at their source (post-factum data cleaning), and (c) solving constrained problems over data ("How can I reduce operating costs by 10% without firing employees?").

Massive Data: What Can Go Wrong and How Statistics Can Help
Sarah Michalak
Statistical Sciences Group
Los Alamos National Laboratory

Massive data collected in a wide range of settings may suffer from a variety of issues that require appropriate treatment to enable sound scientific conclusions. For example, the data collection methods may not be ideal for certain scientific questions of interest. The data may also suffer from missing values and measurement error. With large data, there may be many questions of interest and resulting multiple comparisons can lead to false discovery. Statistical methods are available to address these issues in the non-massive regime and are crucial for the massive data regime.

Further, the reliability of the systems processing and storing the data and analysis results must be considered. Faults in these systems can impact research undertaken with massive data in two ways. First, large-scale systems used to process massive data can experience crashes that affect application runtimes. Second, the data itself or computational results may be subject to silent data corruption (SDC), which occurs when an undetected fault leads to incorrect data or an incorrect computational result being delivered to a user without an accompanying error or warning message. These issues underscore the need for applications that are resilient to faults that could cause both crashes and SDC. Further, the possibility of SDC should be accounted for in uncertainty quantifications.

This talk presents an overview of these issues, with an emphasis on the need for the development of methods for massive data that reflect collaborations among domain scientists, computer engineers, data scientists and statisticians and on system reliability, including SDC.

Big Data Challenges in Internet Advertising
Kamesh Munagala
Duke University

We will use advertising and marketing on the internet to showcase several challenges that arise in handling and mining massive data. We focus on designing and evaluating large-scale auctions that allocate ad slots to advertisers, and evaluating quality of ads and various mechanisms of targeting them to users. This brings up challenges related to learning, measurement and evaluation, and incentives. In particular, how can we quickly learn quality of different ads in a repeated auction setting? How do we design "optimal" auctions and targeting schemes that are computationally efficient? What incentive schemes should such auctions implement? How do we visualize the performance of various auctions and targeting mechanisms, and effectively compare them?

Spatial Big-Data
Shashi Shekhar
University of Minnesota

Increasingly, location-aware datasets are of a size, variety, and update rate that exceeds the capability of spatial computing technologies. We call such datasets, Spatial Big Data (SBD). SBD examples include trajectories of cell-phones and GPS devices, vehicle engine measurements, temporally detailed (TD) road maps, etc. SBD has the potential to transform society. A recent McKinsey Global Institute report estimates that personal location data could save consumers hundreds of billions of dollars annually by 2020 by helping vehicles avoid congestion via next-generation routing services such as eco-routing, which compares routes by fuel consumption or greenhouse gas (GHG) emissions rather than total distance or travel-time.

However, the SBD-based next-generation routing services pose several challenges for current routing techniques. First, SBD requires a change in frame of reference, moving from a global snapshot perspective to the perspective of an individual object traveling through a transportation network. Second, SBD may identify a large set of solutions, e.g., one route each for thousands of possible start-times in a week, significantly increasing computational costs. Third, SBD challenges the assumption that a single algorithm utilizing a specific dataset is appropriate for all situations. The tremendous diversity of SBD sources substantially increases the diversity of solution methods. For example, methods for determining fuel efficient routes leveraging engine measurement and GPS track datasets may be quite different from algorithms used to identify minimal travel-time routes exploiting temporally detailed roadmaps. Newer algorithms may emerge as new SBD becomes available, creating the need for a flexible architecture to rapidly integrate new datasets and associated algorithms. 

Big Data Fusion
Divesh Srivastava
AT&T Labs-Research

Big data analytics typically requires integrating data from a large number of data sources. This task is particularly challenging when the data sources are inconsistent, exhibit significant differences in quality, and are not independent of each other, all of which seem to be true in many domains. We present some recent data fusion algorithms that have been proposed to resolve inconsistency between data sources, and discuss their effectiveness and efficiency in practice.

Computing Statistical Summaries over Massive Distributed Data
Ke Yi
Facing today’s big data challenge, it is often impossible and unnecessary to examine the entire data set for many analytical tasks. What is more useful is various concise statistical summaries extracted from the underlying data that capture certain data characteristics. In this talk, I will present simple and efficient algorithms for computing some classical statistical summaries over distributed data, in particular the frequency estimation problem and quantiles. These algorithms can be easily implemented in distributed systems like MapReduce, Dremel, and sensor networks.