The 5th International Workshop
on Innovative Algorithms for Big Data
Abstract
- [Keynote Speech] Pankaj K. Agarwal (Duke University): "Flood Risk Analysis on Large Terrains"
With recent advances in terrain-mapping technologies such as laser altimetry (LiDAR) and ground-based laser scanning, billions of georeferenced points can be acquired within short periods of time. While acquiring and georeferencing the data has become extremely efficient, transforming the resulting massive amounts of heterogeneous data into useful information for different types of users and applications still lags behind, in large part because of the scarcity of scalable algorithms for terrain modeling and analysis that can handle big data sets acquired by different technologies and that can rapidly detect and predict changes in the model as new data is acquired.
An important problem in terrain analysis is modeling how water flows across a terrain and creates floods by filling up depressions. In this talk we discuss a number of flood-risk related problems: given a terrain T, represented as a triangulated xy-monotone surface with n vertices, a rain distribution R, and a volume of rain V, determine which portions of T are flooded. We present efficient algorithms for flood-risk analysis under both the single-flow-direction (SFD) and the multi-flow-direction (MFD) model: in the former, water at a point can flow along one downward slope edge, while in the latter it can flow along multiple downward slope edges; the latter represents flooding events more accurately but is computationally more challenging. We also discuss these problems when there is uncertainty in the terrain data, which must be incorporated into the terrain analysis. By representing the uncertainty in the terrain data explicitly, we develop methods for flood-risk analysis that properly incorporate terrain uncertainty when reporting which areas are at risk of flooding.
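To make the SFD/MFD distinction concrete, here is a minimal sketch (with a hypothetical neighbour representation, not the speaker's data structures) of how flow leaving a single vertex is routed under the two models:

    # Illustrative sketch only: SFD sends all water along the single steepest
    # downslope edge, while MFD splits it among all downslope edges, here
    # proportionally to slope.

    def flow_targets(v_elev, neighbours, model="MFD"):
        """neighbours: list of (vertex_id, elevation, horizontal_distance) tuples."""
        down = [(u, (v_elev - e) / d) for u, e, d in neighbours if e < v_elev]
        if not down:
            return {}                                  # local minimum: water pools here
        if model == "SFD":
            u, _ = max(down, key=lambda t: t[1])
            return {u: 1.0}                            # all flow along the steepest edge
        total = sum(s for _, s in down)
        return {u: s / total for u, s in down}         # MFD: slope-proportional split

    print(flow_targets(10.0, [(1, 8.0, 1.0), (2, 9.5, 1.0), (3, 11.0, 1.0)], "SFD"))  # {1: 1.0}
    print(flow_targets(10.0, [(1, 8.0, 1.0), (2, 9.5, 1.0), (3, 11.0, 1.0)], "MFD"))  # {1: 0.8, 2: 0.2}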
- [Invited Talk] Christina Boucher (University of Florida): "Developing Computational Methods to Analyze Foodborne Illness on a Large Scale"
The money and time needed to sequence a genome have decreased remarkably in the past decade. With this decrease has come an increase in the number of public sequencing projects and the rate at which sequence data is collected for them. This led to GenomeTrakr, a large public effort to use genome sequencing for surveillance and detection of outbreaks of foodborne illnesses. The effort includes over 50,000 samples spanning several species, a number that continues to rise as datasets are continually added. Unfortunately, analysis of this dataset has been limited due to its size. In this talk, I will describe our method for constructing the colored de Bruijn graph for large datasets, which is based on partitioning the data into smaller datasets, building the colored de Bruijn graph of each partition using an FM-index based representation, and succinctly merging these representations to build a single graph. I will then show that it can build a colored de Bruijn graph for 16,000 strains from GenomeTrakr in a manner that allows it to be updated. Lastly, I conclude by outlining some opportunities for further study in this area.
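As a toy illustration of the underlying data structure (a plain hash map here, whereas the method in the talk uses a succinct, mergeable FM-index-based representation), a colored de Bruijn graph annotates each k-mer node with the set of samples ("colors") it occurs in:

    from collections import defaultdict

    def colored_dbg(samples, k):
        """samples: dict mapping sample name -> DNA string."""
        colors = defaultdict(set)
        for name, seq in samples.items():
            for i in range(len(seq) - k + 1):
                colors[seq[i:i + k]].add(name)
        # edges join k-mers that overlap by k-1 characters
        edges = {(u, v) for u in colors for v in colors if u != v and u[1:] == v[:-1]}
        return colors, edges

    nodes, edges = colored_dbg({"strainA": "ACGTACGT", "strainB": "ACGTTCGT"}, k=4)
    print(nodes["ACGT"])   # shared k-mer -> {'strainA', 'strainB'}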
- [Invited Talk] Ton Coolen (King's College London): "Imaginary replica analysis of loopy regular random graphs"
Networks and graphs are effective and popular tools for representing and manipulating data on large and complex systems. Unfortunately, most of our analytical and computational tools for modeling large networks and graphs assume that they are locally treelike, in contrast to most networks in the real world. In this study we present an analytical approach for describing spectrally constrained maximum entropy ensembles of finitely connected regular loopy graphs, valid in the regime of weak loop-loop interactions. We derive an expression for the leading two orders of the expected eigenvalue spectrum, through the use of infinitely many replica indices taking imaginary values. We apply the method to models in which the spectral constraint reduces to a soft constraint on the number of triangles, which exhibit ‘shattering’ transitions to phases with extensively many disconnected cliques, to models with controlled numbers of triangles and squares, and to models where the spectral constraint reduces to a count of the number of adjacency matrix eigenvalues in a given interval. Our predictions are supported by MCMC simulations based on edge swaps with nontrivial acceptance probabilities (a toy edge-swap move is sketched after the reference below).
[Reference]
Fabian Aguirre Lopez, Anthony C. C. Coolen: Imaginary replica analysis of loopy regular random graphs, arXiv:1907.06703.
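A toy stand-in for the simulations mentioned above (not the paper's full spectrally constrained ensemble): a degree-preserving double edge swap with a Metropolis acceptance rule targeting a measure tilted by exp(mu * number_of_triangles), i.e. a soft constraint on the triangle count.

    import math, random

    def triangles(adj):
        return sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v) // 3

    def mcmc_step(adj, edges, mu):
        """adj: dict vertex -> set of neighbours; edges: list of (u, v) tuples."""
        (a, b), (c, d) = random.sample(edges, 2)
        if len({a, b, c, d}) < 4 or d in adj[a] or b in adj[c]:
            return                                 # swap would create a self-loop or multi-edge
        t_old = triangles(adj)
        for u, v in [(a, b), (c, d)]:
            adj[u].discard(v); adj[v].discard(u)
        for u, v in [(a, d), (c, b)]:
            adj[u].add(v); adj[v].add(u)
        if random.random() < min(1.0, math.exp(mu * (triangles(adj) - t_old))):
            edges.remove((a, b)); edges.remove((c, d))
            edges.extend([(a, d), (c, b)])         # accept: keep the swapped edges
        else:                                      # reject: undo the swap
            for u, v in [(a, d), (c, b)]:
                adj[u].discard(v); adj[v].discard(u)
            for u, v in [(a, b), (c, d)]:
                adj[u].add(v); adj[v].add(u)

Repeating this step preserves all vertex degrees while biasing the ensemble toward (or away from) triangles, the kind of soft loop constraint the analysis addresses.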
- [Invited Talk] Khaled Elbassioni (Khalifa University of Science and Technology): "Faster Algorithms for Large-Scale Packing and Covering Semidefinite Programs"
Packing and covering semidefinite programs (SDPs) appear in natural relaxations of many combinatorial optimization problems as well as in a number of other applications. Recently, several techniques have been proposed that exploit the particular structure of this class of problems to obtain more efficient algorithms than those offered by general SDP solvers. For certain applications, it may be necessary to deal with SDPs with a very large number of constraints (e.g., exponentially or even infinitely many). In this talk, we give an overview of some of the techniques that can be used to solve this class of problems, focusing mostly on multiplicative weight updates and logarithmic potential methods.
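As a rough illustration of the multiplicative-weight-update idea (shown here on a toy packing linear program with made-up data, not an SDP and not the speaker's algorithm): keep one weight per constraint, let a simple oracle respond to the weighted aggregate constraint, and multiplicatively boost the weights of violated constraints; the averaged iterate then approximately satisfies all constraints.

    import numpy as np

    # Toy packing feasibility problem: find x >= 0 with sum(x) = k such that A x <= 1.
    rng = np.random.default_rng(0)
    m, n, k = 50, 200, 5.0
    A = rng.uniform(0.0, 1.5 / k, size=(m, n))     # hypothetical feasible instance

    eta, T = 0.1, 2000
    w = np.ones(m)                                 # one weight per packing constraint
    avg_x = np.zeros(n)

    for _ in range(T):
        p = w / w.sum()                            # distribution over constraints
        j = int(np.argmin(p @ A))                  # oracle: cheapest column under weights p
        x = np.zeros(n)
        x[j] = k                                   # vertex of the scaled simplex
        violation = A @ x - 1.0                    # signed violation of each A_i x <= 1
        w *= np.exp(eta * np.clip(violation, -1.0, 1.0))  # boost violated constraints
        avg_x += x / T

    print("max constraint violation of the averaged iterate:", float((A @ avg_x - 1.0).max()))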
- [Invited Talk] Chiou-Ting Candy Hsu (National Tsing Hua University): "Domain Adaptation and Generalization via Adversarial Learning"
Domain adaptation aims to learn a model on an unlabeled target domain by referring to a fully labelled source domain, whereas domain generalization refers to multiple source domains and aims to adapt the learnt model to an unseen target domain. This talk will address issues in both problems and will focus on the application of image classification. To deal with the domain adaptation problem, we adopt the idea of domain-adversarial training to minimize the source-target domain shift. Furthermore, in order to ensure the classifier is also target-discriminative, we include a generative network to guide the classifier so as to push its decision boundaries away from high-density areas of the target domain. On the other hand, to deal with the domain generalization problem, we propose to improve the generalization capability by explicitly referring to image content and style from different source domains, and we impose an additional constraint on the discriminator to further boost the class-discriminative capability.
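A minimal sketch of domain-adversarial training with a gradient-reversal layer (a generic, assumed architecture in PyTorch, not the speaker's exact model): the feature extractor learns to fool a domain discriminator while a classifier is trained on labelled source data.

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lam * grad_output, None    # reverse the gradient for the extractor

    features = nn.Sequential(nn.Linear(256, 128), nn.ReLU())   # feature extractor
    classifier = nn.Linear(128, 10)                            # label predictor
    domain_disc = nn.Linear(128, 2)                            # domain discriminator

    def total_loss(x_src, y_src, x_tgt, lam=1.0):
        ce = nn.CrossEntropyLoss()
        f_src, f_tgt = features(x_src), features(x_tgt)
        cls_loss = ce(classifier(f_src), y_src)                # source classification loss
        f_all = torch.cat([f_src, f_tgt])
        d_labels = torch.cat([torch.zeros(len(x_src)), torch.ones(len(x_tgt))]).long()
        dom_loss = ce(domain_disc(GradReverse.apply(f_all, lam)), d_labels)
        return cls_loss + dom_loss                             # minimized jointly by one optimizer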
- [Invited Talk] Ilan Newman (The University of Haifa): "Some recent results on sublinear algorithms for approximating centrality parameters in graphs"
We discuss several "centrality parameters" associated with vertices in undirected large graphs (social networks), e.g., the average distance from a vertex, among others. We will make some comparisons between several centrality parameters. We then show that such parameters cannot be efficiently approximated for bounded-degree networks. However, under a distance oracle (rather than just a neighbourhood oracle), the approximation becomes much more efficient.
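For intuition, here is a sketch of estimating one vertex's average distance (the inverse of closeness centrality) by sampling, assuming access to a distance oracle; a plain BFS stands in for the oracle here, purely for illustration, and is not the talk's algorithm.

    import random
    from collections import deque

    def bfs_distance(adj, s, t):
        dist, frontier = {s: 0}, deque([s])
        while frontier:
            u = frontier.popleft()
            if u == t:
                return dist[u]
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    frontier.append(v)
        return float("inf")                        # t unreachable from s

    def estimate_avg_distance(adj, v, num_samples=100):
        others = [u for u in adj if u != v]
        sample = random.choices(others, k=num_samples)
        return sum(bfs_distance(adj, v, u) for u in sample) / num_samples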
- [Invited Talk] Hidetoshi Nishimori (Tokyo Institute of Technology / Tohoku University): "Acceleration of quantum annealing by non-traditional driving of quantum effects"
Quantum annealing is a quantum-mechanical metaheuristic for combinatorial optimization problems. After a general review of the field, I will describe recent developments in several highly non-trivial protocols to accelerate quantum annealing, including non-stoquastic Hamiltonians, inhomogeneous field driving, reverse annealing, and mid-anneal pausing. These methods are expected, and some have been shown, to lead to an exponential acceleration of computation compared to the conventional method. I will explain how this is possible and what we can expect in the near future in this field.
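For context, a minimal textbook form of the annealing Hamiltonian (not the specific protocols described in the talk) interpolates between a transverse-field driver H_D and the classical problem Hamiltonian H_P, with s swept from 0 to 1:

    H(s) = s\,H_P + (1-s)\,H_D, \qquad s = t/\tau \in [0,1],
    H_P = -\sum_{i<j} J_{ij}\,\sigma_i^z \sigma_j^z - \sum_i h_i\,\sigma_i^z, \qquad
    H_D = -\Gamma \sum_i \sigma_i^x,

and non-stoquastic protocols add, for example, a two-body transverse coupling +\lambda(s) \sum_{i<j} \sigma_i^x \sigma_j^x, whose positive off-diagonal matrix elements make the Hamiltonian non-stoquastic.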
- [Invited Talk] Nicola Prezza (University of Pisa): "Compressed data structures for highly repetitive data"
The last decade has witnessed an explosion in the production of big repetitive datasets by modern technologies such as genome-analysis facilities (DNA sequencing), web-based services (e.g. Wikipedia, GitHub, YouTube) and social networks, to name a few. From the computational point of view, this represents a real data “deluge”: classic string-processing algorithms are not able to process this data in reasonable time and space. Not all hope is lost, however. All these big data sources share a common feature: they produce highly repetitive data. This means that their output is extremely compressible and opens up new exciting algorithmic challenges: to manipulate and analyze such data in compressed format, without first decompressing it. This talk is a survey of the state-of-the-art algorithmic techniques that have been developed in the last decades to address the problem of indexing compressed text: the FM-index, Lempel-Ziv and grammar indexes, the run-length Burrows-Wheeler transform index, and universally compressed indexes based on string attractors.
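A toy sketch of the Burrows-Wheeler transform and its run-length encoding, illustrating why repetitive text yields few BWT runs, the quantity that run-length FM-indexes exploit; real indexes use suffix-array construction rather than this quadratic rotation sort.

    def bwt(s):
        s += "$"                                   # unique end-of-string sentinel
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rot[-1] for rot in rotations)

    def run_length_encode(s):
        runs, i = [], 0
        while i < len(s):
            j = i
            while j < len(s) and s[j] == s[i]:
                j += 1
            runs.append((s[i], j - i))
            i = j
        return runs

    text = "ACGTGA" * 50                           # highly repetitive toy string
    print(len(text) + 1, "characters,", len(run_length_encode(bwt(text))), "BWT runs")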