Organizational stuff, introduction, AsterixDB
Pregel, Pregelix
The goal is to find pairs of similar documents in a collection of short text documents, such as tweets. To evaluate document similarity, each document is transformed into a set of shingles, and shingle sets are compared with Jaccard similarity. For this purpose you will have to write a UDF that splits a string into shingles of a given length. Then prepare a SQL++ query that, for each document, finds the list of sufficiently similar documents ordered by their similarity. The required similarity threshold should be given as a parameter. Additionally, you are required to use one of the feed adapters so that the document base grows in response to external events. Hint: except for the last requirement, the task is a combination of tasks that were performed during the course. Look for useful links on the course page.
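The two building blocks can be sketched as follows. This is a plain-Python illustration of the idea, not the required AsterixDB UDF; the function names and the choice of character-level shingles are assumptions for the example.

```python
def shingles(text, k):
    """Return the set of all contiguous substrings of length k."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

# Two short "tweets" compared via their 3-shingle sets.
s1 = shingles("big data rocks", 3)
s2 = shingles("big data flows", 3)
similarity = jaccard(s1, s2)
```

In the assignment, the UDF produces the shingle sets and the SQL++ query computes the Jaccard similarity over pairs of documents, filtering by the threshold parameter.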
As a solution, send a bash script that downloads, installs, and runs AsterixDB in the current directory (it needs to work in the lab) and loads the sample data into it. A second script should periodically query the data retrieved by the feed adapter. For ease of testing, I suggest you use the "localfs" feed adapter. Email your solutions to sroka@mimuw.edu.pl before 23:59:59 Nov 14th. Make sure you attach your files.
Balanced Practical Pregel Algorithms (and a note on minimal MapReduce algorithms)
Connected Components on Pregel by adapting Shiloach-Vishkin algorithm for PRAM
Streams
An undirected graph is given in the form of a set of text documents. Each line describes a single vertex as:
vertex_id neighbour_id_1 neighbour_id_2 neighbour_id_3 …
Vertex identifiers are integers. An example generator of such graphs can be found here. The goal is to find the connected components of the given graph, i.e., to assign to each vertex the id of a unique representative vertex of its connected component. All vertices of a connected component must be assigned the same vertex id. There are three options for this assignment.
Use the naïve propagation algorithm from the seminal Pregel paper together with the Single-Pivot optimization as described by Salihoglu & Widom.
Use the adaptation of the Shiloach-Vishkin PRAM algorithm as described in the paper by Yan et al.
Use any other interesting algorithm for CC on Pregel that you can find in the literature.
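The naïve propagation algorithm of option 1 is easy to state: every vertex starts with its own id as its component label and repeatedly adopts the minimum label among itself and its neighbours until no label changes. The sketch below simulates this vertex-centric computation sequentially in Python (it is not Giraph code); the parser follows the line format given above.

```python
def parse_graph(lines):
    """Parse 'vertex_id neighbour_id…' lines into an adjacency dict."""
    graph = {}
    for line in lines:
        ids = [int(x) for x in line.split()]
        if ids:
            graph[ids[0]] = ids[1:]
    return graph

def connected_components(graph):
    """Naive min-label propagation: every vertex repeatedly adopts the
    smallest label seen among itself and its neighbours, until a fixpoint.
    Each component ends up labelled with its minimum vertex id."""
    label = {v: v for v in graph}
    changed = True
    while changed:
        changed = False
        for v, nbrs in graph.items():
            m = min([label[v]] + [label[u] for u in nbrs])
            if m < label[v]:
                label[v] = m
                changed = True
    return label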
The example file dblp-2011.zip contains a dump of the DBLP social graph made in 2011. Its vertices are scientists; there is an edge between two of them if they have coauthored an article. Your solutions will be tested on this graph.
Please provide the source code of your solution classes and instructions on how to run them on the standard Giraph installation as discussed in the labs. Look for useful links on the course page. Email your solutions to stencel@mimuw.edu.pl before 23:59:59 Dec 23rd. Make sure you attach your files.
Optimizing Graph Algorithms on Pregel-like Systems
Apache Flume and an introduction to process mining
Datalog, its semantics and evaluation
Process mining in action:
Lab:
The goal of the assignment is to (1) process a stream of data, (2) take a reasonable sample of the stream that arrives during subsequent time periods, e.g., every month, and (3) mine this sample for a process model.
As the input for the assignment, choose one of the data sets for process mining. Provide a script that initializes the stream with this input. As the output, we expect a list of figures with process models for the samples from the subsequent time periods, e.g., one figure per month.
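One standard way to take a "reasonable sample" of each time period is reservoir sampling, which maintains a uniform fixed-size sample of a stream without knowing its length in advance. A minimal sketch (the function name and use of Python's `random` module are illustrative; wiring it to the actual stream and period boundaries is up to you):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of at most k items from a stream.

    The i-th item (0-based) replaces a random slot with probability
    k / (i + 1), which yields a uniform sample of the whole stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample
```

Running one reservoir per period (resetting it, e.g., at each month boundary) gives the per-period samples that are then fed to the process-mining tool.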
Distributed scalable execution of Datalog
Guest Speaker: Jarek Kuśmierek (Google)
Title: Data processing: Apache Beam + Dataflow
Abstract: Apache Beam is a platform for building distributed, scalable jobs that process large amounts of data, in both the batch and the streaming model. It lets you program data transformations at a high level of abstraction, hiding many details, and in particular allows the same computation to be run on a given portion of data either in batch mode or as a stream. Dataflow is a platform within Google Cloud Platform that runs Apache Beam computations in a fully automated way, relieving the user of many tasks. In the talk I will cover the basic concepts of the Apache Beam computational model and the added value of using Dataflow.
SociaLite: Datalog Extensions for Efficient Social Network Analysis