ZBD2 2017/2018

Lecture 4 (Oct 24, 2017)



First assignment (deadline: Nov 14, 2017)

The goal is to find pairs of similar documents in a collection of short text documents, such as tweets. To evaluate the similarity of documents, they should be transformed into shingles and on shingle sets we use Jaccard similarity. For this purpose you will have to write a UDF that divides the string into shingles of a given length. Then prepare a SQL++ query that will find for each document a list of other documents ordered by their similarity  that are sufficiently similar. The required level of similarity should be given as a parameter. Additionally you are required to use one of the feed adapters so that the document base grows in response to external events. Hint: Except the last requirement, the task consists of a combination of tasks that were performed during the course. Look for useful links on the course page.

As a solution you should send a bash script that will download, install and run the asterix db in the current directory (it needs to work in lab) and load the sample data into it. A second script should periodically query the data retrieved by the feed adapter. For ease of testing, I suggest you use the "localfs" feed adapter. Email your solutions to sroka@mimuw.edu.pl before 23:59:59 Nov 14th. Make sure you attach your files.

Lecture 5 (Oct 31, 2017)

Balanced Practical Pregel Algorithms (and a note on minimal MapReduce algorithms)


Lecture 6 (Nov 7, 2017)



Lecture 7 (Nov 14, 2017)

Connected Components on Pregel by adapting Shiloach-Vishkin algorithm for PRAM


Second assignment (deadline: Dec 23, 2017)

A undirected graph is given in the form of a bunch of text documents. Each line describes a single vertex as:

vertex_id neighbour_id_1 neighbour_id_2 neighbour_id_3 … 

Vertex identifiers are integers. Example generator of such graphs can be found here. The goal is to find connected components of the given graph, i.e. to assign each vertex a unique representing vertex id in its connected component. All vertices of each connected component must be assigned the same vertex id. There are three options of this assignment.
  1. Use the naïve propagation algorithm from the Pregel seminal paper and the Single-Pivot optimization as described by Salihoglu & Widom.
  2. Use the adaptation of Shiloach-Vishking PRAM algorithm as described in the paper of Yan et al.
  3. Use any other interesting algorithm for CC on Pregel that you can find in the literature.
An example file dblp-2011.zip contains a dump of the DBLP social graph made in 2011. Its vertices are scientists. There is an edge between to of them, if they have coauthored an article.  Your solutions will be tested using this graph.

Please provide the source code of solutions classes and an instruction how to run it on the standard installation of Giraph as discussed in labs. Look for useful links on the course page. Email your solutions to stencel@mimuw.edu.pl before 23:59:59 Dec 23rd. Make sure you attach your files.

Lecture 10 (Dec 5, 2017)

Apache Flume and an introduction to process mining


  • http://flume.apache.org/FlumeUserGuide.html
  • Prepare an agent that listens on a specified port and at the same time generates logs and forward messages through HTTP. Then, prepare a second agent, which receives HTTP messages and forwards them to the console. Connect both agents. Try if you can connect to an agent of the person sitting at a computer next to you. Try to use selectors.

Lecture 12 (Dec 19, 2017)

Process mining in action:


Third assignment (deadline: Jan 23, 2017)

The goal of the assignment is to (1) process a stream of data, (2) take a reasonable sample of the stream that arrives during subsequent time periods, e.g., every month, and (3) mine this sample for a process model.

  1. Has to be implemented in apache flume with at least two agents running on different computers.
  2. Has to be implemented with algorithm from lecture 6, so we process for example only 10% of the stream, but the process model for every case is not effected by that.
  3. The 10% sample from the data from for subsequent time period is to be analyzed with ProM Lite running in a command line mode as we did during the last lab (https://dirksmetric.wordpress.com/2015/03/11/tutorial-automating-process-mining-with-proms-command-line-interface/).

As the input for the assignment please choose one of the data sets for process mining. Provide some kind of script which initializes the stream with this input. As the output we expect a list of figures with process models for the samples from subsequent time periods, e.g., from every month.

Lecture 14 (Jan 18, 2018)

Guest Speaker: Jarek Kuśmierek (Google)

Title: Data processing: Apache Beam + Dataflow

Abstract: Apache Beam jest platformą do budowania rozproszonych i skalowalnych zadań przetwarzających duże ilości danych - w modelu wsadowym i strumieniowym. Umożliwia programowanie przekształceń danych na wysokim poziomie - abstrahując od wielu szczegółów - w tym umożliwiając uruchomienie tego samego obliczenia na jednej porcji danych w trybie wsadowym i strumieniowo. Dataflow jest plaformą wewnątrz Google Cloud Platform, która umożliwia uruchamianie obliczeń Apache Beam w sposób w pełni zautomatyzowany - zdejmując z użytkownika wiele zadań. Podczas prezentacji opowiem o podstawowych koncepcjach modelu obliczeniowego Apache Beam oraz o tym jaka jest wartość dodana ze stosowania Dataflow.


Exam (Feb 1, 2018)

Only in Polish since no foreigners attended :(