Program 2015

Big Data Technologies 2015 - Class program


The classes will take place in room KN:E 301 on Wednesdays, starting at 9:15.




  1. Introduction, class program, requirements. (Sedivy+Vondra)
    Introduction to Metacenter:

    1. How to create an account in Metacenter.

    2. Basic setup.

    3. Data location.

    4. Write and run the word count (a minimal example follows this list).
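
    A minimal sketch of the word count in the Java MapReduce API (org.apache.hadoop.mapreduce); this is essentially the standard Hadoop example, with input and output paths taken from the command line:

      import java.io.IOException;
      import java.util.StringTokenizer;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCount {

        // Mapper: emits (word, 1) for every token of the input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
          private final static IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          public void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, ONE);
            }
          }
        }

        // Reducer: sums the counts collected for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class);  // the reducer doubles as a combiner
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }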

  2. From a single CPU to Big Data and Hadoop (Tomas Vondra)

    1. The Internet and data growth

    2. Infrastructure development

    3. Multicore, parallel computing, grid computing, Hadoop

    4. Numbers everyone should know

  3. MapReduce basics: Hadoop pipeline, MapReduce (TT)

    1. Start a Hadoop task from the command line.

    2. Data preparation

      1. What the I/O formats are and how they fit into MapReduce/Hadoop.

      2. Basic HDFS data operations: copying data in and out of HDFS.

    3. Start a task with data in HDFS or from the distributed cache (node-local data).

    4. Start a task with external dependencies; generic command-line options (see the ToolRunner sketch after this list).

    5. Task administration: start, stop, kill, monitoring.
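
    External dependencies and the generic command-line options are easiest to see on a job that implements Tool; a minimal sketch (the jar name, class name, and paths in the comment are placeholders):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.conf.Configured;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.util.Tool;
      import org.apache.hadoop.util.ToolRunner;

      // Launch with e.g.:
      //   hadoop jar myjob.jar MyJob -D mapreduce.job.reduces=4 \
      //       -files stopwords.txt -libjars extra-dep.jar in/ out/
      // ToolRunner strips the generic options (-D, -files, -libjars, ...)
      // and leaves only the remaining arguments in args[].
      public class MyJob extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
          Configuration conf = getConf();  // already contains the -D settings
          Job job = Job.getInstance(conf, "my job");
          job.setJarByClass(MyJob.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
          System.exit(ToolRunner.run(new MyJob(), args));
        }
      }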

  4. Preprocessing + Wordcount (TG): a simple task to get acquainted with the basics.

    1. Input: raw text with entity IDs, titles, and article text.

    2. Filter out words longer than MAX_LENGTH, remove non-ASCII characters, remove digits, etc.

    3. In a follow-up MapReduce task, use the output of the word count to remove stop words (a mapper sketch follows this list).
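
    A possible shape of the filtering mapper from point 2; the MAX_LENGTH value and the regular expressions are assumptions, not the course's reference solution:

      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      // Tokenizes the article text, drops non-ASCII characters, digits and
      // over-long tokens, and emits (word, 1) pairs for the word count.
      public class FilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int MAX_LENGTH = 24;  // placeholder limit
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String text = value.toString()
              .replaceAll("[^\\p{ASCII}]", " ")    // remove non-ASCII characters
              .replaceAll("\\d", " ")              // remove digits
              .toLowerCase();
          for (String token : text.split("\\W+")) {
            if (token.isEmpty() || token.length() > MAX_LENGTH) {
              continue;                            // skip empty or over-long words
            }
            word.set(token);
            context.write(word, ONE);
          }
        }
      }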

  5. Programming MapReduce in Java (TT)

    1. The MapReduce pipeline - Java modules and classes:

      1. InputFormat/OutputFormat - HDFS.

      2. RecordReader/RecordWriter - what input partitioning and FileSplits are.

      3. Map - the expected input and output.

      4. Reduce - the expected input and output.

    2. Explaining the data flow: the Readers→Mappers→Reducers→Writers protocol.

    3. Key-Value pairs.

    4. Which framework classes to extend in a Java project.

    5. How the Shuffle/Sort works:

      1. Partitioner class.

      2. GroupingComparator.

    6. Instantiation, initialization, and destruction of Mappers/Reducers.

    7. How to read Configuration data and the distributed cache (see the setup() sketch after this list).

    8. How to produce multiple outputs (MultipleOutputFormat).

    9. How to execute several tasks in one run.
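
    Points 6 and 7 in one minimal sketch: setup() runs once per Mapper instance and is the usual place to read Configuration values and distributed-cache files. The parameter name, file name, and default limit below are placeholders:

      import java.io.BufferedReader;
      import java.io.FileReader;
      import java.io.IOException;
      import java.util.HashSet;
      import java.util.Set;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.NullWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class StopWordMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Set<String> stopWords = new HashSet<String>();
        private int maxLength;

        @Override
        protected void setup(Context context) throws IOException {
          // A job parameter set on the command line with -D myjob.max.length=...
          maxLength = context.getConfiguration().getInt("myjob.max.length", 24);
          // A file shipped with "-files stopwords.txt" appears in the task's
          // working directory under its plain name.
          BufferedReader in = new BufferedReader(new FileReader("stopwords.txt"));
          String line;
          while ((line = in.readLine()) != null) {
            stopWords.add(line.trim());
          }
          in.close();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && token.length() <= maxLength && !stopWords.contains(token)) {
              context.write(new Text(token), NullWritable.get());
            }
          }
        }
      }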

  6. Build a Wikipedia Index: (TT)

    1. Input: raw text with entity IDs, titles, and article text.

    2. Output: indexed words

    3. Follow these steps:

      1. Create a dictionary with the calculated TF (term frequency).

      2. Calculate the IDF (inverse document frequency).

      3. Compute the document length and the average document length.

      4. Create an index:

        1. Use a word ID (from your vocabulary) as the key.

        2. The value is the list of document IDs with the word's count in each document (SparseVectorFormat); a reducer sketch follows this list.
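
    One way the index-building reducer of step 4 might look, assuming the mapper already emits (word ID, "docId:count") pairs; the plain-text posting format here stands in for the SparseVectorFormat mentioned above:

      import java.io.IOException;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      // Key: word ID; values: "docId:count" fragments emitted by the mapper.
      // The reducer concatenates them into one posting list per word.
      public class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text wordId, Iterable<Text> postings, Context context)
            throws IOException, InterruptedException {
          StringBuilder list = new StringBuilder();
          for (Text posting : postings) {
            if (list.length() > 0) {
              list.append(' ');
            }
            list.append(posting.toString());
          }
          context.write(wordId, new Text(list.toString()));
        }
      }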

  7. Small NLP and IR tasks. (Sedivy)

    1. Introduction to IR tasks.

    2. TF-IDF.

    3. Boolean search.

    4. BM25 (one standard formulation is given after this list).
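
    For reference, one standard formulation of the BM25 score (k1 and b are free parameters, typically k1 between 1.2 and 2.0 and b = 0.75):

      score(D, Q) = sum over terms q in Q of
          IDF(q) * f(q, D) * (k1 + 1) / ( f(q, D) + k1 * (1 - b + b * |D| / avgdl) )

      IDF(q) = log( (N - n(q) + 0.5) / (n(q) + 0.5) )

    where f(q, D) is the frequency of term q in document D, |D| is the document length, avgdl is the average document length (both computed in class 6), N is the number of documents, and n(q) is the number of documents containing q.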

  8. Implement a simple search on Hadoop using BM25: (TG)

    1. Given a query, select the documents containing the query terms.

    2. Calculate BM25 for the selected documents and return a sorted list of their IDs (a scoring sketch follows this list).
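
    A minimal in-memory sketch of the scoring step, assuming the TF, IDF, and document-length statistics from class 6 have been loaded into maps; all names and data structures here are placeholders, not the reference solution:

      import java.util.Map;

      // Scores the candidate documents for one query term at a time and
      // accumulates the per-document BM25 contributions (formula above).
      public class Bm25Scorer {
        private static final double K1 = 1.2;
        private static final double B = 0.75;

        public static Map<Integer, Double> score(
            Map<Integer, Integer> termFreqByDoc,  // doc ID -> f(q, D) for this term
            double idf,                           // IDF(q) for this term
            Map<Integer, Integer> docLen,         // doc ID -> |D|
            double avgdl,                         // average document length
            Map<Integer, Double> accumulator) {   // doc ID -> running score
          for (Map.Entry<Integer, Integer> e : termFreqByDoc.entrySet()) {
            int doc = e.getKey();
            double tf = e.getValue();
            double norm = K1 * (1 - B + B * docLen.get(doc) / avgdl);
            double contribution = idf * tf * (K1 + 1) / (tf + norm);
            Double old = accumulator.get(doc);
            accumulator.put(doc, (old == null ? 0.0 : old) + contribution);
          }
          return accumulator;
        }
      }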

  9. NoSQL basics. (TG)

    1. Review of the basic NoSQL databases.

    2. HBase basics:

      1. read,

      2. write,

      3. update,

      4. delete.

    3. Implementation in Java (a CRUD sketch follows this list).
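
    The four basic operations against a hypothetical "index" table, using the classic HTable client API (method names differ slightly across HBase versions; the table, family, and qualifier names are placeholders):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.Delete;
      import org.apache.hadoop.hbase.client.Get;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.util.Bytes;

      public class HBaseCrud {
        public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "index");  // hypothetical table

          // Write (and update: a Put on an existing cell writes a new version).
          Put put = new Put(Bytes.toBytes("word42"));
          put.add(Bytes.toBytes("postings"), Bytes.toBytes("doc7"), Bytes.toBytes("3"));
          table.put(put);

          // Read.
          Get get = new Get(Bytes.toBytes("word42"));
          Result result = table.get(get);
          byte[] count = result.getValue(Bytes.toBytes("postings"), Bytes.toBytes("doc7"));
          System.out.println(Bytes.toString(count));

          // Delete.
          Delete delete = new Delete(Bytes.toBytes("word42"));
          table.delete(delete);

          table.close();
        }
      }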


  10. Faster implementation of class 8 with HBase: (TG)

    1. Save the output of step 2 to HBase.

    2. Repeat step 3 using the data stored in HBase.

    3. Implement a simple command-line query interface (a sketch follows this list).
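
    A possible shape of the command-line interface; rankedDocuments() is a stub standing in for the HBase lookup and BM25 ranking built in the previous steps:

      import java.io.BufferedReader;
      import java.io.InputStreamReader;
      import java.util.Arrays;
      import java.util.List;

      // Reads queries from stdin until EOF and prints the ranked document IDs.
      public class QueryCli {
        // Placeholder: in the real exercise this wraps the HBase lookup
        // and the BM25 ranking.
        static List<String> rankedDocuments(String query) {
          return Arrays.asList("doc7", "doc3");
        }

        public static void main(String[] args) throws Exception {
          BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
          System.out.print("query> ");
          String query;
          while ((query = in.readLine()) != null) {
            for (String docId : rankedDocuments(query)) {
              System.out.println(docId);
            }
            System.out.print("query> ");
          }
        }
      }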


  11. Real-time processing of streamed data - Storm (TB):
    Twitter data processing, a simple sentiment algorithm (Barton)

  12. Storm (TB): practical examples of real-time processing of streamed data (a minimal topology sketch follows).
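
    A minimal Storm topology sketch in the 2015-era backtype.storm API; the spout emits canned "tweets" and the sentiment bolt is a toy word-list counter, both placeholders for the real Twitter processing:

      import java.util.Arrays;
      import java.util.Map;

      import backtype.storm.Config;
      import backtype.storm.LocalCluster;
      import backtype.storm.spout.SpoutOutputCollector;
      import backtype.storm.task.TopologyContext;
      import backtype.storm.topology.BasicOutputCollector;
      import backtype.storm.topology.OutputFieldsDeclarer;
      import backtype.storm.topology.TopologyBuilder;
      import backtype.storm.topology.base.BaseBasicBolt;
      import backtype.storm.topology.base.BaseRichSpout;
      import backtype.storm.tuple.Fields;
      import backtype.storm.tuple.Tuple;
      import backtype.storm.tuple.Values;
      import backtype.storm.utils.Utils;

      public class SentimentTopology {

        // Placeholder spout: a real one would read the Twitter streaming API.
        public static class TweetSpout extends BaseRichSpout {
          private SpoutOutputCollector collector;

          public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
          }

          public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("hadoop is great"));
          }

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("text"));
          }
        }

        // Toy sentiment bolt: counts words from a tiny positive-word list.
        public static class SentimentBolt extends BaseBasicBolt {
          public void execute(Tuple tuple, BasicOutputCollector collector) {
            String text = tuple.getStringByField("text");
            long positive = 0;
            for (String w : text.split("\\s+")) {
              if (Arrays.asList("great", "good", "happy").contains(w)) {
                positive++;
              }
            }
            collector.emit(new Values(text, positive));
          }

          public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("text", "positive"));
          }
        }

        public static void main(String[] args) {
          TopologyBuilder builder = new TopologyBuilder();
          builder.setSpout("tweets", new TweetSpout());
          builder.setBolt("sentiment", new SentimentBolt()).shuffleGrouping("tweets");
          LocalCluster cluster = new LocalCluster();  // local test cluster
          cluster.submitTopology("sentiment-demo", new Config(), builder.createTopology());
        }
      }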


  13. Wrap-up (Sedivy)
    from the Internet to IoT

  14. Zapocet (course credit assessment). (Sedivy)