Big Data Technologies 2015 - Classes program
The classes will take place in KN:E 301 on Wednesdays, starting at 9:15.
Each entry below gives the class title, with the presentation template and description.
Introduction, classes program, requirements. (Sedivy+Vondra)
Introduction to Metacenter -
How to create an account in Metacenter.
Write and run a word count.
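As a warm-up, the counting logic of the word count can be sketched in plain Java (class and method names here are illustrative; the Hadoop job distributes exactly this map-and-sum step across mappers and reducers):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of the word-count logic. In Hadoop, mappers emit
// (word, 1) pairs and reducers sum them; this does the same in one process.
public class WordCount {
    // Counts word occurrences in a whitespace-separated text.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```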
From a Single CPU to Big Data Hadoop (Tomas Vondra)
The Internet and data growth
The development of the infrastructure
Multicore, parallel computing, grid computing, Hadoop
Numbers everyone should know
MapReduce basics: Hadoop pipeline, MapReduce (TT)
Start a Hadoop task from the command line.
What the I/O formats are and how they fit into MapReduce/Hadoop.
Basic HDFS data operations: copying data in and out of HDFS.
Start a task with data in HDFS or from the distributed cache (node-local data).
Start a task with external dependencies; generic command-line options.
Task administration: start, stop, kill, monitoring.
Preprocessing + Wordcount (TG): a simple task to get acquainted with the basics.
Input: raw text with the entity IDs, title, and text.
Filter out words longer than MAX_LENGTH, remove non-ASCII characters, remove digits, etc.
In a following MapReduce task, use the word-count output to remove the stop words.
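The filtering rules above could look like this in plain Java (a sketch only; the MAX_LENGTH value and the helper name are assumptions, not part of the assignment):

```java
// Illustrative preprocessing step: strip non-ASCII characters and digits,
// then drop words longer than an assumed MAX_LENGTH cutoff.
public class Preprocess {
    static final int MAX_LENGTH = 20; // assumed cutoff, adjust as needed

    // Returns the cleaned word, or an empty string when the word is rejected.
    public static String clean(String word) {
        String w = word.replaceAll("[^\\x00-\\x7F]", "")  // remove non-ASCII
                       .replaceAll("\\d", "");            // remove digits
        return w.length() > MAX_LENGTH ? "" : w;
    }
}
```

In the real task this logic would live in the mapper, emitting only the words that survive the filter.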
Programming MapReduce in Java (TT)
The MapReduce pipeline - Java modules and classes:
InputFormat/OutputFormat - HDFS.
RecordReader/RecordWriter - input partitioning and FileSplits.
Map - the expected input and output.
Reduce - the expected input and output.
Explaining the data flow: the Readers→Mappers→Reducers→Writers protocol.
Which framework classes to inherit from in a Java project.
How the Shuffle/Sort works.
Instantiation, initialization, and destruction of Mappers/Reducers.
How to read Configuration data and distributed cache.
How to produce multiple outputs (MultipleOutputFormat).
How to execute several tasks in one run.
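The Readers→Mappers→Shuffle/Sort→Reducers→Writers flow above can be emulated in-memory in a few lines of plain Java (a teaching sketch with invented names, not the Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// In-memory emulation of the MapReduce protocol: map each record to a
// (key, value) pair, group values by sorted key (Shuffle/Sort), then
// reduce each key's value list to a single output value.
public class MiniMapReduce {
    public static <I> Map<String, String> run(
            List<I> records,
            Function<I, Map.Entry<String, String>> mapper,
            BiFunction<String, List<String>, String> reducer) {
        // Shuffle/Sort: group mapper output by key in sorted key order.
        TreeMap<String, List<String>> groups = new TreeMap<>();
        for (I record : records) {
            Map.Entry<String, String> kv = mapper.apply(record);
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        // Reduce: one call per key, receiving all of that key's values.
        Map<String, String> out = new TreeMap<>();
        groups.forEach((k, vs) -> out.put(k, reducer.apply(k, vs)));
        return out;
    }
}
```

In Hadoop the same contract holds, but the grouping happens across machines and the framework instantiates the Mapper/Reducer classes for you.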
Build a Wikipedia Index (TT):
Input: raw text with the entity IDs, title and text
Output: indexed words
Follow these steps:
Create a dictionary with the computed TF values.
Compute each document's length and the average document length.
Create an index:
Use a word ID (from your vocabulary) as the key.
The value is the list of document IDs, each with the word's count in that document (SparseVectorFormat).
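The target index shape can be illustrated with a small in-memory builder (names are illustrative and the postings are plain arrays here, standing in for the SparseVectorFormat values):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the index described above: word -> postings list, where each
// posting is {docId, count of the word in that document}.
public class InvertedIndex {
    public static Map<String, List<int[]>> build(Map<Integer, String> docs) {
        Map<String, List<int[]>> index = new TreeMap<>();
        docs.forEach((docId, text) -> {
            // Per-document term frequencies.
            Map<String, Integer> tf = new TreeMap<>();
            for (String w : text.toLowerCase().split("\\s+")) tf.merge(w, 1, Integer::sum);
            // Append one posting per distinct word.
            tf.forEach((w, c) -> index.computeIfAbsent(w, k -> new ArrayList<>())
                                      .add(new int[]{docId, c}));
        });
        return index;
    }
}
```

In the MapReduce version, the mapper emits (word, docId:count) pairs and the reducer concatenates each word's postings.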
Small NLP and IR tasks. (Sedivy)
Introduction to the IR task.
Implement a simple search in Hadoop using BM25: (TG)
Given a query, select the documents containing the query terms.
Calculate BM25 for the selected documents and return a sorted list of their IDs.
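The per-term BM25 contribution for one document can be sketched as follows (k1 and b use their common default values; the parameter names are the standard ones, not taken from the assignment):

```java
// Hedged sketch of BM25 scoring for a single (term, document) pair.
public class Bm25 {
    static final double K1 = 1.2, B = 0.75; // common defaults

    // tf: term frequency in the document, df: documents containing the term,
    // n: total documents, dl: document length, avgdl: average document length.
    public static double score(int tf, int df, int n, double dl, double avgdl) {
        double idf = Math.log(1.0 + (n - df + 0.5) / (df + 0.5));
        return idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * dl / avgdl));
    }
}
```

A document's query score is the sum of this value over the query terms; tf, dl, and avgdl come straight from the index built in the previous class.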
Basic NoSQL (TG):
A review of NoSQL databases.
Implementation in Java.
A faster implementation of 8. using HBase: (TG)
Save the output of step 2. to HBase.
Repeat step 3. based on the HBase data.
Implement a simple command-line query interface.
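The lookup step of that query interface might be shaped like this, with an in-memory map standing in for the HBase table (the real implementation would use the HBase client API; all names here are placeholders):

```java
import java.util.List;
import java.util.Map;

// Sketch of the query interface's lookup step. The Map stands in for an
// HBase table keyed by term, with the posting list as the stored value.
public class QueryInterface {
    // Returns the posting list for a term, or an empty list when absent.
    public static List<Integer> lookup(Map<String, List<Integer>> table, String term) {
        return table.getOrDefault(term.toLowerCase(), List.of());
    }
}
```

Swapping the map for an HBase `Get` by row key is what makes repeated queries fast: only the queried terms' rows are read, instead of rescanning the full index.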
Real-time processing of streamed data - Storm (TB)
Twitter data processing, simple sentiment algorithm (Barton)
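One possible "simple sentiment algorithm" is lexicon counting, sketched here in plain Java (the word lists are assumptions; in Storm this logic would sit inside a bolt processing the tweet stream):

```java
import java.util.Set;

// Illustrative lexicon-based sentiment: count positive words minus negative
// words in a tweet. The tiny lexicons below are placeholders.
public class Sentiment {
    static final Set<String> POS = Set.of("good", "great", "happy");
    static final Set<String> NEG = Set.of("bad", "sad", "awful");

    // Returns positive minus negative hits; > 0 suggests positive sentiment.
    public static int score(String tweet) {
        int s = 0;
        for (String w : tweet.toLowerCase().split("\\W+")) {
            if (POS.contains(w)) s++;
            if (NEG.contains(w)) s--;
        }
        return s;
    }
}
```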
Storm (TB) - practical examples of real-time processing of streamed data
From the Internet to IoT