Machine Learning for Big Data

Under construction.

Scalable, Distributed, (Deep) Machine Learning for Big Data

Big Data - Volume, Variety, Velocity

Parallel Computing and Cloud computing

Lambda architecture: Batch, Speed and Serving Layers
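
As a rough illustration of how the three Lambda layers fit together, here is a minimal single-process sketch in Python. The event log, the `BATCH_HORIZON` cutoff, and all function names are assumptions made up for this example; a real deployment would back each layer with a separate system.

```python
from collections import Counter

# Hypothetical event log of (timestamp, page) pairs; purely illustrative data.
events = [(1, "home"), (2, "about"), (3, "home"), (4, "home"), (5, "about")]

BATCH_HORIZON = 3  # assumption: the last batch run covered timestamps <= 3


def batch_view(log, horizon):
    """Batch layer: recompute counts from the master dataset up to `horizon`."""
    return Counter(page for ts, page in log if ts <= horizon)


def speed_view(log, horizon):
    """Speed layer: count only recent events the batch run has not seen yet."""
    return Counter(page for ts, page in log if ts > horizon)


def serve(page, batch, speed):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch[page] + speed[page]


batch = batch_view(events, BATCH_HORIZON)
speed = speed_view(events, BATCH_HORIZON)
print(serve("home", batch, speed))   # 3
print(serve("about", batch, speed))  # 2
```

The batch view is slow to rebuild but complete up to its horizon; the speed view covers only the gap since the last batch run, so the merged answer stays current.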

Batch processing

MapReduce (MR): a programming model from functional programming

Hadoop: MR implementation from Yahoo

YARN (MRv2 or next gen Hadoop)

Hive: data warehouse

Pig: high level data-flow language

Zookeeper: high-performance coordination service

Chukwa: data collection system

Percolator: incremental web search indexing at Google

Caffeine: search indexing based on Percolator

Farmer/Panda: Google search ranking update, relevant to SEO (Search Engine Optimization)

Tez: Accelerating YARN Query Processing

Cascading: A data processing API and processing query planner

Scalding: An extension to Cascading at Twitter
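
The MapReduce (MR) model at the head of this list can be sketched in a few lines of plain Python. This is an in-memory word count under simplifying assumptions (one process, no fault tolerance); the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative and not taken from Hadoop or any other framework.

```python
from collections import defaultdict


def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in doc.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold the values for one key into a single result."""
    return key, sum(values)


def map_reduce(docs):
    mapped = [pair for doc in docs for pair in map_phase(doc)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


docs = ["big data big models", "big clusters"]
counts = map_reduce(docs)
print(counts)  # {'big': 3, 'data': 1, 'models': 1, 'clusters': 1}
```

In a distributed implementation the map and reduce calls run on different machines and the shuffle moves data over the network; the program structure stays the same.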

Stream processing

◦Apache Thrift: scalable cross-language services from Facebook

◦Apache Flume: stream data collection

◦Storm: Stream processing from Twitter

◦Summingbird: a library for writing MR-style programs that run on Storm and Scalding, at Twitter

◦S4: Stream processing from Yahoo

◦Scribe: server for stream data aggregating at Facebook

◦Data Freeway: data stream at Facebook

◦Puma: Stream processing from Facebook

◦Kafka: distributed messaging system at LinkedIn, then Apache

◦Samza: stream processing from LinkedIn

◦Kinesis: real-time stream processing at Amazon

◦Dremel: Scalable, interactive ad-hoc query system at Google

◦Apache Drill: open-source query engine inspired by Google Dremel (the basis of BigQuery)

◦MillWheel: fault-tolerant (FT) stream processing at Google

◦Apache Flink - Distributed stream and batch data processing

NoSQL (Not Only SQL) databases

◦Google Bigtable

◦Amazon Dynamo

◦Cassandra by Facebook

◦HBase: like Bigtable

NewSQL:

◦Google Spanner

Graph-based

Spark – Lightning-Fast Cluster Computing

GraphLab – Big ML on Graphs at CMU

BSP (Bulk Synchronous Parallel) Model

Google Pregel - BSP based graph computing

Apache Giraph - open source for Pregel

Apache Hama - BSP based ML
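
To make the BSP model concrete, here is a small single-process sketch of a Pregel-style computation: every vertex repeatedly adopts the largest value it has heard of, and the run ends when no messages are in flight. The graph, the function name `pregel_max`, and the message-passing dictionaries are assumptions for illustration, not the API of Pregel, Giraph, or Hama.

```python
def pregel_max(graph, values):
    """Propagate the maximum vertex value with BSP supersteps.

    graph: vertex -> list of neighbours (undirected edges listed both ways).
    values: vertex -> initial numeric value.
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])

    # Later supersteps: compute locally, send messages, then synchronize.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v in graph:
            # Local computation uses only this vertex's state and its inbox.
            if inbox[v] and max(inbox[v]) > values[v]:
                values[v] = max(inbox[v])
                for n in graph[v]:
                    outbox[n].append(values[v])
            # A vertex that learns nothing new sends nothing (votes to halt).
        inbox = outbox  # barrier: messages become visible in the next superstep
    return values


# Hypothetical chain graph a - b - c - d.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = pregel_max(graph, {"a": 3, "b": 6, "c": 2, "d": 1})
print(result)  # {'a': 6, 'b': 6, 'c': 6, 'd': 6}
```

The swap `inbox = outbox` plays the role of the global barrier: no vertex sees a message until the superstep that follows the one in which it was sent.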

Machine learning and some issues

Deep learning: Big model and big data

Large scale machine learning and trade-off

Large Scale Machine Learning

◦Mahout - Scalable ML on Hadoop

◦Jubatus – Distributed Online Real-time ML

◦Vowpal Wabbit – Fast Learning at Yahoo/MS

◦Trident ML and Storm Pattern: ML on Storm, YARN

◦Samoa: ML on S4, Storm

◦DMTK: Distributed Machine Learning Toolkit @ MSR

Issues in Scalable Distributed ML

◦Load balancing

◦Auto scaling

◦Job Scheduling

◦Workflow management

◦Fault tolerance

Data and Model Parallelism

Parameter Server Framework

Peer-to-Peer Framework
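
The following sketch shows one way data parallelism and a parameter server can combine: each worker holds a shard of the data, pulls the current parameters, computes a gradient on its shard, and the server averages the pushed gradients. It is synchronous and single-process, and the class `ParameterServer`, the function `worker_gradient`, and the toy linear-regression data are all assumptions for illustration; model parallelism (splitting the parameters themselves across machines) is not shown.

```python
class ParameterServer:
    """Holds the shared parameters and applies averaged worker gradients."""

    def __init__(self, dim, lr):
        self.w = [0.0] * dim
        self.lr = lr

    def pull(self):
        """Workers fetch a copy of the current parameters."""
        return list(self.w)

    def push(self, grads):
        """Apply one SGD step using the average of the workers' gradients."""
        n = len(grads)
        for j in range(len(self.w)):
            self.w[j] -= self.lr * sum(g[j] for g in grads) / n


def worker_gradient(w, shard):
    """Mean squared-error gradient of y ~ w[0] * x + w[1] on one data shard."""
    g = [0.0, 0.0]
    for x, y in shard:
        err = w[0] * x + w[1] - y
        g[0] += 2.0 * err * x / len(shard)
        g[1] += 2.0 * err / len(shard)
    return g


# Toy data generated from y = 2x + 1, split into two equal shards.
data = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]]
shards = [data[0::2], data[1::2]]

server = ParameterServer(dim=2, lr=0.05)
for _ in range(500):
    w = server.pull()
    grads = [worker_gradient(w, shard) for shard in shards]  # one per worker
    server.push(grads)

print([round(v, 3) for v in server.w])  # close to [2.0, 1.0]
```

Because the shards are equal in size, the averaged gradient equals the full-batch gradient, so this synchronous variant behaves like ordinary gradient descent. A peer-to-peer design would replace the central `push` with an all-reduce among the workers.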

Distributed Deep Learning

◦YahooLDA: Scalable parallel framework in latent variable models

◦DistBelief – Distributed deep learning on cluster

◦H2O – Distributed deep learning on Spark

◦Adam at MSR – distributed deep learning

◦DL4J – open source for DL on Hadoop and Spark

◦Petuum – distributed machine learning

◦SINGA – distributed deep learning

◦TensorFlow: Google large scale distributed DL

◦MXNet: heterogeneous distributed deep learning

◦CaffeOnSpark @ Yahoo / SparkNet with Caffe @ Berkeley

◦CNTK @ MSR

◦PaddlePaddle: deep learning platform @ Baidu

◦Elephas: Keras & Spark

Distributed learning and optimization

◦Proximal splitting/Auxiliary coordinates;

◦Bundle (sub-gradient);

◦Shotgun: parallelized CDM (coordinate descent method)

◦Asynchronous SGD;

◦Hogwild/Dogwild.
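
As a sketch of the Hogwild idea from the list above, the code below runs several threads that update a shared weight vector with no locks, each touching only the coordinates that are active in its current sample. The sparse least-squares problem, `TRUE_W`, the helper names, and the hyperparameters are assumptions chosen for illustration. In CPython the global interpreter lock serializes bytecode, so this shows the lock-free update pattern rather than a real parallel speedup.

```python
import random
import threading

TRUE_W = [1.0, -2.0, 3.0, 0.5]  # hypothetical target weights


def make_samples(n, seed):
    """Build noise-free sparse samples: two active features per sample."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        active = rng.sample(range(len(TRUE_W)), 2)
        feats = {j: rng.uniform(-1.0, 1.0) for j in active}
        y = sum(TRUE_W[j] * x for j, x in feats.items())
        samples.append((feats, y))
    return samples


def hogwild_worker(w, samples, lr, epochs):
    """Run SGD on a shared list `w` without taking any lock."""
    for _ in range(epochs):
        for feats, y in samples:
            # Unsynchronized read: other threads may be writing to w right now.
            err = sum(w[j] * x for j, x in feats.items()) - y
            for j, x in feats.items():
                # Unsynchronized write, restricted to the active coordinates.
                w[j] -= lr * err * x


w = [0.0] * len(TRUE_W)
threads = [
    threading.Thread(target=hogwild_worker, args=(w, make_samples(200, seed), 0.1, 50))
    for seed in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print([round(v, 2) for v in w])  # close to TRUE_W
```

The scheme relies on sparsity: when each update touches few coordinates, conflicting writes are rare, and the occasional lost update acts like a small amount of extra noise.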
