Big Data - Volume, Variety, Velocity
Parallel Computing and Cloud Computing
Lambda architecture: Batch, Speed and Serving Layers
Batch processing
MR (MapReduce): a programming model from functional programming
Hadoop: MapReduce implementation from Yahoo
YARN (MRv2, or next-generation Hadoop)
Hive: data warehouse
Pig: high level data-flow language
Zookeeper: high-performance coordination service
Chukwa: data collection system
Percolator: web search index at Google
Caffeine: search indexing based on Percolator
Farmer/Panda: Google SEO (Search Engine Optimization)
Tez: Accelerating YARN Query Processing
Cascading: A data processing API and processing query planner
Scalding: An extension to Cascading at Twitter
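To make the MapReduce programming model listed above concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code; the helper names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and a real framework would run the map and reduce calls on different machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: fold the values of one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    # In a cluster, each document (or input split) would be mapped on a different node.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big model", "big cluster"])
print(counts)
```

The map and reduce functions are pure, which is what lets a framework re-run them on another node after a failure.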
Stream processing
◦Apache Thrift: scalable cross-language services from Facebook
◦Apache Flume: stream data collection
◦Storm: Stream processing from Twitter
◦Summingbird: a library for writing MapReduce-style programs that run on Storm and Scalding, at Twitter
◦S4: Stream processing from Yahoo
◦Scribe: server for aggregating streaming log data at Facebook
◦Data Freeway: data stream at Facebook
◦Puma: Stream processing from Facebook
◦Kafka: distributed messaging system at LinkedIn, then Apache
◦Samza: stream processing from LinkedIn
◦Kinesis: real-time stream processing at Amazon
◦Dremel: Scalable, interactive ad-hoc query system at Google
◦Apache Drill: open-source query engine inspired by Google Dremel/BigQuery
◦MillWheel: fault-tolerant stream processing at Google
◦Apache Flink - Distributed stream and batch data processing
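One way to picture how the batch and stream systems above fit the Lambda architecture's batch, speed and serving layers is the toy event counter below. It is a sketch under simplifying assumptions: the class name `LambdaCounter` and its methods are hypothetical, everything lives in one process, and the batch run is treated as instantaneous (in a real deployment, events arriving during a batch run must stay in the speed view).

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda architecture: count events per key."""

    def __init__(self):
        self.master_log = []         # append-only master dataset (batch layer input)
        self.batch_view = Counter()  # view precomputed by the batch layer
        self.speed_view = Counter()  # incremental view of events since the last batch run

    def ingest(self, key):
        """New event: store it for future batch runs and update the speed layer."""
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Batch layer: recompute the view from scratch over all stored events."""
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()  # the batch view now covers these events

    def query(self, key):
        """Serving layer: merge the batch view with the real-time view."""
        return self.batch_view[key] + self.speed_view[key]


lam = LambdaCounter()
for event in ["click", "view", "click"]:
    lam.ingest(event)
before_batch = lam.query("click")   # answered by the speed view alone
lam.run_batch()
lam.ingest("click")
after_batch = lam.query("click")    # batch view plus one recent event
print(before_batch, after_batch)
```

The design point is that the batch view can always be rebuilt from the master dataset, so errors in the incremental speed layer are corrected at the next batch run.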
NoSQL (Not Only SQL) databases
◦Google Bigtable
◦Amazon Dynamo
◦Cassandra by Facebook
◦HBase: like Bigtable
NewSQL:
◦Google Spanner
Graph-based
Spark – Lightning-Fast Cluster Computing
GraphLab – Big ML on Graphs at CMU
BSP (Bulk Synchronous Parallel) Model
Google Pregel - BSP based graph computing
Apache Giraph - open source for Pregel
Apache Hama - BSP based ML
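The BSP model behind Pregel, Giraph and Hama can be sketched as a loop of supersteps: each vertex reads the messages sent to it in the previous superstep, updates its state, sends new messages, and then all vertices wait at a barrier. The example below propagates the maximum value through a graph. It is a single-process illustration, not the Pregel or Giraph API; `pregel_max` and its arguments are assumed names.

```python
def pregel_max(graph, values):
    """Propagate the maximum vertex value with BSP-style supersteps.

    graph:  dict mapping each vertex to a list of out-neighbours
    values: dict mapping each vertex to its initial number
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])
    supersteps = 1

    # Keep running supersteps while any message is in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)          # local compute
                for n in graph[v]:
                    outbox[n].append(values[v])    # communicate
            # Otherwise the vertex sends nothing, i.e. it votes to halt.
        inbox = outbox                             # barrier synchronization
        supersteps += 1
    return values, supersteps


chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
start = {"a": 3, "b": 6, "c": 2, "d": 1}
final, steps = pregel_max(chain, start)
print(final, steps)
```

Messages produced in one superstep only become visible in the next, which is why the outbox is swapped in at the barrier rather than read immediately.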
Machine learning and some issues
Deep learning: Big model and big data
Large-scale machine learning and its trade-offs
Large Scale Machine Learning
◦Mahout - Scalable ML on Hadoop
◦Jubatus – Distributed Online Real-time ML
◦Vowpal Wabbit – Fast Learning at Yahoo/MS
◦Trident ML and Storm Pattern: ML on Storm, YARN
◦Samoa: ML on S4, Storm
◦DMTK: Distributed Machine Learning Toolkit @ MSR
Issues in Scalable Distributed ML
◦Load balancing
◦Auto scaling
◦Job Scheduling
◦Workflow management
◦Fault tolerance
Data and Model Parallelism
Parameter Server Framework
Peer-to-Peer Framework
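As an illustration of data parallelism with a parameter server, the sketch below fits a one-parameter linear model. Each simulated worker computes a gradient on its own data shard, and the server averages the pushed gradients and updates the shared weight. This is a synchronous, single-process toy; `ParameterServer`, `worker_gradient` and `train` are hypothetical names rather than any framework's API.

```python
class ParameterServer:
    """Holds the shared model parameter and applies pushed gradients."""

    def __init__(self, w, lr):
        self.w = w
        self.lr = lr

    def pull(self):
        """Workers fetch the current parameter value."""
        return self.w

    def push(self, grads):
        """Average the workers' gradients and take one gradient step."""
        self.w -= self.lr * sum(grads) / len(grads)


def worker_gradient(w, shard):
    """Gradient of the mean of 0.5 * (w*x - y)^2 over one data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def train(shards, steps=200, lr=0.05):
    server = ParameterServer(w=0.0, lr=lr)
    for _ in range(steps):
        w = server.pull()
        # In a cluster these gradient computations would run in parallel, one per worker.
        grads = [worker_gradient(w, shard) for shard in shards]
        server.push(grads)
    return server.w


# Data generated from y = 3x, split into two shards (data parallelism).
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w_hat = train(shards)
print(w_hat)
```

Model parallelism would instead split the parameters themselves across machines; with a single weight there is nothing to split, so only the data-parallel side is shown.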
Distributed Deep Learning
◦YahooLDA: Scalable parallel framework in latent variable models
◦DistBelief – Distributed deep learning on cluster
◦H2O – Distributed deep learning on Spark
◦Adam at MSR – distributed deep learning
◦DL4J – open source for DL on Hadoop and Spark
◦Petuum – distributed machine learning
◦SINGA – distributed deep learning
◦TensorFlow: Google large scale distributed DL
◦MXNet: heterogeneous distributed deep learning
◦CaffeOnSpark @ Yahoo / SparkNet with Caffe @ Berkeley
◦CNTK @MSR
◦PaddlePaddle: deep learning platform@Baidu
◦Elephas: Keras & Spark
Distributed learning and optimization
◦Proximal splitting/Auxiliary coordinates
◦Bundle methods (sub-gradient)
◦Shotgun: parallelized CDM (coordinate descent method)
◦Asynchronous SGD
◦Hogwild/Dogwild
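The idea behind Hogwild-style asynchronous SGD can be sketched with threads that update a shared parameter vector without any lock, relying on each update touching only a few coordinates. The code below is a toy under stated assumptions: `hogwild_sgd` and the sample format `(index, x, y)` are invented for illustration, and because CPython's global interpreter lock serializes bytecode execution, it shows the lock-free structure rather than a real speedup.

```python
import random
import threading


def hogwild_sgd(samples, dim, n_threads=4, epochs=50, lr=0.1):
    """Lock-free parallel SGD on sparse samples (index, x, y) with y ~ w[index] * x."""
    w = [0.0] * dim  # shared parameters, deliberately not protected by a lock

    def worker(seed):
        rng = random.Random(seed)
        local = list(samples)
        for _ in range(epochs):
            rng.shuffle(local)
            for i, x, y in local:
                grad = (w[i] * x - y) * x  # gradient touches only coordinate i
                w[i] -= lr * grad          # unsynchronized read-modify-write

    threads = [threading.Thread(target=worker, args=(seed,)) for seed in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w


true_w = [1.0, -2.0, 0.5]
samples = [(i, x, true_w[i] * x) for i in range(3) for x in (0.5, 1.0, 1.5)]
w_est = hogwild_sgd(samples, dim=3)
print(w_est)
```

Occasional lost or stale updates are tolerated by design; the sparsity of each update is what keeps such collisions rare in the setting Hogwild targets.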