https://habrahabr.ru/company/oleg-bunin/blog/351308/ comparing Druid / Pinot / ClickHouse
https://blog.codecentric.de/en/2017/03/distributed-stream-processing-frameworks-fast-big-data/
https://habrahabr.ru/company/oleg-bunin/blog/319052/
https://habrahabr.ru/post/317874/
http://statrgy.com/2015/05/20/best-sql-on-hadoop-tool/
http://bigdata.black/architecture/hadoop/sql-engines-hadoop-hive-spark-impala/
https://habrahabr.ru/company/smi2/blog/314558/ ClickHouse
http://docs.h2o.ai/steam/latest-stable/steam-docs/
https://dzone.com/articles/olap-for-big-data
http://www.ieee-hpec.org/2014/CD/index_htm_files/FinalPapers/31.pdf Apache Accumulo
https://github.com/logv/snorkel
https://engineeringblog.yelp.com/2016/10/redshift-connector.html
https://habrahabr.ru/company/tcsbank/blog/310620/ comparison of NoSQL databases
https://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
http://www.lewisgavin.co.uk/CDH-Docker/
http://www.lewisgavin.co.uk/Kudu-Spark/
Apache Sqoop is an open-source solution for transferring data between HDFS and relational database systems: it bulk-loads data from a relational database into HDFS (import) and bulk-writes data from HDFS back to a relational database (export).
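A rough sketch of both directions, driven from Python for convenience; the JDBC URL, credentials, table names, and HDFS paths are hypothetical:

```python
import subprocess

# Hypothetical connection details; adjust for your environment.
JDBC_URL = "jdbc:mysql://db.example.com/shop"

# Import: bulk-load the `orders` table from the RDBMS into HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", JDBC_URL,
    "--username", "etl_user",
    "--password-file", "/user/etl/.password",  # avoid plaintext passwords
    "--table", "orders",
    "--target-dir", "/data/orders",
    "--num-mappers", "4",   # parallel map tasks, split on the primary key
], check=True)

# Export: bulk-write HDFS files back into an RDBMS table.
subprocess.run([
    "sqoop", "export",
    "--connect", JDBC_URL,
    "--username", "etl_user",
    "--password-file", "/user/etl/.password",
    "--table", "order_aggregates",
    "--export-dir", "/data/order_aggregates",
], check=True)
```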
Apache Accumulo is a sorted, distributed key/value store: a robust, scalable, high-performance data storage system featuring cell-based access control and customizable server-side processing. It is based on Google's BigTable design and is built on top of Apache Hadoop, Apache ZooKeeper, and Apache Thrift.
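A toy model of Accumulo's sorted key layout and cell-level visibility filtering (real Accumulo evaluates visibility labels as boolean expressions such as "admin|(audit&finance)"; this sketch reduces them to a set-membership check):

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Key:
    row: str
    family: str
    qualifier: str
    visibility: str   # label the reader's authorizations must include
    timestamp: int

def scan(table, authorizations):
    """Yield cells in sorted key order, filtering on the visibility label."""
    for key in sorted(table):
        if key.visibility == "" or key.visibility in authorizations:
            yield key, table[key]

table = {
    Key("user#42", "info", "name", "", 1): "alice",
    Key("user#42", "info", "ssn", "pii", 1): "123-45-6789",
}

# A reader without the "pii" authorization never sees the SSN cell.
print(list(scan(table, authorizations={"public"})))
```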
Apache Phoenix takes a SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. Direct use of the HBase API, along with coprocessors and custom filters, yields performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. All standard SQL query constructs are supported, including SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, etc. It also supports a full set of DML commands as well as table creation and versioned incremental alterations through its DDL commands.
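A minimal sketch of querying Phoenix from Python, assuming a Phoenix Query Server on localhost:8765 and the `phoenixdb` package; the table and data are made up:

```python
import phoenixdb

# Connect to the Phoenix Query Server (HTTP/Avatica endpoint).
conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        host VARCHAR NOT NULL,
        ts   BIGINT  NOT NULL,
        hits INTEGER
        CONSTRAINT pk PRIMARY KEY (host, ts)
    )
""")
cur.execute("UPSERT INTO events VALUES (?, ?, ?)", ("web-1", 1700000000, 7))

# Phoenix compiles this into HBase scans and returns a regular result set.
cur.execute("""
    SELECT host, SUM(hits) FROM events
    WHERE ts > ? GROUP BY host ORDER BY host
""", (1690000000,))
print(cur.fetchall())
```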
Apache Drill supports standard SQL.
- JSON document model
- Columnar execution engine (the first ever to support complex data!)
- Integration with Tableau
Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
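A hedged sketch of such a cross-datastore join via Drill's REST API, assuming a drillbit with its web UI on localhost:8047 and hypothetical `mongo` and `dfs` storage plugin names:

```python
import requests

# Join a MongoDB collection against event logs on HDFS in one query;
# Drill federates the two storage plugins at query time.
query = """
    SELECT u.name, COUNT(*) AS events
    FROM mongo.app.users u
    JOIN dfs.`/logs/events` e ON u.id = e.user_id
    GROUP BY u.name
"""
resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL", "query": query},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["rows"])
```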
FILE FORMATS
Row-oriented:
- SequenceFile: binary key-value pairs; supports splitting even if the data is compressed
- Avro: schema is encoded with the data; block compression and splittable; schema evolution
Hybrid:
- ORC (Optimized Row Columnar): evolution of RCFile
Columnar:
- Apache Kudu (mutable on disk) https://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/media/documents/kudu_bbuzz.pdf
- Apache Parquet (immutable on disk)
- Apache Arrow (in memory)
Parquet is the Hadoop columnar storage format; Arrow is the in-memory counterpart (see the sketch below).
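A small sketch contrasting the row-oriented and columnar formats above, using the `fastavro` and `pyarrow` packages; schema and file names are made up:

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq
from fastavro import parse_schema, reader, writer

records = [{"user": "alice", "hits": 3}, {"user": "bob", "hits": 5}]

# Avro: row-oriented, schema stored with the data, splittable block format.
schema = parse_schema({
    "name": "Hit", "type": "record",
    "fields": [{"name": "user", "type": "string"},
               {"name": "hits", "type": "int"}],
})
buf = io.BytesIO()
writer(buf, schema, records, codec="deflate")   # block compression
buf.seek(0)
print(list(reader(buf)))

# Parquet: columnar and immutable on disk; Arrow is the in-memory twin.
table = pa.table({
    "user": [r["user"] for r in records],
    "hits": [r["hits"] for r in records],
})
pq.write_table(table, "hits.parquet")
print(pq.read_table("hits.parquet", columns=["hits"]))  # read one column only
```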
http://arnon.me/2015/08/spark-parquet-s3/
https://www.alooma.com/blog/kafka-realtime-visualization
https://github.com/unnati-xyz/fifthel-2016-workshop DataScience pipeline
https://arrow.apache.org/ in-memory columnar data format
Scalability: scale up (more RAM, more CPU, more HDD) vs. scale out (shared nothing; commodity PCs)
Mismatch in data model and query pattern
Write scalability
Read scalability
Data scalability
Data Access Patterns
Random Reads
Random Writes
Sequential Reads
Sequential Writes
Filter/Search by primary/secondary index
Sharding:
Hash-based (pro: even distribution; contra: no data locality)
Implemented in: Cassandra, Redis, Dynamo
Consistent hashing (see the sketch after this list)
Range-based (pro: enables range scans and sorting; contra: repartitioning/rebalancing required)
Implemented in: HBase
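A minimal consistent-hashing sketch (the hash function, virtual-node count, and node names are arbitrary choices):

```python
import bisect
import hashlib

# Keys map to the first node clockwise from their hash on the ring, so
# adding or removing a node only moves the keys adjacent to it.
class Ring:
    def __init__(self, nodes, vnodes=64):
        self._points = []                 # sorted (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):       # virtual nodes even out the load
                self._points.append((self._hash(f"{node}#{i}"), node))
        self._points.sort()

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._points, (h, "")) % len(self._points)
        return self._points[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user#42"))
```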
Data Model:
- key-value
- wide column: rowkey (sorted), column, timestamp -> value (Cassandra, HBase)
- document: (collection, key) -> document
- graph
SST: sorted string tables; append-only. LSMT: log-structured merge trees.
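A toy LSM tree to make the idea concrete; real systems add write-ahead logs, bloom filters, and background compaction:

```python
import bisect

# Writes go to an in-memory memtable; when it fills up it is flushed as an
# immutable sorted run (an SSTable). Reads check the memtable first, then
# runs from newest to oldest.
class LSMTree:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []                 # newest first; each run is sorted
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            run = sorted(self.memtable.items())
            self.sstables.insert(0, run)   # flush as immutable sorted run
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:          # newest run shadows older ones
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

tree = LSMTree()
for i in range(10):
    tree.put(f"k{i}", i)
print(tree.get("k3"))
```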
Replication:
how: sync (HBase), async (Cassandra)
where:
multi-master: needs consensus (e.g. Paxos) or becomes inconsistent
master-slave: only the master accepts writes (the master is a bottleneck); slaves are read replicas
ACID:
- atomicity
- consistency
- isolation
- durability
CAP:
- consistency: all clients have the same view of the data
- availability: every request must return a result
- partition tolerance: the system has to keep working despite network partitions
https://www.linkedin.com/pulse/next-generation-analytics-apocalypse-when-spark-drill-john-de-goes
https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/
http://www.tpc.org/tpcx-bb/ benchmark for analytical processing
https://news.ycombinator.com/item?id=12166370 eventQL
https://hadoopecosystemtable.github.io/
https://news.ycombinator.com/item?id=11118274 Apache Arrow
https://dzone.com/articles/prescient-transforms-48000-data-sources-in-real-ti Apache NiFi
http://blog.premium-minds.com/akka-to-the-rescue/
Lambda architecture: the basic idea is that you run a streaming system alongside a batch system, both performing essentially the same calculation. The streaming system gives you low-latency, inaccurate results (either because of the use of an approximation algorithm, or because the streaming system itself does not provide correctness), and some time later a batch system rolls along and provides you with correct output.
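A toy illustration of that batch/speed split (all names and storage are made up; real deployments use e.g. Hadoop for the batch views and Storm or Spark Streaming plus a fast KV store for the speed layer):

```python
batch_view = {}   # exact counts, recomputed from the full log periodically
speed_view = {}   # incremental counts for events since the last batch run

def on_event(key):
    # Speed layer: low-latency, possibly inaccurate.
    speed_view[key] = speed_view.get(key, 0) + 1

def run_batch(full_log):
    # Batch layer: full recomputation over all data, correct by construction.
    global batch_view, speed_view
    view = {}
    for key in full_log:
        view[key] = view.get(key, 0) + 1
    batch_view, speed_view = view, {}   # batch output supersedes speed layer

def query(key):
    # Serving layer merges both views.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

for k in ["a", "b", "a"]:
    on_event(k)
print(query("a"))              # 2, from the speed layer
run_batch(["a", "b", "a"])
print(query("a"))              # 2, now from the exact batch view
```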
Hive provides a SQL-like query language (HiveQL) but does not offer interactive querying yet; it only runs batch processes on Hadoop.
HBase is a NoSQL key/value store that runs on top of HDFS. HBase operations run in real time against its tables rather than as MapReduce jobs.
HBase is a column-oriented datastore; it's all about performance.
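A minimal sketch of real-time reads and writes against HBase from Python, assuming an HBase Thrift server on localhost and the `happybase` package; table and column names are made up:

```python
import happybase

conn = happybase.Connection("localhost")
table = conn.table("metrics")

# Writes and reads hit HBase directly, in real time; no MapReduce involved.
table.put(b"host1#2017-01-01", {b"cf:cpu": b"0.87", b"cf:mem": b"0.42"})
print(table.row(b"host1#2017-01-01"))

# Row keys are stored sorted, so prefix/range scans are cheap.
for key, data in table.scan(row_prefix=b"host1#"):
    print(key, data)
```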
IMPALA
http://www.slideshare.net/cloudera/the-impala-cookbook-42530186
FLUME
https://habrahabr.ru/company/dca/blog/280386/
https://habrahabr.ru/company/dca/blog/281933/
https://www.linkedin.com/pulse/creating-data-pipeline-using-flume-kafka-spark-hive-mouzzam-hussain
https://www.cloudera.com/documentation/kafka/latest/topics/kafka_flume.html
http://datametica.com/integration-of-spark-streaming-with-flume/
Classes
https://habrahabr.ru/company/e_contenta/blog/276661/
https://university.cloudera.com/content/cca175 Cloudera Exam
http://www.cloudera.com/content/www/en-us/training/certification/cca-spark.html
http://www.cloudera.com/content/www/en-us/training/certification/ccp-data-engineer.html
http://hortonworks.com/products/hortonworks-sandbox/#install
https://cloudacademy.com/amazon-web-services/amazon-machine-learning-course/
http://jonathanmace.github.io/bigdatasurvey/
https://zeef.com/?query=Apache%20Spark&in=all
https://news.ycombinator.com/item?id=10772141
KAFKA
Kafka is a publish-subscribe messaging system. It decouples data producers from data consumers, and its intermediate storage and buffering help absorb high-velocity data.
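A minimal producer/consumer sketch with the `kafka-python` package, assuming a broker on localhost:9092; the topic name is made up:

```python
from kafka import KafkaConsumer, KafkaProducer

# The producer and consumer never talk to each other directly: the broker
# stores and buffers messages between them.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", key=b"user42", value=b'{"url": "/home"}')
producer.flush()

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    group_id="analytics",            # consumer group for parallel reads
    auto_offset_reset="earliest",    # start from the retained backlog
)
for msg in consumer:
    print(msg.key, msg.value, msg.offset)
    break
```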
https://www.infoq.com/presentations/etl-streams
http://www.jesse-anderson.com/2016/09/solving-the-first-and-last-mile-problem-with-kafka-part-1/
http://www.jesse-anderson.com/2016/09/solving-the-first-and-last-mile-problem-with-kafka-part-2/
https://www.semanticscholar.org/search?q=Apache%20Kafka
http://www.confluent.io/blog/building-a-streaming-analytics-stack-with-apache-kafka-and-druid
https://softwaremill.com/kafka-streams-how-does-it-fit-stream-landscape/
http://www.tutorialspoint.com/apache_kafka
http://sookocheff.com/post/kafka/kafka-in-a-nutshell/
http://mkuthan.github.io/blog/2016/01/29/spark-kafka-integration2/
https://player.oreilly.com/videos/9781491931028
DRUID
https://news.ycombinator.com/item?id=11400681
https://blog.codecentric.de/en/2016/08/realtime-fast-data-analytics-druid/
http://www.confluent.io/blog/building-a-streaming-analytics-stack-with-apache-kafka-and-druid
https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani
https://github.com/implydata/pivot UI for Druid
https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
http://imply.io/post/2016/07/05/exactly-once-streaming-ingestion.html
Online class
https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
https://www.dataquest.io/mission/123/introduction-to-spark/
https://www.dataquest.io/section/big-data-tools
http://www.planetcassandra.org/getting-started-with-apache-cassandra-and-java/
http://www.infoq.com/presentations/parquet
https://www.youtube.com/watch?v=3UfZN59Nsk8 Google DataFlow
DB engines
https://kylin.apache.org/
http://druid.io/ DRUID column-oriented distributed data store
http://habrahabr.ru/post/272267/ KUDU column-oriented distributed data store
http://geode.incubator.apache.org/ Geode
Apache Geode and Apache Ignite are more similar than they are different.
Apache Ignite is based on the commercial product GridGain http://habrahabr.ru/post/271475/
Apache Geode is based on the commercial product GemFire and has a long history in the market.
http://blog.brakmic.com/stream-processing-with-apache-flink/
https://www.youtube.com/watch?v=pqp7gLt_MFY
https://phdata.io/real-time-analytics-on-medical-device-data/
http://blog.acolyer.org/2015/04/27/musketeer-part-i-whats-the-best-data-processing-system/
http://thenewstack.io/building-streaming-data-hub-elasticsearch-kafka-cassandra/
https://news.ycombinator.com/item?id=10337154
https://courses.edx.org/courses/BerkeleyX/CS190.1x/1T2015/info
https://cloudacademy.com/amazon-web-services/courses/amazon-machine-learning
https://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-processing/
http://danoyoung.blogspot.com/2015/10/and-bobs-your-uncle.html
http://dataintensive.net/ BOOK
https://databricks.com/blog/2015/06/16/zen-and-the-art-of-spark-maintenance-with-cassandra.html
http://www.spark.tc/real-time-application-performance-profiling-using-spark/
http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html
http://habrahabr.ru/company/it-grad/blog/264549/ STORM KAFKA
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
http://thomaswdinsmore.com/2015/07/13/big-analytics-roundup-july-13-2015/
http://www.ibmbigdatahub.com/technology/hadoop-and-spark
https://news.ycombinator.com/item?id=8880259
Drill is a SQL engine and therefore in the same league as Apache Hive, Apache Tajo, or Cloudera's Impala.
Flink (and Spark) focus on use cases that exceed pure SQL (plus a few UDFs), such as graph processing, machine learning, and highly custom data flows.
In fact, the use cases of Spark and Flink overlap a bit. However, the technology used under the hood is quite different. Flink shares a lot of similarities with relational DBMSs: data is serialized into byte buffers and largely processed in binary representation, which also allows for fine-grained memory control. Flink uses a pipelined processing model and has a cost-based optimizer that selects execution strategies and avoids expensive partitioning and sorting steps. Moreover, Flink features a special kind of iteration (delta iterations) that can significantly reduce the amount of computation as iterations progress; the vertex-centric computing model of Pregel / Giraph is a special case of this.
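A toy version of the delta-iteration idea in plain Python (an illustration of the concept, not Flink's API): single-source shortest paths on a made-up graph, where each round only re-processes vertices whose value changed.

```python
graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}

dist = {v: float("inf") for v in graph}
dist["a"] = 0
workset = {"a"}                        # only changed vertices live here

while workset:
    next_workset = set()
    for v in workset:                  # a bulk iteration would scan every vertex
        for neighbor, w in graph[v]:
            if dist[v] + w < dist[neighbor]:
                dist[neighbor] = dist[v] + w
                next_workset.add(neighbor)   # propagate only actual changes
    workset = next_workset             # workset shrinks as we converge

print(dist)                            # {'a': 0, 'b': 1, 'c': 2}
```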