https://habrahabr.ru/company/oleg-bunin/blog/351308/ comparing Druid / Pinot / ClickHouse
https://blog.codecentric.de/en/2017/03/distributed-stream-processing-frameworks-fast-big-data/
https://habrahabr.ru/company/oleg-bunin/blog/319052/
https://habrahabr.ru/post/317874/
http://statrgy.com/2015/05/20/best-sql-on-hadoop-tool/
http://bigdata.black/architecture/hadoop/sql-engines-hadoop-hive-spark-impala/
https://habrahabr.ru/company/smi2/blog/314558/ ClickHouse
http://docs.h2o.ai/steam/latest-stable/steam-docs/
https://dzone.com/articles/olap-for-big-data
http://www.ieee-hpec.org/2014/CD/index_htm_files/FinalPapers/31.pdf Apache Accumulo
https://github.com/logv/snorkel
https://engineeringblog.yelp.com/2016/10/redshift-connector.html
https://habrahabr.ru/company/tcsbank/blog/310620/ comparison of NoSQL databases
https://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
http://www.lewisgavin.co.uk/CDH-Docker/
http://www.lewisgavin.co.uk/Kudu-Spark/
Apache Sqoop is an open-source solution for transferring data between HDFS and relational database systems: it bulk-loads data from a relational database into HDFS (import) and bulk-writes data from HDFS back to a relational database (export).
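A rough sketch of both directions, driven from Python for convenience; the JDBC URL, credentials, table names, and HDFS paths are hypothetical:

```python
import subprocess

# Hypothetical connection details; adjust for your environment.
JDBC_URL = "jdbc:mysql://db.example.com/shop"

# Import: bulk-load the `orders` table from the RDBMS into HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", JDBC_URL,
    "--username", "etl_user",
    "--password-file", "/user/etl/.password",  # avoid plaintext passwords
    "--table", "orders",
    "--target-dir", "/data/orders",
    "--num-mappers", "4",   # parallel map tasks, split on the primary key
], check=True)

# Export: bulk-write HDFS files back into an RDBMS table.
subprocess.run([
    "sqoop", "export",
    "--connect", JDBC_URL,
    "--username", "etl_user",
    "--password-file", "/user/etl/.password",
    "--table", "order_aggregates",
    "--export-dir", "/data/order_aggregates",
], check=True)
```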
Apache Accumulo is a sorted, distributed key/value store: a robust, scalable, high-performance data storage system featuring cell-based access control and customizable server-side processing. It is based on Google's BigTable design and is built on top of Apache Hadoop, Apache ZooKeeper, and Apache Thrift.
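A toy model of Accumulo's sorted key layout and cell-level visibility filtering (real Accumulo evaluates visibility labels as boolean expressions such as "admin|(audit&finance)"; this sketch reduces them to a set-membership check):

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Key:
    row: str
    family: str
    qualifier: str
    visibility: str   # label the reader's authorizations must include
    timestamp: int

def scan(table, authorizations):
    """Yield cells in sorted key order, filtering on the visibility label."""
    for key in sorted(table):
        if key.visibility == "" or key.visibility in authorizations:
            yield key, table[key]

table = {
    Key("user#42", "info", "name", "", 1): "alice",
    Key("user#42", "info", "ssn", "pii", 1): "123-45-6789",
}

# A reader without the "pii" authorization never sees the SSN cell.
print(list(scan(table, authorizations={"public"})))
```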
Apache Phoenix takes a SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. Direct use of the HBase API, along with coprocessors and custom filters, yields performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. All standard SQL query constructs are supported, including SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, etc. It also supports a full set of DML commands as well as table creation and versioned incremental alterations through its DDL commands.
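A minimal sketch of querying Phoenix from Python, assuming a Phoenix Query Server on localhost:8765 and the `phoenixdb` package; the table and data are made up:

```python
import phoenixdb

# Connect to the Phoenix Query Server (HTTP/Avatica endpoint).
conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        host VARCHAR NOT NULL,
        ts   BIGINT  NOT NULL,
        hits INTEGER
        CONSTRAINT pk PRIMARY KEY (host, ts)
    )
""")
cur.execute("UPSERT INTO events VALUES (?, ?, ?)", ("web-1", 1700000000, 7))

# Phoenix compiles this into HBase scans and returns a regular result set.
cur.execute("""
    SELECT host, SUM(hits) FROM events
    WHERE ts > ? GROUP BY host ORDER BY host
""", (1690000000,))
print(cur.fetchall())
```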
Apache Drill supports standard SQL.
- JSON document model
- Columnar execution engine (the first ever to support complex data!)
- Integration with Tableau
Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
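A hedged sketch of such a cross-datastore join via Drill's REST API, assuming a drillbit with its web UI on localhost:8047 and hypothetical `mongo` and `dfs` storage plugin names:

```python
import requests

# Join a MongoDB collection against event logs on HDFS in one query;
# Drill federates the two storage plugins at query time.
query = """
    SELECT u.name, COUNT(*) AS events
    FROM mongo.app.users u
    JOIN dfs.`/logs/events` e ON u.id = e.user_id
    GROUP BY u.name
"""
resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL", "query": query},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["rows"])
```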
FILE FORMATS
Row-oriented:
- SequenceFile: binary key-value pairs; supports splitting even if the data is compressed
- Avro: schema is encoded with the data; block compression and splittable; schema evolution
Hybrid:
- ORC (Optimized Row Columnar): evolution of RCFile
Columnar:
- Apache Kudu (mutable on disk) https://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/media/documents/kudu_bbuzz.pdf
- Apache Parquet (immutable on disk)
- Apache Arrow (in memory)
Parquet is the Hadoop columnar storage format; Arrow is the in-memory counterpart (see the sketch below).
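A small sketch contrasting the row-oriented and columnar formats above, using the `fastavro` and `pyarrow` packages; schema and file names are made up:

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq
from fastavro import parse_schema, reader, writer

records = [{"user": "alice", "hits": 3}, {"user": "bob", "hits": 5}]

# Avro: row-oriented, schema stored with the data, splittable block format.
schema = parse_schema({
    "name": "Hit", "type": "record",
    "fields": [{"name": "user", "type": "string"},
               {"name": "hits", "type": "int"}],
})
buf = io.BytesIO()
writer(buf, schema, records, codec="deflate")   # block compression
buf.seek(0)
print(list(reader(buf)))

# Parquet: columnar and immutable on disk; Arrow is the in-memory twin.
table = pa.table({
    "user": [r["user"] for r in records],
    "hits": [r["hits"] for r in records],
})
pq.write_table(table, "hits.parquet")
print(pq.read_table("hits.parquet", columns=["hits"]))  # read one column only
```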
http://arnon.me/2015/08/spark-parquet-s3/
https://www.alooma.com/blog/kafka-realtime-visualization
https://github.com/unnati-xyz/fifthel-2016-workshop DataScience pipeline
https://arrow.apache.org/ in-memory columnar data format
Scalability: scale up (more RAM, more CPU, more HDD) vs. scale out (shared nothing; commodity PCs)
Mismatch in data model and query pattern
Write scalability
Read scalability
Data scalability
Data Access Patterns
Random Reads
Random Writes
Sequential Reads
Sequential Writes
Filter/Search by primary/secondary index
Sharding:
Hash-based (pro: even distribution; contra: no data locality)
Implemented in: Cassandra, Redis, Dynamo
Consistent hashing (see the sketch after this list)
Range-based (pro: enables range scans and sorting; contra: repartitioning/rebalancing required)
Implemented in: HBase
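A minimal consistent-hashing sketch (the hash function, virtual-node count, and node names are arbitrary choices):

```python
import bisect
import hashlib

# Keys map to the first node clockwise from their hash on the ring, so
# adding or removing a node only moves the keys adjacent to it.
class Ring:
    def __init__(self, nodes, vnodes=64):
        self._points = []                 # sorted (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):       # virtual nodes even out the load
                self._points.append((self._hash(f"{node}#{i}"), node))
        self._points.sort()

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._points, (h, "")) % len(self._points)
        return self._points[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user#42"))
```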
Data Model:
- key-value
- wide column: rowkey (sorted), column, timestamp -> value (Cassandra, HBase)
- document: (collection, key) -> document
- graph
SST: sorted string tables; append-only. LSMT: log-structured merge trees.
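A toy LSM tree to make the idea concrete; real systems add write-ahead logs, bloom filters, and background compaction:

```python
import bisect

# Writes go to an in-memory memtable; when it fills up it is flushed as an
# immutable sorted run (an SSTable). Reads check the memtable first, then
# runs from newest to oldest.
class LSMTree:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []                 # newest first; each run is sorted
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            run = sorted(self.memtable.items())
            self.sstables.insert(0, run)   # flush as immutable sorted run
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:          # newest run shadows older ones
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

tree = LSMTree()
for i in range(10):
    tree.put(f"k{i}", i)
print(tree.get("k3"))
```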
Replication:
how: sync (HBase), async (Cassandra)
where:
multi-master: needs consensus (e.g. Paxos) or becomes inconsistent
master-slave: only the master accepts writes (the master is a bottleneck); slaves are read replicas
ACID:
- atomicity
- consistency
- isolation
- durability
CAP:
- consistency: all clients have the same view of the data
- availability: every request must return a result
- partition tolerance: the system has to keep working despite network partitions
https://www.linkedin.com/pulse/next-generation-analytics-apocalypse-when-spark-drill-john-de-goes
https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/
http://www.tpc.org/tpcx-bb/ benchmark for analytical processing
https://news.ycombinator.com/item?id=12166370 eventQL
https://hadoopecosystemtable.github.io/
https://news.ycombinator.com/item?id=11118274 Apache Arrow
https://dzone.com/articles/prescient-transforms-48000-data-sources-in-real-ti Apache NiFi
http://blog.premium-minds.com/akka-to-the-rescue/
Lambda architecture: the basic idea is that you run a streaming system alongside a batch system, both performing essentially the same calculation. The streaming system gives you low-latency, inaccurate results (either because of the use of an approximation algorithm, or because the streaming system itself does not provide correctness), and some time later a batch system rolls along and provides you with correct output.
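A toy illustration of that batch/speed split (all names and storage are made up; real deployments use e.g. Hadoop for the batch views and Storm or Spark Streaming plus a fast KV store for the speed layer):

```python
batch_view = {}   # exact counts, recomputed from the full log periodically
speed_view = {}   # incremental counts for events since the last batch run

def on_event(key):
    # Speed layer: low-latency, possibly inaccurate.
    speed_view[key] = speed_view.get(key, 0) + 1

def run_batch(full_log):
    # Batch layer: full recomputation over all data, correct by construction.
    global batch_view, speed_view
    view = {}
    for key in full_log:
        view[key] = view.get(key, 0) + 1
    batch_view, speed_view = view, {}   # batch output supersedes speed layer

def query(key):
    # Serving layer merges both views.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

for k in ["a", "b", "a"]:
    on_event(k)
print(query("a"))              # 2, from the speed layer
run_batch(["a", "b", "a"])
print(query("a"))              # 2, now from the exact batch view
```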
Hive provides a SQL-like query language (HiveQL) but does not offer interactive querying yet; it only runs batch processes on Hadoop.
HBase is a NoSQL key/value store that runs on top of HDFS. HBase operations run in real time against its tables rather than as MapReduce jobs.
HBase is a column-oriented datastore; it's all about performance.
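A minimal sketch of real-time reads and writes against HBase from Python, assuming an HBase Thrift server on localhost and the `happybase` package; table and column names are made up:

```python
import happybase

conn = happybase.Connection("localhost")
table = conn.table("metrics")

# Writes and reads hit HBase directly, in real time; no MapReduce involved.
table.put(b"host1#2017-01-01", {b"cf:cpu": b"0.87", b"cf:mem": b"0.42"})
print(table.row(b"host1#2017-01-01"))

# Row keys are stored sorted, so prefix/range scans are cheap.
for key, data in table.scan(row_prefix=b"host1#"):
    print(key, data)
```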
IMPALA
http://www.slideshare.net/cloudera/the-impala-cookbook-42530186
FLUME
https://habrahabr.ru/company/dca/blog/280386/
https://habrahabr.ru/company/dca/blog/281933/
https://www.linkedin.com/pulse/creating-data-pipeline-using-flume-kafka-spark-hive-mouzzam-hussain
https://www.cloudera.com/documentation/kafka/latest/topics/kafka_flume.html
http://datametica.com/integration-of-spark-streaming-with-flume/
Classes
https://habrahabr.ru/company/e_contenta/blog/276661/
https://university.cloudera.com/content/cca175 Cloudera Exam
http://www.cloudera.com/content/www/en-us/training/certification/cca-spark.html
http://www.cloudera.com/content/www/en-us/training/certification/ccp-data-engineer.html
http://hortonworks.com/products/hortonworks-sandbox/#install
https://cloudacademy.com/amazon-web-services/amazon-machine-learning-course/
http://jonathanmace.github.io/bigdatasurvey/
https://zeef.com/?query=Apache%20Spark&in=all
https://news.ycombinator.com/item?id=10772141
KAFKA
Kafka is a publish-subscribe messaging system. It decouples data producers from data consumers, and its intermediate storage and buffering help absorb high-velocity data.
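A minimal producer/consumer sketch with the `kafka-python` package, assuming a broker on localhost:9092; the topic name is made up:

```python
from kafka import KafkaConsumer, KafkaProducer

# The producer and consumer never talk to each other directly: the broker
# stores and buffers messages between them.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", key=b"user42", value=b'{"url": "/home"}')
producer.flush()

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    group_id="analytics",            # consumer group for parallel reads
    auto_offset_reset="earliest",    # start from the retained backlog
)
for msg in consumer:
    print(msg.key, msg.value, msg.offset)
    break
```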
https://www.infoq.com/presentations/etl-streams
http://www.jesse-anderson.com/2016/09/solving-the-first-and-last-mile-problem-with-kafka-part-1/
http://www.jesse-anderson.com/2016/09/solving-the-first-and-last-mile-problem-with-kafka-part-2/
https://www.semanticscholar.org/search?q=Apache%20Kafka
http://www.confluent.io/blog/building-a-streaming-analytics-stack-with-apache-kafka-and-druid
https://softwaremill.com/kafka-streams-how-does-it-fit-stream-landscape/
http://www.tutorialspoint.com/apache_kafka
http://sookocheff.com/post/kafka/kafka-in-a-nutshell/
http://mkuthan.github.io/blog/2016/01/29/spark-kafka-integration2/
https://player.oreilly.com/videos/9781491931028
DRUID
https://news.ycombinator.com/item?id=11400681
https://blog.codecentric.de/en/2016/08/realtime-fast-data-analytics-druid/
http://www.confluent.io/blog/building-a-streaming-analytics-stack-with-apache-kafka-and-druid
https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani
https://github.com/implydata/pivot UI for Druid
https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
http://imply.io/post/2016/07/05/exactly-once-streaming-ingestion.html
Online class
https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
https://www.dataquest.io/mission/123/introduction-to-spark/
https://www.dataquest.io/section/big-data-tools
http://www.planetcassandra.org/getting-started-with-apache-cassandra-and-java/
http://www.infoq.com/presentations/parquet
https://www.youtube.com/watch?v=3UfZN59Nsk8 Google DataFlow
DB engines
https://kylin.apache.org/
http://druid.io/ DRUID column-oriented distributed data store
http://habrahabr.ru/post/272267/ KUDU column-oriented distributed data store
http://geode.incubator.apache.org/ Geode
Apache Geode and Apache Ignite are more similar than they are different.
Apache Ignite is based on the commercial product GridGain http://habrahabr.ru/post/271475/
Apache Geode is based on the commercial product GemFire and has a long history in the market.
http://blog.brakmic.com/stream-processing-with-apache-flink/
https://www.youtube.com/watch?v=pqp7gLt_MFY
https://phdata.io/real-time-analytics-on-medical-device-data/
http://blog.acolyer.org/2015/04/27/musketeer-part-i-whats-the-best-data-processing-system/
http://thenewstack.io/building-streaming-data-hub-elasticsearch-kafka-cassandra/
https://news.ycombinator.com/item?id=10337154
https://courses.edx.org/courses/BerkeleyX/CS190.1x/1T2015/info
https://cloudacademy.com/amazon-web-services/courses/amazon-machine-learning
https://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-processing/
http://danoyoung.blogspot.com/2015/10/and-bobs-your-uncle.html
http://dataintensive.net/ BOOK
https://databricks.com/blog/2015/06/16/zen-and-the-art-of-spark-maintenance-with-cassandra.html
http://www.spark.tc/real-time-application-performance-profiling-using-spark/
http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html
http://habrahabr.ru/company/it-grad/blog/264549/ STORM KAFKA
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
http://thomaswdinsmore.com/2015/07/13/big-analytics-roundup-july-13-2015/
http://www.ibmbigdatahub.com/technology/hadoop-and-spark
https://news.ycombinator.com/item?id=8880259
Drill is a SQL engine and therefore in the same league as Apache Hive, Apache Tajo, or Cloudera's Impala.
Flink (and Spark) focus on use cases that exceed pure SQL (plus a few UDFs), such as graph processing, machine learning, and highly custom data flows.
In fact, the use cases of Spark and Flink overlap a bit. However, the technology used under the hood is quite different. Flink shares a lot of similarities with relational DBMSs: data is serialized into byte buffers and largely processed in binary representation, which also allows for fine-grained memory control. Flink uses a pipelined processing model and has a cost-based optimizer that selects execution strategies and avoids expensive partitioning and sorting steps. Moreover, Flink features a special kind of iteration (delta iterations) that can significantly reduce the amount of computation as iterations progress; the vertex-centric computing model of Pregel / Giraph is a special case of this.
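A toy version of the delta-iteration idea in plain Python (an illustration of the concept, not Flink's API): single-source shortest paths on a made-up graph, where each round only re-processes vertices whose value changed.

```python
graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}

dist = {v: float("inf") for v in graph}
dist["a"] = 0
workset = {"a"}                        # only changed vertices live here

while workset:
    next_workset = set()
    for v in workset:                  # a bulk iteration would scan every vertex
        for neighbor, w in graph[v]:
            if dist[v] + w < dist[neighbor]:
                dist[neighbor] = dist[v] + w
                next_workset.add(neighbor)   # propagate only actual changes
    workset = next_workset             # workset shrinks as we converge

print(dist)                            # {'a': 0, 'b': 1, 'c': 2}
```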