The row key is just another name for the PRIMARY KEY. It is the combination of all the partition and clustering fields, and it maps to exactly one row of data in a table, so a read or write against a particular row key accesses just one row.
The partitioner uses only the partition key fields: it generates a token (hash value) that determines which node in the cluster stores the partition. Individual rows are stored within partitions, so if there are no clustering columns, the partition holds a single row and the row key is the same as the partition key.
If you have clustering columns, then a partition can store multiple rows, and the row key is the partition key plus the clustering key.
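A minimal CQL sketch of this (table and column names are illustrative): sensor_id is the partition key and reading_time the clustering column, so the row key is the (sensor_id, reading_time) pair.

CREATE TABLE readings (
    sensor_id text,          -- partition key: hashed to a token to pick the node
    reading_time timestamp,  -- clustering column: orders rows within the partition
    value double,
    PRIMARY KEY (sensor_id, reading_time)
);

-- Supplying the full row key (partition key + clustering key) touches exactly one row:
SELECT value FROM readings WHERE sensor_id = 's1' AND reading_time = '2016-03-06 12:00:00';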
Cluster – a collection of nodes or data centers arranged in a ring architecture. A name must be assigned to every cluster, which is subsequently used by the participating nodes.
Keyspace – if you are coming from a relational database, the keyspace is the counterpart of a schema. The keyspace is the outermost container for data in Cassandra. The main attributes to set per keyspace are the replication factor, the replica placement strategy, and the column families.
Column Family – column families in Cassandra are like tables in relational databases. Each column family contains a collection of rows, which can be represented as Map<RowKey, SortedMap<ColumnKey, ColumnValue>>. The key gives the ability to access related data together.
Column – a column in Cassandra is a data structure that contains a column name, a value, and a timestamp. The columns and the number of columns in each row may vary, in contrast with a relational database, where data is well structured.
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
http://www.rubyscale.com/post/143067470585/basic-time-series-with-cassandra
http://rustyrazorblade.com/2016/05/working-relationally-with-cassandra/
Insert – {:key => 'metric-name', :column_name => TimeUUID(now), :column_value => 0.75}
Issue: rows become very wide if there are many data points; solution: shard data by day.
Insert – {:key => 'metric-name-20110306', :column_name => TimeUUID(now), :column_value => 0.75}
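The same day-sharding idea expressed in CQL (a sketch; table and column names are assumptions): the day shard becomes part of a composite partition key, and the TimeUUID becomes a clustering column, so no single partition grows unbounded.

CREATE TABLE metrics (
    metric_name text,
    day text,              -- e.g. '20110306': bounds how wide a partition can get
    event_time timeuuid,   -- clustering column keeps points time-ordered
    value double,
    PRIMARY KEY ((metric_name, day), event_time)
);

INSERT INTO metrics (metric_name, day, event_time, value)
VALUES ('metric-name', '20110306', now(), 0.75);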
Cassandra vs HBase
http://db-engines.com/en/system/Cassandra%3BHBase
http://www.datastax.com/wp-content/uploads/2012/09/WP-DataStax-HDFSvsCFS.pdf
The overall memory-and-disk data structure used by both Cassandra and HBase is more or less a log-structured merge tree. The disk component in Cassandra is the SSTable; in HBase it is the HFile.
Cassandra requires that you identify some nodes as seed nodes, which serve as concentration points for intercluster communication. Meanwhile, on HBase, you must press some nodes into serving as master nodes, whose job it is to monitor and coordinate the actions of region servers. Thus, Cassandra guarantees high availability by allowing multiple seed nodes in a cluster, while HBase guarantees the same via standby master nodes -- one of which will become the new master should the current master fail.
Cassandra uses the Gossip protocol for internode communications, and Gossip services are integrated with the Cassandra software. HBase relies on Zookeeper -- an entirely separate distributed application -- to handle corresponding tasks. While HBase ships with a Zookeeper installation, nothing stops you from using a pre-existing Zookeeper ensemble with an HBase database.
Meanwhile, though Cassandra is described as having "eventual" consistency, both read and write consistency can be tuned, not only by level, but in extent. That is, you can configure not only how many replica nodes must successfully complete the operation before it is acknowledged, but also whether the participating replica nodes span data centers.
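In cqlsh, the per-request consistency level is set with the CONSISTENCY command (a sketch; the users table is the illustrative one defined further below):

CONSISTENCY QUORUM;        -- a majority of replicas must acknowledge
SELECT * FROM users WHERE id = 'u1';
CONSISTENCY LOCAL_QUORUM;  -- quorum within the local data center only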
Further, Cassandra has added lightweight transactions to its repertoire. Cassandra's lightweight transaction is a "compare and set" mechanism roughly comparable to HBase's "check and put" capability; HBase also has a "read-check-delete" operation for which Cassandra has no counterpart. Finally, Cassandra's 2.0 release added row-level write isolation: If a client updates multiple columns in a row, other clients will see either none of the updates or all of the updates.
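Lightweight transactions surface in CQL as IF clauses (a minimal sketch; the users table is illustrative):

-- "compare and set": insert only if the row does not already exist
INSERT INTO users (id, name) VALUES ('u1', 'Ann') IF NOT EXISTS;

-- conditional update: applied only if the current value matches
UPDATE users SET name = 'Anna' WHERE id = 'u1' IF name = 'Ann';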
In both Cassandra and HBase, the primary index is the row key, but data is stored on disk such that column family members are kept in close proximity to one another. It is therefore important to carefully plan the organization of column families. To keep query performance high, columns with similar access patterns should be placed in the same column family. Cassandra lets you create additional, secondary indexes on column values. This can improve data access in columns whose values have a high level of repetition -- such as a column that stores the state field of a customer's mailing address. HBase lacks built-in support for secondary indexes, but offers a number of mechanisms that provide secondary index functionality. These are described in HBase's online reference guide and on HBase community blogs.
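In CQL a secondary index is created per column (a sketch; the customers table and its state column stand in for the mailing-address example above):

CREATE INDEX ON customers (state);
-- The index makes this low-cardinality predicate queryable:
SELECT * FROM customers WHERE state = 'CA';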
While neither Cassandra nor HBase support real transactions, both provide some level of consistency control. HBase gives you strong record-level (that is, row-level) consistency. In fact, HBase supports ACID-level semantics on a per-row basis. Also, you can lock a row in HBase, though this is not encouraged, not only because it hampers concurrency, but also because a row lock will not survive a region split operation. In addition, HBase has a "check and put" operation, which provides atomic "read-modify-write" semantics on a single data element.
http://blog.parsely.com/post/1928/cass/
CQL supports prepared statements.
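In CQL text, the bind markers of a prepared statement are written as ? (a sketch; the actual prepare/bind/execute calls live in the driver, and the users table is the one defined below):

SELECT name, favs FROM users WHERE id = ?;
INSERT INTO users (id, name) VALUES (?, ?);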
CQL supports three kinds of collections: maps, sets, and lists.
Collections are meant for storing/denormalizing relatively small amounts of data. They work well for things like "the phone numbers of a given user", "labels applied to an email", etc. But when items are expected to grow unbounded ("all messages sent by a user", "events registered by a sensor"...), then collections are not appropriate and a dedicated table (with clustering columns) should be used. Concretely, (non-frozen) collections have the following noteworthy characteristics and limitations:
Individual collections are not indexed internally, which means that even to access a single element of a collection, the whole collection has to be read (and reading one is not paged internally).
While insertion operations on sets and maps never incur a read-before-write internally, some operations on lists do. Further, some list operations are not idempotent by nature (see the section on lists below for details), making their retry in case of timeout problematic. It is thus advised to prefer sets over lists when possible (see the update sketch after the table below).
It is an anti-pattern to use a (single) collection to store large amounts of data.
CREATE TABLE users (
    id text PRIMARY KEY,
    name text,
    favs map<text, text>,  // a map of text keys to text values
    tags set<text>,
    scores list<int>       // lists: prefer a set when possible (see note above)
);
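Collection updates in CQL against the table above (a sketch): set and map mutations never read before writing, while list appends are not idempotent and index-based list operations incur an internal read.

UPDATE users SET favs['color'] = 'blue' WHERE id = 'u1';       -- map put: no read-before-write
UPDATE users SET tags = tags + {'cassandra'} WHERE id = 'u1';  -- set add: no read-before-write
UPDATE users SET scores = scores + [42] WHERE id = 'u1';       -- list append: retry may duplicate
UPDATE users SET scores[0] = 7 WHERE id = 'u1';                -- set by index: reads the list first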
keyspace
primary key
partition key
https://michael.mior.ca/projects/NoSE/
https://opencredo.com/new-blog-cassandra-what-you-may-learn-the-hard-way/
http://techblogsearch.com/a/stream-processing-with-spring-kafka-spark-and-cassandra-part-5.html
http://christopher-batey.blogspot.com/2014/12/getting-started-cassandra-spark-with.html
https://github.com/paulovn/ml-vm-notebook
http://www.confluent.io/blog/how-to-build-a-scalable-etl-pipeline-with-kafka-connect
http://neovintage.org/2016/04/07/data-modeling-in-cassandra-from-a-postgres-perspective/
http://www.grokit.ca/cnt/ApacheHadoop/
http://dtrapezoid.com/time-series-data-modeling-for-medical-devices.html
http://www.planetcassandra.org/
https://github.com/Vijay2win Vijay Parthasarathy
https://www.youtube.com/watch?v=YzBzUsbcAsY
http://www.planetcassandra.org/try-cassandra/
https://academy.datastax.com/demos/getting-started-apache-cassandra-and-java-part-i
https://academy.datastax.com/courses/
https://academy.datastax.com/demos/getting-started-apache-cassandra-and-python-part-i
http://exponential.io/blog/2015/01/08/cassandra-terminology/
http://blog.threatstack.com/scaling-cassandra-lessons-learned
http://exponential.io/blog/2015/01/28/install-cassandra-2_1-on-mac-os-x/
http://www.slideshare.net/alimenkou/high-performance-queues-with-cassandra
How to contribute:
https://wiki.apache.org/cassandra/HowToContribute
https://wiki.apache.org/cassandra/HowToBuild
https://wiki.apache.org/cassandra/RunningCassandraInIDEA
http://www.planetcassandra.org/committing-code-to-apache-cassandra/
http://www.slideshare.net/yukim/cassandrasummit2013
https://www.youtube.com/watch?v=W45Ysb9b6oE Cassandra data modelling
In Cassandra, INSERT = UPDATE = UPSERT, which avoids reads before writes.
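A quick sketch of upsert semantics (names are illustrative): both statements create the row if it is missing and overwrite it if it exists, and neither reads first.

INSERT INTO users (id, name) VALUES ('u1', 'Ann');  -- creates or overwrites
UPDATE users SET name = 'Ann' WHERE id = 'u2';      -- also creates the row if it is absent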
With a COMPOSITE primary key, the "first part" of the key is called the PARTITION KEY (in this example key_part_one is the partition key) and the second part is the CLUSTERING KEY (key_part_two).
The partition key is the minimum specifier needed to perform a query with a WHERE clause. If you have a composite partition key, like the following,
e.g.: PRIMARY KEY ((col1, col2), col10, col4)
you can perform a query only by passing at least both col1 and col2; these are the two columns that define the partition key.
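A sketch with the same key shape (table and column names are illustrative):

CREATE TABLE example (
    col1 text, col2 text, col10 text, col4 text, value text,
    PRIMARY KEY ((col1, col2), col10, col4)
);

-- Valid: the full composite partition key is supplied.
SELECT * FROM example WHERE col1 = 'a' AND col2 = 'b';

-- Rejected without ALLOW FILTERING: only part of the partition key is given.
-- SELECT * FROM example WHERE col1 = 'a';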
cqlsh> CREATE KEYSPACE mykeyspace
... WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
cqlsh:mykeyspace> SELECT * FROM system.schema_keyspaces;
keyspace_name | durable_writes | strategy_class | strategy_options
--------------------+----------------+---------------------------------------------+----------------------------
system_auth | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"}
system_distributed | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"3"}
system | True | org.apache.cassandra.locator.LocalStrategy | {}
mykeyspace | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"}
system_traces | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"2"}
http://habrahabr.ru/post/205176/
http://db-engines.com/en/system/Cassandra%3BRedis
http://habrahabr.ru/company/lifestreet/blog/146115/#habracut Cassandra
https://www.instaclustr.com/common-cassandra-data-modelling-traps
http://blogs.atlassian.com/2013/09/do-you-know-cassandra/
http://habrahabr.ru/post/155115/ Cassandra
To distribute data, Cassandra uses a technique known as consistent hashing. This approach spreads data across the nodes in such a way that when a node is added or removed, only a small amount of data has to be moved. To achieve this, each node is assigned a token, which splits the range of all MD5 key values into pieces. Since RandomPartitioner is used in most cases, let's consider it. As mentioned above, RandomPartitioner computes a 128-bit MD5 hash for each key. To determine which nodes will store the data, the node tokens are simply walked from smallest to largest, and when a token's value becomes greater than the key's MD5 value, that node, together with some number of subsequent nodes (in token order), is selected for storage.
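The partitioner's output can be inspected directly from CQL with the built-in token() function (a sketch against the illustrative users table; modern clusters default to Murmur3Partitioner rather than RandomPartitioner, but the token mechanics are the same):

SELECT token(id), id, name FROM users;  -- the token that placed each partition

-- Token-range scans walk the ring segment by segment (how bulk readers page through a table):
SELECT * FROM users WHERE token(id) > 0 AND token(id) <= 3074457345618258602;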
This post focuses on the two layers with the most confusing terminology:
CQL (logical)
Storage (physical)
CQL terminology should be focused on explaining the CQL abstraction layer. Some terms, such as “partition key” are used in both the CQL layer and the storage layer. However, a consistent use of the term to refer to the same thing in both layers will help avoid ambiguity.
The physical layer is how Cassandra actually stores data on disk. Understanding the physical layer is an important part of performance tuning and data modeling in Cassandra.
It's popular to show a table that maps CQL concepts to relational concepts. The mapping of SQL to CQL is designed to ease SQL developers into the NoSQL world of Cassandra.
However, the use of SQL-like terminology in CQL can confuse matters as many terms have very different meaning in SQL vs. CQL. I have found that Cassandra works more like a database that has only materialized views than it does like a database with relational tables.
FiloDB
http://www.planetcassandra.org/blog/introducing-filodb/
Doradus
https://github.com/dell-oss/Doradus
Doradus is a server framework that runs on top of Cassandra. To build Doradus, the team borrowed from several well-accepted paradigms. They used traditional OLAP techniques to allow data to be arranged into static, multidimensional cubes. They leveraged the vertical orientation and efficient compression of columnar databases. And, from the NoSQL world, they employed sharding. The result: a storage and query engine called Doradus OLAP that stores data at up to 1M objects/second/node, providing near real-time data warehousing. This architecture also allows for extreme compression of the data, sometimes producing up to a 99% reduction in space usage.