The row key is just another name for the PRIMARY KEY. It is the combination of all the partition and clustering fields, and it maps to exactly one row of data in a table, so a read or write against a particular row key accesses just one row.
The partitioner uses only the partition key fields: it generates a token (hash value) that determines which node in the cluster stores the partition. Individual rows are stored within partitions, so if there are no clustering columns, the partition holds a single row and the row key is the same as the partition key.
If you have clustering columns, then a partition can store multiple rows, and the row key is the partition key plus the clustering key.
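A minimal CQL sketch of this (table and column names are illustrative): sensor_id is the partition key and reading_time the clustering column, so the row key is the (sensor_id, reading_time) pair.

CREATE TABLE readings (
    sensor_id text,          -- partition key: hashed to a token to pick the node
    reading_time timestamp,  -- clustering column: orders rows within the partition
    value double,
    PRIMARY KEY (sensor_id, reading_time)
);

-- Supplying the full row key (partition key + clustering key) touches exactly one row:
SELECT value FROM readings WHERE sensor_id = 's1' AND reading_time = '2016-03-06 12:00:00';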
Cluster – a collection of nodes or data centers arranged in a ring architecture. A name must be assigned to every cluster, which is subsequently used by the participating nodes.
Keyspace – if you are coming from a relational database, the keyspace is the counterpart of a schema. The keyspace is the outermost container for data in Cassandra. The main attributes to set per keyspace are the replication factor, the replica placement strategy, and the column families.
Column Family – column families in Cassandra are like tables in relational databases. Each column family contains a collection of rows, which can be represented as Map<RowKey, SortedMap<ColumnKey, ColumnValue>>. The key gives the ability to access related data together.
Column – a column in Cassandra is a data structure that contains a column name, a value, and a timestamp. The columns and the number of columns in each row may vary, in contrast with a relational database, where data is well structured.
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
http://www.rubyscale.com/post/143067470585/basic-time-series-with-cassandra
http://rustyrazorblade.com/2016/05/working-relationally-with-cassandra/
Insert – {:key => 'metric-name', :column_name => TimeUUID(now), :column_value => 0.75}
Issue: rows become very wide if there are many data points; solution: shard data by day.
Insert – {:key => 'metric-name-20110306', :column_name => TimeUUID(now), :column_value => 0.75}
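The same day-sharding idea expressed in CQL (a sketch; table and column names are assumptions): the day shard becomes part of a composite partition key, and the TimeUUID becomes a clustering column, so no single partition grows unbounded.

CREATE TABLE metrics (
    metric_name text,
    day text,              -- e.g. '20110306': bounds how wide a partition can get
    event_time timeuuid,   -- clustering column keeps points time-ordered
    value double,
    PRIMARY KEY ((metric_name, day), event_time)
);

INSERT INTO metrics (metric_name, day, event_time, value)
VALUES ('metric-name', '20110306', now(), 0.75);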
Cassandra vs HBase
http://db-engines.com/en/system/Cassandra%3BHBase
http://www.datastax.com/wp-content/uploads/2012/09/WP-DataStax-HDFSvsCFS.pdf
The overall memory-and-disk data structure used by both Cassandra and HBase is more or less a log-structured merge tree. The disk component in Cassandra is the SSTable; in HBase it is the HFile.
Cassandra requires that you identify some nodes as seed nodes, which serve as concentration points for intercluster communication. Meanwhile, on HBase, you must press some nodes into serving as master nodes, whose job it is to monitor and coordinate the actions of region servers. Thus, Cassandra guarantees high availability by allowing multiple seed nodes in a cluster, while HBase guarantees the same via standby master nodes -- one of which will become the new master should the current master fail.
Cassandra uses the Gossip protocol for internode communications, and Gossip services are integrated with the Cassandra software. HBase relies on Zookeeper -- an entirely separate distributed application -- to handle corresponding tasks. While HBase ships with a Zookeeper installation, nothing stops you from using a pre-existing Zookeeper ensemble with an HBase database.
Meanwhile, though Cassandra is described as having "eventual" consistency, both read and write consistency can be tuned, not only by level, but in extent. That is, you can configure not only how many replica nodes must successfully complete the operation before it is acknowledged, but also whether the participating replica nodes span data centers.
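In cqlsh, the per-request consistency level is set with the CONSISTENCY command (a sketch; the users table is the illustrative one defined further below):

CONSISTENCY QUORUM;        -- a majority of replicas must acknowledge
SELECT * FROM users WHERE id = 'u1';
CONSISTENCY LOCAL_QUORUM;  -- quorum within the local data center only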
Further, Cassandra has added lightweight transactions to its repertoire. Cassandra's lightweight transaction is a "compare and set" mechanism roughly comparable to HBase's "check and put" capability; HBase also has a "read-check-delete" operation for which Cassandra has no counterpart. Finally, Cassandra's 2.0 release added row-level write isolation: If a client updates multiple columns in a row, other clients will see either none of the updates or all of the updates.
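Lightweight transactions surface in CQL as IF clauses (a minimal sketch; the users table is illustrative):

-- "compare and set": insert only if the row does not already exist
INSERT INTO users (id, name) VALUES ('u1', 'Ann') IF NOT EXISTS;

-- conditional update: applied only if the current value matches
UPDATE users SET name = 'Anna' WHERE id = 'u1' IF name = 'Ann';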
In both Cassandra and HBase, the primary index is the row key, but data is stored on disk such that column family members are kept in close proximity to one another. It is therefore important to carefully plan the organization of column families. To keep query performance high, columns with similar access patterns should be placed in the same column family. Cassandra lets you create additional, secondary indexes on column values. This can improve data access in columns whose values have a high level of repetition -- such as a column that stores the state field of a customer's mailing address. HBase lacks built-in support for secondary indexes, but offers a number of mechanisms that provide secondary index functionality. These are described in HBase's online reference guide and on HBase community blogs.
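In CQL a secondary index is created per column (a sketch; the customers table and its state column stand in for the mailing-address example above):

CREATE INDEX ON customers (state);
-- The index makes this low-cardinality predicate queryable:
SELECT * FROM customers WHERE state = 'CA';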
While neither Cassandra nor HBase support real transactions, both provide some level of consistency control. HBase gives you strong record-level (that is, row-level) consistency. In fact, HBase supports ACID-level semantics on a per-row basis. Also, you can lock a row in HBase, though this is not encouraged, not only because it hampers concurrency, but also because a row lock will not survive a region split operation. In addition, HBase has a "check and put" operation, which provides atomic "read-modify-write" semantics on a single data element.
http://blog.parsely.com/post/1928/cass/
CQL supports prepared statements.
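In CQL text, the bind markers of a prepared statement are written as ? (a sketch; the actual prepare/bind/execute calls live in the driver, and the users table is the one defined below):

SELECT name, favs FROM users WHERE id = ?;
INSERT INTO users (id, name) VALUES (?, ?);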
CQL supports three kinds of collections: maps, sets, and lists.
Collections are meant for storing/denormalizing relatively small amounts of data. They work well for things like "the phone numbers of a given user", "labels applied to an email", etc. But when items are expected to grow unbounded ("all messages sent by a user", "events registered by a sensor"...), then collections are not appropriate and a dedicated table (with clustering columns) should be used. Concretely, (non-frozen) collections have the following noteworthy characteristics and limitations:
Individual collections are not indexed internally, which means that even to access a single element of a collection, the whole collection has to be read (and reading one is not paged internally).
While insertion operations on sets and maps never incur a read-before-write internally, some operations on lists do. Further, some list operations are not idempotent by nature (see the section on lists below for details), making their retry in case of timeout problematic. It is thus advised to prefer sets over lists when possible (see the update sketch after the table below).
It is an anti-pattern to use a (single) collection to store large amounts of data.
CREATE TABLE users (
    id text PRIMARY KEY,
    name text,
    favs map<text, text>,  // a map of text keys to text values
    tags set<text>,
    scores list<int>       // lists: prefer a set when possible (see note above)
);
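Collection updates in CQL against the table above (a sketch): set and map mutations never read before writing, while list appends are not idempotent and index-based list operations incur an internal read.

UPDATE users SET favs['color'] = 'blue' WHERE id = 'u1';       -- map put: no read-before-write
UPDATE users SET tags = tags + {'cassandra'} WHERE id = 'u1';  -- set add: no read-before-write
UPDATE users SET scores = scores + [42] WHERE id = 'u1';       -- list append: retry may duplicate
UPDATE users SET scores[0] = 7 WHERE id = 'u1';                -- set by index: reads the list first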
keyspace
primary key
partition key
https://michael.mior.ca/projects/NoSE/
https://opencredo.com/new-blog-cassandra-what-you-may-learn-the-hard-way/
http://techblogsearch.com/a/stream-processing-with-spring-kafka-spark-and-cassandra-part-5.html
http://christopher-batey.blogspot.com/2014/12/getting-started-cassandra-spark-with.html
https://github.com/paulovn/ml-vm-notebook
http://www.confluent.io/blog/how-to-build-a-scalable-etl-pipeline-with-kafka-connect
http://neovintage.org/2016/04/07/data-modeling-in-cassandra-from-a-postgres-perspective/
http://www.grokit.ca/cnt/ApacheHadoop/
http://dtrapezoid.com/time-series-data-modeling-for-medical-devices.html
http://www.planetcassandra.org/
https://github.com/Vijay2win Vijay Parthasarathy
https://www.youtube.com/watch?v=YzBzUsbcAsY
http://www.planetcassandra.org/try-cassandra/
https://academy.datastax.com/demos/getting-started-apache-cassandra-and-java-part-i
https://academy.datastax.com/courses/
https://academy.datastax.com/demos/getting-started-apache-cassandra-and-python-part-i
http://exponential.io/blog/2015/01/08/cassandra-terminology/
http://blog.threatstack.com/scaling-cassandra-lessons-learned
http://exponential.io/blog/2015/01/28/install-cassandra-2_1-on-mac-os-x/
http://www.slideshare.net/alimenkou/high-performance-queues-with-cassandra
How to contribute:
https://wiki.apache.org/cassandra/HowToContribute
https://wiki.apache.org/cassandra/HowToBuild
https://wiki.apache.org/cassandra/RunningCassandraInIDEA
http://www.planetcassandra.org/committing-code-to-apache-cassandra/
http://www.slideshare.net/yukim/cassandrasummit2013
https://www.youtube.com/watch?v=W45Ysb9b6oE Cassandra data modelling
In Cassandra, INSERT = UPDATE = UPSERT, which avoids reads before writes.
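A quick sketch of upsert semantics (names are illustrative): both statements create the row if it is missing and overwrite it if it exists, and neither reads first.

INSERT INTO users (id, name) VALUES ('u1', 'Ann');  -- creates or overwrites
UPDATE users SET name = 'Ann' WHERE id = 'u2';      -- also creates the row if it is absent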
With a COMPOSITE primary key, the "first part" of the key is called the PARTITION KEY (in this example key_part_one is the partition key) and the second part is the CLUSTERING KEY (key_part_two).
The partition key is the minimum specifier needed to perform a query with a WHERE clause. If you have a composite partition key, like the following,
e.g.: PRIMARY KEY ((col1, col2), col10, col4)
you can perform a query only by passing at least both col1 and col2; these are the two columns that define the partition key.
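A sketch with the same key shape (table and column names are illustrative):

CREATE TABLE example (
    col1 text, col2 text, col10 text, col4 text, value text,
    PRIMARY KEY ((col1, col2), col10, col4)
);

-- Valid: the full composite partition key is supplied.
SELECT * FROM example WHERE col1 = 'a' AND col2 = 'b';

-- Rejected without ALLOW FILTERING: only part of the partition key is given.
-- SELECT * FROM example WHERE col1 = 'a';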
cqlsh> CREATE KEYSPACE mykeyspace
... WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
cqlsh:mykeyspace> SELECT * FROM system.schema_keyspaces;
keyspace_name | durable_writes | strategy_class | strategy_options
--------------------+----------------+---------------------------------------------+----------------------------
system_auth | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"}
system_distributed | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"3"}
system | True | org.apache.cassandra.locator.LocalStrategy | {}
mykeyspace | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"}
system_traces | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"2"}
http://habrahabr.ru/post/205176/
http://db-engines.com/en/system/Cassandra%3BRedis
http://habrahabr.ru/company/lifestreet/blog/146115/#habracut Cassandra
https://www.instaclustr.com/common-cassandra-data-modelling-traps
http://blogs.atlassian.com/2013/09/do-you-know-cassandra/
http://habrahabr.ru/post/155115/ Cassandra
To distribute data, Cassandra uses a technique known as consistent hashing. This approach spreads data across the nodes in such a way that when a node is added or removed, only a small amount of data has to be moved. To achieve this, each node is assigned a token, which splits the range of all MD5 key values into pieces. Since RandomPartitioner is used in most cases, let's consider it. As mentioned above, RandomPartitioner computes a 128-bit MD5 hash for each key. To determine which nodes will store the data, the node tokens are simply walked from smallest to largest, and when a token's value becomes greater than the key's MD5 value, that node, together with some number of subsequent nodes (in token order), is selected for storage.
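The partitioner's output can be inspected directly from CQL with the built-in token() function (a sketch against the illustrative users table; modern clusters default to Murmur3Partitioner rather than RandomPartitioner, but the token mechanics are the same):

SELECT token(id), id, name FROM users;  -- the token that placed each partition

-- Token-range scans walk the ring segment by segment (how bulk readers page through a table):
SELECT * FROM users WHERE token(id) > 0 AND token(id) <= 3074457345618258602;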
This post focuses on the two layers with the most confusing terminology:
CQL (logical)
Storage (physical)
CQL terminology should be focused on explaining the CQL abstraction layer. Some terms, such as “partition key” are used in both the CQL layer and the storage layer. However, a consistent use of the term to refer to the same thing in both layers will help avoid ambiguity.
The physical layer is how Cassandra actually stores data on disk. Understanding the physical layer is an important part of performance tuning and data modeling in Cassandra.
It's popular to show a table that maps CQL concepts to relational concepts. The mapping of SQL to CQL is designed to ease SQL developers into the NoSQL world of Cassandra.
However, the use of SQL-like terminology in CQL can confuse matters as many terms have very different meaning in SQL vs. CQL. I have found that Cassandra works more like a database that has only materialized views than it does like a database with relational tables.
FiloDB
http://www.planetcassandra.org/blog/introducing-filodb/
Doradus
https://github.com/dell-oss/Doradus
Doradus is a server framework that runs on top of Cassandra. To build Doradus, the team borrowed from several well-accepted paradigms. They used traditional OLAP techniques to allow data to be arranged into static, multidimensional cubes. They leveraged the vertical orientation and efficient compression of columnar databases. And, from the NoSQL world, they employed sharding. The result: a storage and query engine called Doradus OLAP that stores data at up to 1M objects/second/node, providing near real-time data warehousing. This architecture also allows for extreme compression of the data, sometimes producing up to a 99% reduction in space usage.