http://hortonworks.com/blog/apache-hive-vs-apache-impala-query-performance-comparison/
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_langref_unsupported.html
https://www.safaribooksonline.com/library/view/hbase-design-patterns/9781783981045/ch05.html
http://stackoverflow.com/questions/13911501/when-to-use-hadoop-hbase-hive-and-pig
https://www.quora.com/How-can-we-represent-a-logical-data-model-for-HBase-and-Hive
http://www.slideshare.net/DouglasMoore/douglas-moore-strata-ny-2014-big-data-anti-patterns-v9-dm
http://www.grokit.ca/cnt/ApacheHadoop/ Kafka, ZooKeeper, HDFS and Cassandra.
http://www.qubole.com/resources/hive-and-hadoop-tutorial-and-training-resources/
https://habrahabr.ru/company/dca/blog/270453/
HCatalog is a storage management tool that enables frameworks other than Hive to leverage Hive's data model to read and write data. HCatalog tables provide an abstraction over the data format in HDFS and allow frameworks such as Pig and MapReduce to use the data without being concerned with the underlying format, such as RC, ORC, or text files. HCatInputFormat and HCatOutputFormat, implementations of Hadoop InputFormat and OutputFormat, are the interfaces provided to Pig and MapReduce.
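A minimal sketch of pointing a MapReduce job at an HCatalog-managed table; the database and table names are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class HCatReadJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "read-via-hcatalog");
    // HCatalog resolves the table's on-disk format (text, RC, ORC, ...) from the
    // Hive metastore, so the job never hard-codes a storage format.
    HCatInputFormat.setInput(job, "default", "web_logs");  // hypothetical table
    job.setInputFormatClass(HCatInputFormat.class);
    // ... configure the mapper (values arrive as HCatRecord), output, then:
    // job.waitForCompletion(true);
  }
}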
HIVE
https://cwiki.apache.org/confluence/display/Hive/Home
http://www.slideshare.net/oom65/optimize-hivequeriespptx
https://qubole.zendesk.com/hc/en-us/articles/208693196-Reference-Hive-Tuning
https://habrahabr.ru/company/dca/blog/283212/
https://habrahabr.ru/company/dca/blog/305838/
https://www.qubole.com/resources/cheatsheet/hive-function-cheat-sheet/
The default execution engine is MapReduce, but Hive also supports Tez and Spark.
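The engine can be switched per session (hive.execution.engine is the standard setting):
set hive.execution.engine=tez;   -- or 'mr' (the classic default) or 'spark'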
External tables: dropping the table deletes only the metadata, not the underlying HDFS files.
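A minimal sketch, assuming a hypothetical HDFS directory /data/logs:
CREATE EXTERNAL TABLE logs (line STRING)
LOCATION '/data/logs';
DROP TABLE logs;   -- removes the metastore entry; the files in /data/logs remain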
Hive supports several join types (selectable via SQL hints):
map join (also known as hash join)
bucket join
sort-merge bucket join
regular (shuffle) join
Use EXPLAIN to see the actual plan.
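A sketch of the MAPJOIN hint plus EXPLAIN (table and column names are hypothetical):
EXPLAIN
SELECT /*+ MAPJOIN(d) */ e.name, d.dept_name
FROM employees e JOIN departments d ON (e.dept_id = d.id);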
Hive provides a set of built-in SerDes and also allows users to create custom SerDes based on their data definition. These are as follows:
LazySimpleSerDe
RegexSerDe
AvroSerDe
OrcSerde
ParquetHiveSerDe
JSONSerDe
CSVSerDe
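A SerDe is attached at table creation; a sketch using the built-in RegexSerDe (the regex and columns are illustrative):
CREATE TABLE apache_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)");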
A table in Hive is analogous to a table in a classical relational database. The main difference is that the data of Hive tables is stored simply as ordinary files on HDFS. These can be plain text CSV files, binary sequence files, more complex columnar Parquet files, and other formats.
Hive provides the following set of analytical functions:
RANK
DENSE_RANK
ROW_NUMBER
PERCENT_RANK
CUME_DIST
NTILE
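A sketch of a few of these in a windowed query (table and columns are hypothetical):
SELECT name, dept, salary,
  RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk,
  DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dense_rnk,
  NTILE(4) OVER (PARTITION BY dept ORDER BY salary DESC) AS quartile
FROM employees;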
Hive runs on top of MapReduce, so don't expect interactivity ...
Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data.
Hive supports several complex collection data types, such as arrays, maps, and structs, as table column data types. For example:
CREATE TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
PARTITIONED BY (country STRING, state STRING);
Clustering, sorting, and bucketing clauses attach the same way; a sketch on a separate stocks table (the columns are illustrative):
CREATE EXTERNAL TABLE stocks (
  exchange STRING,
  symbol STRING,
  ymd STRING,
  price_close FLOAT)
CLUSTERED BY (exchange, symbol)
SORTED BY (ymd ASC)
INTO 96 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';
Partitioning tables changes how Hive structures the data storage. If we create the employees table in the mydb database, there will still be an employees directory for the table:
hdfs://master_server/user/hive/warehouse/mydb.db/employees
However, Hive will now create subdirectories reflecting the partitioning structure. For example:
.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
SORT BY orders the data only within each reducer, performing a local ordering: each reducer's output will be sorted, but there is no total order across reducers. Total ordering is traded for better performance. For example:
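-- SORT BY sorts within each reducer; ORDER BY forces a single total order
-- through one reducer (employees table from the example above)
SELECT * FROM employees SORT BY name;   -- locally sorted per reducer, scales out
SELECT * FROM employees ORDER BY name;  -- totally sorted, single reducer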
Hive contains several built-in functions to manipulate arrays and maps. One example is the explode() function, which outputs the items of an array or a map as separate rows (a sketch follows). You can use the explain command to view the execution plan of a Hive query. In addition to simple text files, Hive also supports several other binary storage formats that can be used to store the underlying data of the tables. These include row-based storage formats, such as Hadoop SequenceFiles and Avro files, as well as column-based (columnar) storage formats, such as ORC files and Parquet.
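A minimal sketch of explode() via LATERAL VIEW, using the employees table defined above:
SELECT name, sub
FROM employees
LATERAL VIEW explode(subordinates) subs AS sub;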
http://www.larsgeorge.com/2009/10/hive-vs-pig.html
http://www.infoq.com/presentations/Facebook-Hive-Hadoop
HBASE
https://hbase.apache.org/book.html
https://www.amazon.com/HBase-Design-Patterns-Mark-Kerzner/dp/1783981040
http://jimbojw.com/wiki/index.php?title=Understanding_HBase_column-family_performance_options
HBase vs Cassandra, plus OpenTSDB for time series databases
https://habrahabr.ru/company/badoo/blog/324510/
All row keys are sorted; each region stores a range of the sorted keys; each region is pinned to a region server (i.e., a node in the cluster).
Unlike Hive, HBase is designed for fast access to smaller, more specific data sets. Rather than churning through lots of data with bulky MapReduce jobs, it focuses on writing lots of data fast and reading small amounts very fast.
HBase is superior at range scans and high-cardinality lookups. It is typically deployed for applications requiring fast lookup of data and for data-merging use cases, and is often used with BI aggregation tools such as Apache Kylin.
The row key in an HBase model is the only native way of sorting and indexing data. This key is also used to split data into regions, in a similar way to how partitions are created in a relational table.
An HBase table consists of rows, which are identified by a row key.
Each row has an arbitrary (potentially very large) number of columns.
Columns are grouped into column families; the column family defines how its columns are stored (skipping column families that are not read is an optimization).
Each (row, column) combination can have multiple versions of the data, identified by timestamp.
There are no data types in HBase: values are just one or more bytes.
Data compression happens at the column family level, so schema design needs to take this into account. It is good to keep column family names and qualifiers short, because they are repeated for every row of data; this reduces the data stored and read by HBase. The number of column families should be kept to a bare minimum to keep the number of HFiles low: using the fewest column families reduces disk space consumed and improves load time.
row: contains a row key and one or more columns with values associated with them
column: consists of a column family and a column qualifier, delimited by a : (colon) character
column family: a set of columns and their values; column families should be considered carefully during schema design. All elements of a column family are stored together in a single HFile, so it is important to limit the number of column families to a relatively small amount
column qualifier: added to a column family to provide the index for a given piece of data
cell: a combination of row, column family, and column qualifier; contains a value and a timestamp, which represents the value's version
timestamp: represents the time on the RegionServer when the data was written, but you can specify a different timestamp value when you put data into the cell
The key in HBase is actually made up of several parts: row key, column family, column, and timestamp. The timestamp is the killer feature of HBase: it provides a way to store several versions of a value, which makes HBase a good choice for storing time series data. The key-value pair model now looks like this:
(row key, column family, column, timestamp) -> value
An HBase table corresponds to the following Java-style data structure (schematically, a sorted map of maps):
SortedMap<RowKey, SortedMap<ColumnFamily, SortedMap<ColumnQualifier, SortedMap<Timestamp, Value>>>>
Row keys are lexicographically sorted.
Region: although HBase tables host billions of rows, it is not possible to store all of that together, so a table is divided into chunks. A region is responsible for holding a set of rows, and it hosts a continuous range of rows. When you keep adding rows and a region's size grows beyond a threshold value, it automatically splits into two regions.
RegionServer: regions are hosted by a RegionServer. A RegionServer can host many regions, but a region is always hosted by one and only one RegionServer.
Example:
Create table 'myTable' with a column family 'play':
create 'myTable', {NAME=> 'play', VERSIONS=>org.apache.hadoop.hbase.HConstants::ALL_VERSIONS}
describe 'myTable'
DESCRIPTION ENABLED
'myTable', {NAME => 'play',
DATA_BLOCK_ENCODING => 'NONE',
BLOOMFILTER => 'NONE',
REPLICATION_SCOPE => '0',
VERSIONS => '2147483647',
COMPRESSION => 'NONE',
MIN_VERSIONS => '0',
TTL => '2147483647',
KEEP_DELETED_CELLS => 'false',
BLOCKSIZE => '65536',
IN_MEMORY => 'false',
ENCODE_ON_DISK => 'true',
BLOCKCACHE => 'true'}
count 'myTable'
To define the schema, several properties of HBase tables have to be taken into account:
1. Indexing is done only on the key.
2. Tables are stored sorted by row key. Each region in the table is responsible for a part of the row key space and is identified by its start and end row keys. The region contains a sorted list of rows from the start key to the end key.
3. Everything in HBase tables is stored as a byte[]. There are no types.
4. Atomicity is guaranteed only at the row level. There is no atomicity guarantee across rows, which means that there are no multi-row transactions.
5. Column families have to be defined up front, at table creation time.
6. Column qualifiers are dynamic and can be defined at write time. They are stored as byte[], so you can even put data in them.
HBase’s API for data manipulation consists of three primary methods: Get, Put, and Scan. Gets and Puts are specific to particular rows and need the row key to be provided. Scans are done over a range of rows.
put = insert or update
The HBase API defines two ways to read data:
Point lookup: get the record for a given row_key.
Range scan: read all records in a [startRow, stopRow) range.
Both kinds of reads allow you to specify:
A column family we're interested in
A particular column we're interested in
The default behavior for versioned columns is to return only the most recent version. The HBase API also allows asking for:
versions of columns that were valid at some specific timestamp value;
all versions that were valid within a specified [minStamp, maxStamp) interval;
the N most recent versions.
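A minimal Java sketch of these read/write calls, assuming the HBase 2.x client API and an existing table 'testtable' with column family 'colfam1' (names match the shell example below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("testtable"))) {

      // Put = insert or update; the row key is mandatory
      Put put = new Put(Bytes.toBytes("myrow-1"));
      put.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("q1"), Bytes.toBytes("value-1"));
      table.put(put);

      // Point lookup: Get a single row by key, restricted to one column,
      // asking for up to 3 versions instead of only the most recent
      Get get = new Get(Bytes.toBytes("myrow-1"));
      get.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("q1"));
      get.readVersions(3);
      Result single = table.get(get);

      // Range scan over [startRow, stopRow), restricted to one column family
      Scan scan = new Scan()
          .withStartRow(Bytes.toBytes("myrow-1"))
          .withStopRow(Bytes.toBytes("myrow-3"))
          .addFamily(Bytes.toBytes("colfam1"));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(r);
        }
      }
    }
  }
}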
The HBase shell has a 'scan' command; here's an example of its output:
hbase(main):007:0> scan 'testtable'
ROW      COLUMN+CELL
myrow-1  column=colfam1:q1, timestamp=1297345476469, value=value-1
myrow-2  column=colfam1:q2, timestamp=1297345495663, value=value-2
myrow-2  column=colfam1:q3, timestamp=1297345508999, value=value-3
Here, one HBase row produces multiple rows in the query output. Each output row represents one (row_id, column) combination, so rows with multiple columns (and multiple revisions of column data) can be easily represented.
Coprocessor
http://www.3pillarglobal.com/insights/hbase-coprocessors
Coprocessors can be broadly divided into two categories, Observer and Endpoint; each one is discussed separately:
1. Observer Coprocessor: these are just like database triggers, i.e., they execute your custom code on the occurrence of certain events. If you are from a Java background, you can also think of them as advice (before and after only). Observers allow you to hook your custom code in at two places during the lifecycle of an event. One is just before the occurrence of the event: for example, your custom code can run just before a 'Put' operation. All methods providing this feature start with the prefix 'pre'. For example, if you want your code to be executed before the put operation, you should override the following method of the RegionObserver class...
Observer Coprocessors also provide hooks for your code to get executed just after the occurrence of the event (similar to after advice in AOP terminology). These methods start with the prefix 'post'. For example, if you want your code to be executed after the 'Put' operation, you should override the following method...
Either your class should extend one of the Coprocessor base classes (like BaseRegionObserver) or it should implement the Coprocessor interfaces (like Coprocessor, CoprocessorService).
Load the Coprocessor: currently there are two ways to load a Coprocessor. One is static (loading from configuration) and the other is dynamic (loading from the table descriptor, either through Java code or through the HBase shell). Both are discussed below in detail.
Finally, write your client-side code to call the Coprocessor. This is the easiest step, as HBase handles Coprocessors transparently and you don't have to do much to call them.
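A minimal sketch of an Observer, assuming the HBase 1.x coprocessor API (the class name and printed message are illustrative):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class AuditObserver extends BaseRegionObserver {
  // 'pre' hook: runs on the region server just before each Put is applied
  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                     Put put, WALEdit edit, Durability durability) throws IOException {
    System.out.println("about to put row: " + Bytes.toString(put.getRow()));
  }
}

It can then be loaded dynamically from the shell, e.g.: alter 'myTable', METHOD => 'table_att', 'coprocessor' => '<hdfs path to jar>|AuditObserver|1001|'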
2. Endpoint Coprocessor: this kind of Coprocessor can be compared to stored procedures in an RDBMS. They help in performing computations that are not possible through an Observer Coprocessor or otherwise, for example calculating an average or a sum over an entire table that spans multiple regions. They do so by providing a hook for your custom code, which is executed on each region server, with the partial results combined at the client.
Coprocessors are designed to be used by developers, not end users. Coprocessors are executed directly on the region server, so faulty or malicious code can bring your region server down.
https://habrahabr.ru/company/dca/blog/280700/
Data Model: (rowkey, cf:column, timestamp) -> Value
Issue: access by PK only
Range sharding
Sync replication
Append only
HBase is a key-value DB that complements HDFS's capabilities by providing:
- fast random reads and writes, and
- support for updating data.
HBase is strongly consistent and optimized for writes
Column Family
https://www.mapr.com/blog/guidelines-hbase-schema-design
http://bigdatanoob.blogspot.com/
This online access, however, comes at the cost of scan performance.
HBase key-values are stored in tables, and each table is split into multiple regions. A region is a continuous range within the key space, meaning all the rows in the table that sort between the region's start key and end key are stored in the same region. A helpful article from Hortonworks has more details.
http://bigdatanoob.blogspot.com/2012/11/hbase-vs-cassandra.html HBase vs Cassandra
http://kairosdb.github.io/ a time-series database optimized for writes, built on top of Cassandra