http://hortonworks.com/blog/apache-hive-vs-apache-impala-query-performance-comparison/
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_langref_unsupported.html
https://www.safaribooksonline.com/library/view/hbase-design-patterns/9781783981045/ch05.html
http://stackoverflow.com/questions/13911501/when-to-use-hadoop-hbase-hive-and-pig
https://www.quora.com/How-can-we-represent-a-logical-data-model-for-HBase-and-Hive
http://www.slideshare.net/DouglasMoore/douglas-moore-strata-ny-2014-big-data-anti-patterns-v9-dm
http://www.grokit.ca/cnt/ApacheHadoop/ Kafka, ZooKeeper, HDFS and Cassandra.
http://www.qubole.com/resources/hive-and-hadoop-tutorial-and-training-resources/
https://habrahabr.ru/company/dca/blog/270453/
HCatalog is a storage management tool that enables frameworks other than Hive to leverage Hive's data model to read and write data. HCatalog tables provide an abstraction over the data format in HDFS and allow frameworks such as Pig and MapReduce to use the data without being concerned with the underlying format, such as RC, ORC, or text files. HCatInputFormat and HCatOutputFormat, implementations of Hadoop InputFormat and OutputFormat, are the interfaces provided to Pig and MapReduce.
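A minimal sketch of pointing a MapReduce job at an HCatalog-managed table; the database and table names are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class HCatReadJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "read-via-hcatalog");
    // HCatalog resolves the table's on-disk format (text, RC, ORC, ...) from the
    // Hive metastore, so the job never hard-codes a storage format.
    HCatInputFormat.setInput(job, "default", "web_logs");  // hypothetical table
    job.setInputFormatClass(HCatInputFormat.class);
    // ... configure the mapper (values arrive as HCatRecord), output, then:
    // job.waitForCompletion(true);
  }
}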
HIVE
https://cwiki.apache.org/confluence/display/Hive/Home
http://www.slideshare.net/oom65/optimize-hivequeriespptx
https://qubole.zendesk.com/hc/en-us/articles/208693196-Reference-Hive-Tuning
https://habrahabr.ru/company/dca/blog/283212/
https://habrahabr.ru/company/dca/blog/305838/
https://www.qubole.com/resources/cheatsheet/hive-function-cheat-sheet/
The default execution engine is MapReduce, but Hive also supports Tez and Spark.
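The engine can be switched per session (hive.execution.engine is the standard setting):
set hive.execution.engine=tez;   -- or 'mr' (the classic default) or 'spark'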
External tables: dropping the table deletes only the metadata, not the underlying HDFS files.
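A minimal sketch, assuming a hypothetical HDFS directory /data/logs:
CREATE EXTERNAL TABLE logs (line STRING)
LOCATION '/data/logs';
DROP TABLE logs;   -- removes the metastore entry; the files in /data/logs remain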
Hive supports several join types (selectable via SQL hints):
map join (also known as hash join)
bucket join
sort-merge bucket join
regular (shuffle) join
Use EXPLAIN to see the actual plan.
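A sketch of the MAPJOIN hint plus EXPLAIN (table and column names are hypothetical):
EXPLAIN
SELECT /*+ MAPJOIN(d) */ e.name, d.dept_name
FROM employees e JOIN departments d ON (e.dept_id = d.id);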
Hive provides a set of built-in SerDes and also allows users to create custom SerDes based on their data definition. These are as follows:
LazySimpleSerDe
RegexSerDe
AvroSerDe
OrcSerde
ParquetHiveSerDe
JSONSerDe
CSVSerDe
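A SerDe is attached at table creation; a sketch using the built-in RegexSerDe (the regex and columns are illustrative):
CREATE TABLE apache_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)");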
A table in Hive is analogous to a table in a classical relational database. The main difference is that the data of Hive tables is stored simply as ordinary files on HDFS. These can be plain text CSV files, binary sequence files, more complex columnar Parquet files, and other formats.
Hive provides the following set of analytical functions:
RANK
DENSE_RANK
ROW_NUMBER
PERCENT_RANK
CUME_DIST
NTILE
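A sketch of a few of these in a windowed query (table and columns are hypothetical):
SELECT name, dept, salary,
  RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk,
  DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dense_rnk,
  NTILE(4) OVER (PARTITION BY dept ORDER BY salary DESC) AS quartile
FROM employees;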
Hive runs on top of MapReduce, so don't expect interactivity ...
Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data.
Hive supports several complex collection data types, such as arrays, maps, and structs, as table column data types. For example:
CREATE TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
PARTITIONED BY (country STRING, state STRING);
Clustering, sorting, and bucketing clauses attach the same way; a sketch on a separate stocks table (the columns are illustrative):
CREATE EXTERNAL TABLE stocks (
  exchange STRING,
  symbol STRING,
  ymd STRING,
  price_close FLOAT)
CLUSTERED BY (exchange, symbol)
SORTED BY (ymd ASC)
INTO 96 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';
Partitioning tables changes how Hive structures the data storage. If we create the employees table in the mydb database, there will still be an employees directory for the table:
hdfs://master_server/user/hive/warehouse/mydb.db/employees
However, Hive will now create subdirectories reflecting the partitioning structure. For example:
.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
SORT BY orders the data only within each reducer, performing a local ordering: each reducer's output will be sorted, but there is no total order across reducers. Total ordering is traded for better performance. For example:
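-- SORT BY sorts within each reducer; ORDER BY forces a single total order
-- through one reducer (employees table from the example above)
SELECT * FROM employees SORT BY name;   -- locally sorted per reducer, scales out
SELECT * FROM employees ORDER BY name;  -- totally sorted, single reducer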
Hive contains several built-in functions to manipulate arrays and maps. One example is the explode() function, which outputs the items of an array or a map as separate rows (a sketch follows). You can use the explain command to view the execution plan of a Hive query. In addition to simple text files, Hive also supports several other binary storage formats that can be used to store the underlying data of the tables. These include row-based storage formats, such as Hadoop SequenceFiles and Avro files, as well as column-based (columnar) storage formats, such as ORC files and Parquet.
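A minimal sketch of explode() via LATERAL VIEW, using the employees table defined above:
SELECT name, sub
FROM employees
LATERAL VIEW explode(subordinates) subs AS sub;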
http://www.larsgeorge.com/2009/10/hive-vs-pig.html
http://www.infoq.com/presentations/Facebook-Hive-Hadoop
HBASE
https://hbase.apache.org/book.html
https://www.amazon.com/HBase-Design-Patterns-Mark-Kerzner/dp/1783981040
http://jimbojw.com/wiki/index.php?title=Understanding_HBase_column-family_performance_options
HBase vs Cassandra, plus OpenTSDB for time series databases
https://habrahabr.ru/company/badoo/blog/324510/
All row keys are sorted; each region stores a range of the sorted keys; each region is pinned to a region server (i.e., a node in the cluster).
Unlike Hive, HBase is designed for fast access to smaller, more specific data sets. Rather than churning through lots of data with bulky MapReduce jobs, it focuses on writing lots of data fast and reading small amounts very fast.
HBase is superior at range scans and high-cardinality lookups. It is typically deployed for applications requiring fast lookup of data and for data-merging use cases, and is often used with BI aggregation tools such as Apache Kylin.
The row key in an HBase model is the only native way of sorting and indexing data. This key is also used to split data into regions, in a similar way to how partitions are created in a relational table.
An HBase table consists of rows, which are identified by a row key.
Each row has an arbitrary (potentially very large) number of columns.
Columns are grouped into column families; the column family defines how its columns are stored (skipping column families that are not read is an optimization).
Each (row, column) combination can have multiple versions of the data, identified by timestamp.
There are no data types in HBase: values are just one or more bytes.
Data compression happens at the column family level, so schema design needs to take this into account. It is good to keep column family names and qualifiers short, because they are repeated for every row of data; this reduces the data stored and read by HBase. The number of column families should be kept to a bare minimum to keep the number of HFiles low: using the fewest column families reduces disk space consumed and improves load time.
row: contains a row key and one or more columns with values associated with them
column: consists of a column family and a column qualifier, delimited by a : (colon) character
column family: a set of columns and their values; column families should be considered carefully during schema design. All elements of a column family are stored together in a single HFile, so it is important to limit the number of column families to a relatively small amount
column qualifier: added to a column family to provide the index for a given piece of data
cell: a combination of row, column family, and column qualifier; contains a value and a timestamp, which represents the value's version
timestamp: represents the time on the RegionServer when the data was written, but you can specify a different timestamp value when you put data into the cell
The key in HBase is actually made up of several parts: row key, column family, column, and timestamp. The timestamp is the killer feature of HBase: it provides a way to store several versions of a value, which makes HBase a good choice for storing time series data. The key-value pair model now looks like this:
(row key, column family, column, timestamp) -> value
An HBase table corresponds to the following Java-style data structure (schematically, a sorted map of maps):
SortedMap<RowKey, SortedMap<ColumnFamily, SortedMap<ColumnQualifier, SortedMap<Timestamp, Value>>>>
Row keys are lexicographically sorted.
Region: although HBase tables host billions of rows, it is not possible to store all of that together, so a table is divided into chunks. A region is responsible for holding a set of rows, and it hosts a continuous range of rows. When you keep adding rows and a region's size grows beyond a threshold value, it automatically splits into two regions.
RegionServer: regions are hosted by a RegionServer. A RegionServer can host many regions, but a region is always hosted by one and only one RegionServer.
Example:
Create table 'myTable' with a column family 'play':
create 'myTable', {NAME=> 'play', VERSIONS=>org.apache.hadoop.hbase.HConstants::ALL_VERSIONS}
describe 'myTable'
DESCRIPTION ENABLED
'myTable', {NAME => 'play',
DATA_BLOCK_ENCODING => 'NONE',
BLOOMFILTER => 'NONE',
REPLICATION_SCOPE => '0',
VERSIONS => '2147483647',
COMPRESSION => 'NONE',
MIN_VERSIONS => '0',
TTL => '2147483647',
KEEP_DELETED_CELLS => 'false',
BLOCKSIZE => '65536',
IN_MEMORY => 'false',
ENCODE_ON_DISK => 'true',
BLOCKCACHE => 'true'}
count 'myTable'
To define the schema, several properties of HBase tables have to be taken into account:
1. Indexing is done only on the key.
2. Tables are stored sorted by row key. Each region in the table is responsible for a part of the row key space and is identified by its start and end row keys. The region contains a sorted list of rows from the start key to the end key.
3. Everything in HBase tables is stored as a byte[]. There are no types.
4. Atomicity is guaranteed only at the row level. There is no atomicity guarantee across rows, which means that there are no multi-row transactions.
5. Column families have to be defined up front, at table creation time.
6. Column qualifiers are dynamic and can be defined at write time. They are stored as byte[], so you can even put data in them.
HBase’s API for data manipulation consists of three primary methods: Get, Put, and Scan. Gets and Puts are specific to particular rows and need the row key to be provided. Scans are done over a range of rows.
put = insert or update
The HBase API defines two ways to read data:
Point lookup: get the record for a given row_key.
Range scan: read all records in a [startRow, stopRow) range.
Both kinds of reads allow you to specify:
A column family we're interested in
A particular column we're interested in
The default behavior for versioned columns is to return only the most recent version. The HBase API also allows asking for:
versions of columns that were valid at some specific timestamp value;
all versions that were valid within a specified [minStamp, maxStamp) interval;
the N most recent versions.
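A minimal Java sketch of these read/write calls, assuming the HBase 2.x client API and an existing table 'testtable' with column family 'colfam1' (names match the shell example below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("testtable"))) {

      // Put = insert or update; the row key is mandatory
      Put put = new Put(Bytes.toBytes("myrow-1"));
      put.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("q1"), Bytes.toBytes("value-1"));
      table.put(put);

      // Point lookup: Get a single row by key, restricted to one column,
      // asking for up to 3 versions instead of only the most recent
      Get get = new Get(Bytes.toBytes("myrow-1"));
      get.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("q1"));
      get.readVersions(3);
      Result single = table.get(get);

      // Range scan over [startRow, stopRow), restricted to one column family
      Scan scan = new Scan()
          .withStartRow(Bytes.toBytes("myrow-1"))
          .withStopRow(Bytes.toBytes("myrow-3"))
          .addFamily(Bytes.toBytes("colfam1"));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(r);
        }
      }
    }
  }
}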
The HBase shell has a 'scan' command; here's an example of its output:
hbase(main):007:0> scan 'testtable'
ROW      COLUMN+CELL
myrow-1  column=colfam1:q1, timestamp=1297345476469, value=value-1
myrow-2  column=colfam1:q2, timestamp=1297345495663, value=value-2
myrow-2  column=colfam1:q3, timestamp=1297345508999, value=value-3
Here, one HBase row produces multiple rows in the query output. Each output row represents one (row_id, column) combination, so rows with multiple columns (and multiple revisions of column data) can be easily represented.
Coprocessor
http://www.3pillarglobal.com/insights/hbase-coprocessors
Coprocessors can be broadly divided into two categories, Observer and Endpoint; each one is discussed separately:
1. Observer Coprocessor: these are just like database triggers, i.e., they execute your custom code on the occurrence of certain events. If you are from a Java background, you can also think of them as advice (before and after only). Observers allow you to hook your custom code in at two places during the lifecycle of an event. One is just before the occurrence of the event: for example, your custom code can run just before a 'Put' operation. All methods providing this feature start with the prefix 'pre'. For example, if you want your code to be executed before the put operation, you should override the following method of the RegionObserver class...
Observer Coprocessors also provide hooks for your code to get executed just after the occurrence of the event (similar to after advice in AOP terminology). These methods start with the prefix 'post'. For example, if you want your code to be executed after the 'Put' operation, you should override the following method...
Either your class should extend one of the Coprocessor base classes (like BaseRegionObserver) or it should implement the Coprocessor interfaces (like Coprocessor, CoprocessorService).
Load the Coprocessor: currently there are two ways to load a Coprocessor. One is static (loading from configuration) and the other is dynamic (loading from the table descriptor, either through Java code or through the HBase shell). Both are discussed below in detail.
Finally, write your client-side code to call the Coprocessor. This is the easiest step, as HBase handles Coprocessors transparently and you don't have to do much to call them.
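A minimal sketch of an Observer, assuming the HBase 1.x coprocessor API (the class name and printed message are illustrative):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class AuditObserver extends BaseRegionObserver {
  // 'pre' hook: runs on the region server just before each Put is applied
  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                     Put put, WALEdit edit, Durability durability) throws IOException {
    System.out.println("about to put row: " + Bytes.toString(put.getRow()));
  }
}

It can then be loaded dynamically from the shell, e.g.: alter 'myTable', METHOD => 'table_att', 'coprocessor' => '<hdfs path to jar>|AuditObserver|1001|'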
2. Endpoint Coprocessor: this kind of Coprocessor can be compared to stored procedures in an RDBMS. They help in performing computations that are not possible through an Observer Coprocessor or otherwise, for example calculating an average or a sum over an entire table that spans multiple regions. They do so by providing a hook for your custom code, which is executed on each region server, with the partial results combined at the client.
Coprocessors are designed to be used by developers, not end users. Coprocessors are executed directly on the region server, so faulty or malicious code can bring your region server down.
https://habrahabr.ru/company/dca/blog/280700/
Data Model: (rowkey, cf:column, timestamp) -> Value
Issue: access by PK only
Range sharding
Sync replication
Append only
HBase is a key-value DB that complements HDFS's capabilities by providing:
- fast random reads and writes, and
- support for updating data.
HBase is strongly consistent and optimized for writes
Column Family
https://www.mapr.com/blog/guidelines-hbase-schema-design
http://bigdatanoob.blogspot.com/
This online access, however, comes at the cost of scan performance.
HBase key-values are stored in tables, and each table is split into multiple regions. A region is a continuous range within the key space, meaning all the rows in the table that sort between the region's start key and end key are stored in the same region. A helpful article from Hortonworks has more details.
http://bigdatanoob.blogspot.com/2012/11/hbase-vs-cassandra.html HBase vs Cassandra
http://kairosdb.github.io/ a time-series database optimized for writes, built on top of Cassandra