Cloudera-Example

Import data from MySQL into HDFS using Sqoop.

[cloudera@quickstart ~]$ sqoop import-all-tables \
    -m 1 \
    --connect jdbc:mysql://quickstart:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse \
    --hive-import

Verify

[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/

[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/categories/

Hive and Impala also allow you to create tables by defining a schema over existing files with 'CREATE EXTERNAL TABLE' statements, similar to traditional relational databases. In this case, though, Sqoop has already created these tables for us.
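For reference, such an external table over one of the imported Parquet directories would look roughly like the sketch below. This is illustrative only: Sqoop already created the real table, and the column names and types shown here are assumptions.

-- Illustrative sketch only; the real 'categories' table already exists and
-- these column names/types are assumptions.
CREATE EXTERNAL TABLE categories_example (
  category_id INT,
  category_department_id INT,
  category_name STRING)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/categories';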

From the Impala Query Editor in Hue:

invalidate metadata; -- Impala does not constantly poll for metadata changes, so the metadata cache must be invalidated to pick up the newly created tables

show tables; -- this lists the tables Sqoop created

Now queries can be executed in the Impala Query Editor.

select p.product_id, p.product_name, r.revenue
from products p inner join
  (select oi.order_item_product_id, sum(cast(oi.order_item_subtotal as float)) as revenue
   from order_items oi inner join orders o
   on oi.order_item_order_id = o.order_id
   where o.order_status <> 'CANCELED'
   and o.order_status <> 'SUSPECTED_FRAUD'
   group by order_item_product_id) r
on p.product_id = r.order_item_product_id
order by r.revenue desc
limit 10;

We used Sqoop to import the data into Hive but are using Impala to query the data. This is because Hive and Impala can share both data files and the table metadata. Hive works by compiling SQL queries into MapReduce jobs, which makes it very flexible, whereas Impala executes queries itself and is built from the ground up to be as fast as possible, which makes it better for interactive analysis.
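Because both engines share the Hive metastore and the underlying Parquet files, you can verify the sharing yourself. The statements below are an illustrative aside, not part of the original walkthrough.

-- Illustrative aside: run in either the Hive or the Impala editor.
DESCRIBE FORMATTED products;    -- both engines report the same HDFS location
SELECT COUNT(*) FROM products;  -- and return the same row count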

Correlate Structured Data with Unstructured Data

Since Hadoop can store unstructured and semi-structured data alongside structured data without remodelling an entire database, you can just as well ingest, store and process web log events.

For this, you need the web clickstream data. The most common way to ingest web clickstream data is to use Flume. Flume is a scalable real-time ingest framework that allows you to route, filter, aggregate, and do "mini-operations" on data on its way into the scalable processing platform.

The example below, however, moves the log files into HDFS directly from a local directory:

[cloudera@quickstart ~]$ sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/original_access_logs

[cloudera@quickstart ~]$ sudo -u hdfs hadoop fs -copyFromLocal /opt/examples/log_files/access.log.2 /user/hive/warehouse/original_access_logs

Verify

[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/original_access_logs

Now we can build a table in Hive and query the data via Impala and Hue.

First, you'll take advantage of Hive's flexible SerDes (serializers / deserializers) to parse the logs into individual fields using a regular expression.

Second, you'll transfer the data from this intermediate table to one that does not require any special SerDe. Once the data is in this table, you can query it much faster and more interactively using Impala.

Sample Data

36.44.59.115 - - [14/Jun/2014:10:30:15 -0400] "GET /department/footwear/category/cardio%20equipment HTTP/1.1" 200 386 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0"

Use the Hive Query Editor app in Hue to execute the following queries:

The Hive query below creates an external table over the access log data already loaded into '/user/hive/warehouse/original_access_logs'.

CREATE EXTERNAL TABLE intermediate_access_logs (
  ip STRING,
  date STRING,
  method STRING,
  url STRING,
  http_version STRING,
  code1 STRING,
  code2 STRING,
  dash STRING,
  user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
  'output.format.string' = "%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s")
LOCATION '/user/hive/warehouse/original_access_logs';
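As a quick sanity check (an illustrative aside, not part of the original walkthrough), you can confirm that the SerDe splits log lines such as the sample above into the nine fields; note that the ADD JAR statement shown further below must be run first so Hive can load the RegexSerDe.

-- Illustrative sanity check: each log line should be parsed into the nine
-- columns defined on the table (run ADD JAR for hive-contrib first).
SELECT ip, method, url, code1, code2
FROM intermediate_access_logs
LIMIT 5;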

Create an external table for the tokenized access logs:

CREATE EXTERNAL TABLE tokenized_access_logs (
  ip STRING,
  date STRING,
  method STRING,
  url STRING,
  http_version STRING,
  code1 STRING,
  code2 STRING,
  dash STRING,
  user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

ADD JAR /usr/lib/hive/lib/hive-contrib.jar;

The ADD JAR statement above loads hive-contrib.jar, which contains the RegexSerDe needed to read the intermediate table. The query below then runs a MapReduce job, just as the Sqoop import did, to transfer the data from one table to the other in parallel:

INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;
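Once the data lands in the delimited table, it can be queried interactively from Impala. The query below is only an illustrative example (the '/product/' URL pattern is an assumption, not from the original walkthrough); remember to run invalidate metadata in Impala first so it sees the new table.

-- Illustrative follow-up, run from the Impala Query Editor after
-- "invalidate metadata;". The '/product/' URL pattern is an assumption.
SELECT url, COUNT(*) AS hits
FROM tokenized_access_logs
WHERE url LIKE '%/product/%'
GROUP BY url
ORDER BY hits DESC
LIMIT 10;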