AWS Athena

Generals

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Data pipeline that performs database automation, Parquet file conversion, table creation, Snappy compression, partitioning, and more.
Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.
Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas and populate your Catalog with new and modified table and partition definitions, and maintain schema versioning.

Advantages

Start querying instantly: Serverless, no ETL, Athena is serverless. You can quickly query your data without having to setup and manage any servers or data warehouses.
Pay per query: Only pay for data scanned. With Amazon Athena, you pay only for the queries that you run. You are charged $5 per terabyte scanned by your queries.
Open, powerful, standard: Built on Presto, runs standard SQL. Amazon Athena uses Presto with ANSI SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet.
Fast, really fast: Interactive performance even for large datasets. With Amazon Athena, you don't have to worry about having enough compute resources to get fast, interactive query performance.
Parquet and ORC both store data in columns, while Avro stores data in a row-based format.

Features

Benefits

Easy to use
High performance
No maintenance required
No server configuration required
Multiple tools integration possible

Integration

Amazon S3
AWS Glue
Presto

Limitations

No DDL supported
Works with external table only
User-Defined Functions not supported

Prerequisites

Must have an AWS account
Enable your account to export your cost and usage data into an S3 bucket.
Prepare buckets for Athena to connect.
AWS creates manifest files using metadata every time it writes to the bucket. Create a folder inside the technology-aws-billing-data bucket known as Athena, which contains only the data.
To simplify the setup, we can use one region: the us-west-2 region.
The final step is downloading the credentials for the new IAM user. The credentials will directly map to the database credentials to connect.

Tables, Databases, and the Data Catalog

In Athena, tables and databases are containers for the metadata definitions that define a schema for underlying source data.
For each dataset, a table needs to exist in Athena. The metadata in the table tells Athena where the data is located in Amazon S3, and specifies the structure of the data, for example, column names, data types, and the name of the table.
Databases are a logical grouping of tables, and also hold only metadata and schema information for a dataset.
Before querying data, a table must be registered in Athena. The registration occurs when you either create tables automatically or manually.
This registration occurs in the AWS Glue Data Catalog and enables Athena to run queries on the data.
1. To create a table automatically, use an AWS Glue crawler from within Athena.
2. When AWS Glue creates a table, it registers it in its own AWS Glue Data Catalog. Athena uses the AWS Glue Data Catalog to store and retrieve this metadata, using it when you run queries to analyze the underlying dataset.
3. Other AWS services can share the AWS Glue Data Catalog .
4. To create a table manually:
  1. Use the Athena console to run the Create Table Wizard.
  2. Use the Athena console to write Hive DDL statements in the Query Editor.
  3. Use the Athena API or CLI to execute a SQL query string with DDL statements.
  4. Use the Athena JDBC or ODBC driver.

Create Database Example

create database if not exists costdb;
create external table if not exists cost (
InvoiceID string,
PayerAccountId string,
LinkedAccountId string,
RecordType string,
RecordId string,
ProductName string,
RateId string,
SubscriptionId string,
PricingPlanId string,
UsageType string,
Operation string,
AvailabilityZone string,
ReservedInstance string,
ItemDescription string,
UsageStartDate string,
UsageEndDate string,
UsageQuantity string,
Rate string,
Cost string,
ResourceId string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
'separatorChar' = ',',
'quoteChar' = '"',
'escapeChar' = ''
)
stored as textfile
location 's3://technology-aws-billing-data/athena'

Example Query

select productname, operation,
sum (cast(cost as double))
from costdb.cost
group by 1, 2
order by 3 desc

AWS Service Integration With Athena

AWS CloudTrail: Querying AWS CloudTrail Logs. You can automatically create tables for querying logs directly from the CloudTrail console, and use those tables to run queries in Athena.
Amazon CloudFront: Querying Amazon CloudFront Logs
Elastic Load Balancing: Querying Application Load Balancer Logs to analyze and understand traffic patterns to and from Elastic Load Balancing instances and backend applications. You can see the source of traffic, latency, and bytes transferred.
Amazon Virtual Private Cloud: Querying Amazon VPC Flow Logs. Query the logs in Athena to investigate network traffic patterns and identify threats and risks across your Amazon VPC network.
AWS CloudFormation: Create named queries with AWS CloudFormation and run them in Athena. Named queries allow you to map a query name to a query and then run it as a saved query from the Athena console. Create Athena workgroups using AWS CloudFormation. Use Athena workgroups to isolate queries for you or your group from other queries in the same account.
AWS Glue Data Catalog: Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your data in Amazon S3. This allows you to create tables and query data in Athena based on a central metadata store available throughout your AWS account and integrated with the ETL and data discovery features of AWS Glue.
Amazon QuickSight: Connecting to Amazon Athena with ODBC and JDBC Drivers. Athena integrates with Amazon QuickSight for easy data visualization. You can use Athena to generate reports or to explore data with business intelligence tools or SQL clients connected with a JDBC or an ODBC driver.
IAM: You can use Athena API actions in IAM permission policies.

Save Cost

This data needs to be partitioned based upon location & date.
A separate Workgroup can be created based upon users, teams, applications or workloads.

Videos

Athena Example With S3

Page updated

Report abuse