AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.
You can create and run an ETL job with a few clicks in the AWS Management Console.
You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g., table definitions and schemas) in the AWS Glue Data Catalog.
Once cataloged, your data is immediately searchable, queryable, and available for ETL.
AWS Glue generates ETL scripts to transform, flatten, and enrich your data from source to target.
AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries.
AWS Glue is designed to work with semi-structured data.
AWS Glue populates the AWS Glue Data Catalog with table definitions from scheduled crawler programs. Crawlers call classifier logic to infer the schema, format, and data types of your data.
It introduces a component called a dynamic frame, which you can use in your ETL scripts. A dynamic frame is similar to an Apache Spark dataframe, which is a data abstraction used to organize data into rows and columns, except that each record is self-describing so no schema is required initially.
AWS Glue detects schema changes and adapts based on your preferences.
You can convert between dynamic frames and Spark dataframes, so you can take advantage of both AWS Glue and Spark transformations in the same script.
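As a hedged illustration, here is a minimal PySpark sketch of that round trip; the database and table names ("sales_db", "orders") and the filter column are hypothetical placeholders:

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a cataloged table as a dynamic frame; no schema needs to be
# supplied up front, because each record is self-describing.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders")

# Convert to a Spark dataframe to use Spark transformations ...
df = dyf.toDF().filter("amount > 0")

# ... and back to a dynamic frame for Glue-specific transforms and writers.
dyf_clean = DynamicFrame.fromDF(df, glue_context, "dyf_clean")
```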
You can use the AWS Glue console to discover data, transform it, and make it available for search and querying.
You can edit, debug, and test your Python or Scala Apache Spark ETL code using a familiar development environment.
You can use AWS Glue to organize, cleanse, validate, and format data for storage in a data warehouse or data lake.
AWS Glue triggers your ETL jobs based on a schedule or an event, and scales resources as needed to run them.
AWS Glue can catalog your Amazon Simple Storage Service (Amazon S3) data, making it available for querying with Amazon Athena and Amazon Redshift Spectrum.
You can create event-driven ETL pipelines with AWS Glue. You can run your ETL jobs as soon as new data becomes available in Amazon S3 by invoking your AWS Glue ETL jobs from an AWS Lambda function.
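For illustration, a minimal sketch of such a Lambda handler; the job name ("process-orders") and the job argument key are hypothetical, and S3 event notifications must be configured on the bucket separately:

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Pull the bucket/key of the newly created object from the S3 event.
    record = event["Records"][0]["s3"]
    s3_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    # Start the Glue ETL job, passing the new object's path as a job argument.
    response = glue.start_job_run(
        JobName="process-orders",
        Arguments={"--input_path": s3_path},
    )
    return response["JobRunId"]
```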
You can use AWS Glue to understand your data assets.
Less hassle: AWS Glue is integrated across a wide range of AWS services, meaning less hassle for you when onboarding. AWS Glue natively supports data stored in Amazon Aurora and all other Amazon RDS engines, Amazon Redshift, and Amazon S3, as well as common database engines and databases in your Amazon Virtual Private Cloud (Amazon VPC) running on Amazon EC2.
Cost effective: AWS Glue is serverless. There is no infrastructure to provision or manage.
More power: AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.
You define jobs in AWS Glue to accomplish the work that's required to extract, transform, and load (ETL) data from a data source to a data target. You typically perform the following actions:
For data store sources, you define a crawler to populate your AWS Glue Data Catalog with metadata table definitions. You point your crawler at a data store, and the crawler creates table definitions in the Data Catalog. For streaming sources, you manually define Data Catalog tables and specify data stream properties.
AWS Glue can generate a script to transform your data, or you can provide your own script through the AWS Glue console or API.
You can run your job on demand, or you can set it up to start when a specified trigger occurs. The trigger can be a time-based schedule or an event.
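A minimal boto3 sketch of this workflow follows: create a crawler pointed at an S3 data store, run it to populate the Data Catalog, then start an ETL job on demand. All names, paths, and the IAM role ARN are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Define a crawler that catalogs an S3 data store into a Glue database.
glue.create_crawler(
    Name="orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/orders/"}]},
)
glue.start_crawler(Name="orders-crawler")

# Once the catalog tables exist, run the ETL job on demand.
glue.start_job_run(JobName="process-orders")
```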
AWS Glue Data Catalog: The persistent metadata store in AWS Glue. It contains table definitions, job definitions, and other control information to manage your AWS Glue environment. Each AWS account has one AWS Glue Data Catalog per region.
Classifier: Determines the schema of your data. AWS Glue provides classifiers for common file types, such as CSV, JSON, AVRO, XML, and others. It also provides classifiers for common relational database management systems using a JDBC connection. You can write your own classifier by using a grok pattern or by specifying a row tag in an XML document.
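For example, a hedged boto3 sketch of registering a custom grok classifier; the classifier name, classification label, and pattern are illustrative only:

```python
import boto3

glue = boto3.client("glue")

# Register a custom grok classifier for application log files.
glue.create_classifier(
    GrokClassifier={
        "Classification": "app-logs",
        "Name": "app-log-classifier",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    }
)
```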
Connection: A Data Catalog object that contains the properties that are required to connect to a particular data store.
Crawler: A program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the AWS Glue Data Catalog.
Database: A set of associated Data Catalog table definitions organized into a logical group.
Data store, data source, data target: A data store is a repository for persistently storing your data. Examples include Amazon S3 buckets and relational databases. A data source is a data store that is used as input to a process or transform. A data target is a data store that a process or transform writes to.
Development endpoint: An environment that you can use to develop and test your AWS Glue ETL scripts.
Dynamic Frame: A distributed table that supports nested data such as structures and arrays. Each record is self-describing, containing both the data and the schema that describes it, which provides schema flexibility for semi-structured data. You can use both dynamic frames and Apache Spark dataframes in your ETL scripts, and convert between them.
Job: The business logic that is required to perform ETL work. It is composed of a transformation script, data sources, and data targets. Job runs are initiated by triggers, which can fire on a schedule or in response to events.
Notebook server: A web-based environment that you can use to run your PySpark statements. PySpark is a Python dialect for ETL programming.
Script: Code that extracts data from sources, transforms it, and loads it into targets. AWS Glue generates PySpark or Scala scripts.
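For illustration, a minimal sketch of the shape of such a PySpark script (read from a cataloged source, apply a mapping, write to an S3 target); all database, table, column, and path names are hypothetical:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the source table from the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders")

# Transform: rename and retype columns.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "string", "amount", "double")],
)

# Load: write the result to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/processed/orders/"},
    format="parquet",
)
job.commit()
```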
Table: The metadata definition that represents your data. Whether your data is in an Amazon S3 file, an Amazon RDS table, or another data set, a table defines the schema of your data. A table in the AWS Glue Data Catalog consists of the names of columns, data type definitions, partition information, and other metadata about a base dataset. The schema of your data is represented in your AWS Glue table definition. The actual data remains in its original data store.
Transform: The code logic that is used to manipulate your data into a different format.
Trigger: Initiates an ETL job. Triggers can be defined based on a scheduled time or an event.
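For example, a hedged boto3 sketch of a time-based trigger that starts a hypothetical "process-orders" job every day at 06:00 UTC, using the AWS cron schedule expression format:

```python
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="daily-orders-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",  # minutes hours day-of-month month day-of-week year
    Actions=[{"JobName": "process-orders"}],
    StartOnCreation=True,
)
```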
AWS Glue Console: The console you use to define and orchestrate your ETL workflow. It calls several API operations in the AWS Glue Data Catalog and AWS Glue Jobs system to perform the following tasks:
Define AWS Glue objects such as jobs, tables, crawlers, and connections.
Schedule when crawlers run.
Define events or schedules for job triggers.
Search and filter lists of AWS Glue objects.
Edit transformation scripts.
AWS Glue Data Catalog: It is your persistent metadata store. It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore. Each AWS account has one AWS Glue Data Catalog per AWS region. You can use AWS Identity and Access Management (IAM) policies to control access to the data sources managed by the AWS Glue Data Catalog.
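As a closing illustration, a minimal boto3 sketch of reading metadata back from the Data Catalog; the database name is hypothetical:

```python
import boto3

glue = boto3.client("glue")

# List the tables in a catalog database and print each table's column schema.
resp = glue.get_tables(DatabaseName="sales_db")
for table in resp["TableList"]:
    cols = table["StorageDescriptor"]["Columns"]
    print(table["Name"], [(c["Name"], c["Type"]) for c in cols])
```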