AWS Glue Studio
A graph-based studio for designing data pipeline jobs.
Like many other ETL tools, it provides source, transform and target components.
A source can be the Glue Data Catalog, S3, Kinesis, Redshift, various SQL databases, etc.
A transform can be join, split, filter, mapping, custom code, etc.
A target can be the Glue Data Catalog, S3, SQL, etc., similar to a source.
A connector connects to a data source that Glue does not support natively. It can be a custom connector built for your own purpose, e.g. reading data from Google BigQuery. You can list your connector on AWS Marketplace for sale, or purchase existing connectors developed by other people from Marketplace as well.
Simply go to the Visual tab of Glue Studio and start adding source, transform and target components.
For example, add an S3 source pointing to a CSV file and update a few attributes such as the delimiter, the quote character, and whether the first line contains headers.
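For reference, the PySpark that Glue Studio generates for such a source looks roughly like the sketch below; the bucket, path and format option values are placeholders, not the exact output.

    # Sketch of what the generated script does for an S3 CSV source;
    # bucket/path and option values are placeholders.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    source_dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/input/"], "recurse": True},
        format="csv",
        format_options={"separator": ",", "quoteChar": '"', "withHeader": True},
        transformation_ctx="S3Source",
    )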
Add a transform to rename some of the columns, and choose the source node as its parent node so it connects to the source.
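In the generated script, such a rename typically shows up as an ApplyMapping (or RenameField) transform. A hedged sketch, with made-up column names:

    # Sketch of a rename transform; "old_name"/"new_name"/"id" are example columns.
    from awsglue.transforms import ApplyMapping

    renamed_dyf = ApplyMapping.apply(
        frame=source_dyf,
        mappings=[
            ("old_name", "string", "new_name", "string"),
            ("id", "long", "id", "long"),
        ],
        transformation_ctx="RenameColumns",
    )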
Add an S3 target and choose the transform as its parent node. Glue Studio doesn't support drag-and-drop connections, so you need to specify the parent node explicitly; I guess AWS will support drag and drop in the future. Then choose the export format and whether you want to add the target's metadata to the Data Catalog.
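The target step in the generated script looks roughly like this; again the output path is a placeholder:

    # Roughly what the generated script does for the S3 target; path is a placeholder.
    sink_dyf = glueContext.write_dynamic_frame.from_options(
        frame=renamed_dyf,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/output/"},
        format="csv",
        transformation_ctx="S3Target",
    )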
Just a note on the S3 target: it only points to an S3 folder, so the result is saved into that folder, but you don't control the output file's name. The Glue job (which runs Spark) writes something like "run-DataSink0-4-part-r-00000", without any extension. So even if you choose CSV as the output format, you won't get a .csv file, although the content is in CSV format.
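If you really need a .csv extension, one workaround (not part of Glue Studio itself) is a small post-processing step with boto3 that copies each part file to a new key and removes the original; the bucket and prefix below are assumptions:

    # Post-processing sketch (assumed bucket/prefix): copy the unnamed part files
    # written by the job to keys ending in .csv, then delete the originals.
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-bucket"
    prefix = "output/"

    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for i, obj in enumerate(resp.get("Contents", [])):
        key = obj["Key"]
        if key.endswith(".csv"):
            continue
        new_key = f"{prefix}result-{i}.csv"
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)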
You may also check the generated PySpark script in the Script tab. You can copy and paste the script into e.g. a Jupyter notebook to unit test it line by line.
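When pasting into a notebook, the job-wrapper boilerplate at the top and bottom of the generated script is usually what you strip or replace; it looks roughly like the following (illustrative, the exact output may differ). In a notebook you would typically skip getResolvedOptions, job.init and job.commit and keep only the source/transform/target lines.

    # Typical boilerplate of a generated Glue script (illustrative).
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext.getOrCreate()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)
    # ... the source / transform / target code sits here ...
    job.commit()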
Note that if you change the script outside of Glue Studio and copy it back into Glue, it won't work; any change has to be made within Glue Studio.
Other than that, simply save the job and click Run.
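If you prefer the API over the console button, the same run can also be kicked off with boto3; the job name below is a placeholder:

    # Starting the job from code instead of the console Run button;
    # "my-glue-studio-job" is a placeholder job name.
    import boto3

    glue = boto3.client("glue")
    run = glue.start_job_run(JobName="my-glue-studio-job")
    status = glue.get_job_run(JobName="my-glue-studio-job", RunId=run["JobRunId"])
    print(status["JobRun"]["JobRunState"])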