AWS Athena
Pretty awesome tool for querying files on S3.
First, you need to specify a data source that links to the files. Its purpose is to provide the schema/metadata of the files so that the query engine can query them.
Each table describes the files under one S3 location (in the simplest case, one table per file), and the tables are kept in a catalog database, which in turn is contained in a catalog. By default the catalog is AwsDataCatalog. You can create the catalog database and tables manually or with a crawler.
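To make the catalog / database / table hierarchy concrete, here is a minimal Python sketch that walks it through a boto3 Glue client. It assumes the default AwsDataCatalog is backed by the Glue Data Catalog in your account. The function name list_catalog_tables is made up for this example, and it only reads the first page of each listing to stay short.

```python
def list_catalog_tables(glue):
    """Return {database_name: [(table_name, s3_location), ...]}.

    `glue` is expected to be a boto3 Glue client, e.g. boto3.client("glue").
    Only the first page of each listing is read (no pagination).
    """
    hierarchy = {}
    for database in glue.get_databases()["DatabaseList"]:
        db_name = database["Name"]
        tables = glue.get_tables(DatabaseName=db_name)["TableList"]
        hierarchy[db_name] = [
            # Views and some table types may have no S3 location.
            (table["Name"], table.get("StorageDescriptor", {}).get("Location"))
            for table in tables
        ]
    return hierarchy


# Usage (needs AWS credentials and the boto3 package):
#   import boto3
#   print(list_catalog_tables(boto3.client("glue")))
```

Each key in the result is a catalog database, and each entry under it is a table together with the S3 location it points to.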
There are 2 ways to create tables:
The first is to use a crawler in AWS Glue: simply click 'Connect data source' and it will guide you through selecting your S3 location and the Glue Data Catalog. Running the crawler then scans the S3 folder and creates table schemas for the files it finds. Each table schema is saved as a table in the catalog database you specified.
The second is to choose 'create table' and select 'from S3 bucket data'. This lets you create the database and table and specify the columns manually for the files in an S3 folder. Note that from 'create table' you can also select 'from AWS Glue Crawler' to use a crawler.
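Both ways can also be done from code instead of the console. The following is a rough Python sketch, not the console's exact behaviour: the first function drives a crawler through a boto3 Glue client, and the second builds the kind of CREATE EXTERNAL TABLE statement you could paste into the Athena query editor. The function names, and the assumption that the files are comma-separated text with one header row, are mine.

```python
def create_and_start_crawler(glue, name, role_arn, database, s3_path):
    """Way 1: let a Glue crawler infer the table schemas.

    `glue` is expected to be a boto3 Glue client, e.g. boto3.client("glue").
    """
    glue.create_crawler(
        Name=name,
        Role=role_arn,          # IAM role the crawler runs as
        DatabaseName=database,  # catalog database that receives the tables
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=name)


def build_create_table_ddl(database, table, columns, s3_location):
    """Way 2: spell out the columns by hand.

    `columns` is a list of (name, athena_type) pairs.
    Assumes comma-separated text files with one header row.
    """
    # Athena expects LOCATION to be a folder (prefix), not a single file.
    if not s3_location.endswith("/"):
        s3_location += "/"
    column_sql = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {column_sql}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


# Hypothetical names, for illustration only:
#   print(build_create_table_ddl(
#       "my_db", "table_a", [("id", "int"), ("name", "string")],
#       "s3://my-bucket/table_a"))
```

Either way the result is the same: a table entry in the catalog database that records the columns and the S3 location.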
When the tables are created, they point to your S3 files and you are ready to query them.
Simply write a query in the query editor using normal SQL syntax:
select * from table_a join table_b on ....
Pretty easy to use.
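The same kind of query can be submitted from code as well. Below is a minimal Python sketch using the boto3 Athena client calls start_query_execution, get_query_execution and get_query_results. The function name run_query, the polling interval and the output location are assumptions for this example. Athena writes result files to an S3 location, so one has to be supplied unless your workgroup already defines it.

```python
import time


def run_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Run `sql` in Athena and return the first page of results as lists.

    `athena` is expected to be a boto3 Athena client,
    e.g. boto3.client("athena").
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Queries run asynchronously, so poll until a final state is reached.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} ended in state {state}")

    # First page only; for a SELECT the first row holds the column names.
    result = athena.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]


# Hypothetical names, for illustration only:
#   import boto3
#   rows = run_query(boto3.client("athena"), "select * from table_a limit 10",
#                    "my_db", "s3://my-bucket/athena-results/")
```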
You can also save queries and create views from queries, just like using a database. It is amazing how straightforward it is. The only potentially confusing part is the set of concepts used by the data source. The database is actually a catalog database, i.e. a metadata database. Within a database, you specify or 'crawl' the metadata for your files, and each set of metadata is called a table. The catalog is just a catalog of metadata databases.
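One way to keep the three levels straight is to write a table name out in full, as catalog, then database, then table. The small Python sketch below builds such a name and a CREATE OR REPLACE VIEW statement around a query. The helper names and the example identifiers are hypothetical, and the helpers do no escaping, so they are only meant for simple names.

```python
def qualified_name(catalog, database, table):
    """Build a fully qualified, double-quoted name for use in Athena queries."""
    return f'"{catalog}"."{database}"."{table}"'


def build_create_view(view_name, select_sql):
    """Wrap a SELECT statement in a CREATE OR REPLACE VIEW statement."""
    return f'CREATE OR REPLACE VIEW "{view_name}" AS\n{select_sql}'


# Hypothetical names, for illustration only.
source = qualified_name("AwsDataCatalog", "my_db", "table_a")
view_sql = build_create_view("recent_rows", f"SELECT * FROM {source} LIMIT 100")
print(view_sql)
```

The view is stored in the catalog database as metadata too, next to the tables it selects from.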