Amazon Redshift Table Maintenance
Automatic Table Maintenance - ANALYZE and VACUUM
Analyze:
When loading into an empty table, the COPY command by default collects statistics (ANALYZE). If you are loading a non-empty table with the COPY command, in most cases you don't need to explicitly run the ANALYZE command. Amazon Redshift monitors changes to your workload and automatically updates statistics in the background. To minimize the impact on system performance, automatic analyze runs during periods when workloads are light. You can still run ANALYZE manually on a table, for example:
ANALYZE orders;
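To check whether a table's statistics are stale before deciding to run ANALYZE, you can query the stats_off column in SVV_TABLE_INFO. A quick sketch, assuming the orders table from above:

```sql
-- stats_off indicates how stale the table's statistics are:
-- 0 = statistics are current, 100 = no statistics collected.
select "table", stats_off
from svv_table_info
where "table" = 'orders';
```

A low stats_off value means automatic analyze is keeping up and a manual ANALYZE is unnecessary.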
Vacuum
Vacuum Delete: When you delete rows from a table, the rows are marked for deletion (a soft delete) but are not removed. When you update rows, the existing rows are marked for deletion (soft delete) and the updated rows are inserted as new rows. Amazon Redshift automatically runs a VACUUM DELETE operation in the background to reclaim the disk space occupied by rows that were marked for deletion by UPDATE and DELETE operations, and compacts the table to free up the consumed space. To minimize the impact on system performance, automatic VACUUM DELETE runs during periods when workloads are light.
Let's see how VACUUM DELETE reclaims table space after a delete operation. Run the following query to check the table's current size and row counts:
select "table", size, tbl_rows, estimated_visible_rows
from SVV_TABLE_INFO where "table" = 'orders';
Now, run the VACUUM DELETE command. Copy the following command and run it.
vacuum delete only orders;
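On provisioned clusters, you can monitor a running or recently completed vacuum with the SVV_VACUUM_PROGRESS view (note this view is not available on Redshift Serverless). A sketch:

```sql
-- Shows the status of the most recent vacuum per table,
-- including an estimate of the time remaining if one is running.
select table_name, status, time_remaining_estimate
from svv_vacuum_progress;
```

Once the vacuum completes, re-running the SVV_TABLE_INFO query above should show a smaller size and reduced tbl_rows.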
Vacuum Sort: When you define SORT KEYS on your tables, Amazon Redshift automatically sorts data in the background to maintain table data in the order of its sort key. Having sort keys on frequently filtered columns makes block-level pruning, which is already efficient in Amazon Redshift, even more efficient.
If you need to run VACUUM SORT, you can still manually run it as shown below. Copy the following command and run it.
vacuum sort only orders;
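To judge whether a manual VACUUM SORT is worthwhile, you can check how much of the table is unsorted via SVV_TABLE_INFO. A sketch, again assuming the orders table:

```sql
-- unsorted = percent of rows not in sort-key order;
-- vacuum_sort_benefit estimates the query performance gain from sorting.
select "table", unsorted, vacuum_sort_benefit
from svv_table_info
where "table" = 'orders';
```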
Vacuum recluster:
Use VACUUM RECLUSTER whenever possible for manual VACUUM operations. This is especially important for large tables with frequent ingestion and queries that access only the most recent data. VACUUM RECLUSTER sorts only the portions of the table that are unsorted and hence runs faster; portions of the table that are already sorted are left intact. This command doesn't merge the newly sorted data with the sorted region, and it doesn't reclaim all space that is marked for deletion. To run VACUUM RECLUSTER on the orders table, copy the following command and run it.
vacuum recluster orders;
Vacuum boost:
BOOST runs the VACUUM command with additional compute resources, as they're available. With the BOOST option, VACUUM operates in one window and blocks concurrent deletes and updates for the duration of the VACUUM operation. Note that running VACUUM with the BOOST option contends for system resources, which might affect the performance of other queries. As a result, it is recommended to run VACUUM BOOST when the load on the system is light, such as during maintenance windows. To run VACUUM RECLUSTER on the orders table in boost mode, copy the following command and run it.
vacuum recluster orders boost;
Spectrum Query Tuning - Performance Diagnostics
There are a few utilities that provide visibility into Redshift Spectrum:
EXPLAIN - Provides the query execution plan, including information about what processing is pushed down to Redshift Spectrum.
SVL_S3QUERY_SUMMARY - Statistics for Redshift Spectrum queries are stored in this view. While the execution plan presents cost estimates, this view stores actual statistics of past query runs. (Provisioned clusters)
SYS_EXTERNAL_QUERY_DETAIL - Statistics for Redshift Spectrum queries are stored in this view. While the execution plan presents cost estimates, this view stores actual statistics of past query runs. (Redshift Serverless)
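For example, running EXPLAIN on a query against an external table shows which steps are pushed down to Spectrum. The schema and table names below are placeholders; substitute your own Spectrum objects:

```sql
-- Steps labeled "S3 Query Scan" in the plan indicate work
-- pushed down to the Redshift Spectrum layer.
explain
select count(*)
from spectrum_schema.sales
where saledate >= '2008-01-01';
```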
Provisioned cluster: Run the following query on the SVL_S3QUERY_SUMMARY view:
select elapsed, s3_scanned_rows, s3_scanned_bytes,
       s3query_returned_rows, s3query_returned_bytes,
       files, avg_request_parallelism
from svl_s3query_summary
where query = pg_last_query_id()
order by query, segment;
Serverless cluster: Run the following query on the SYS_EXTERNAL_QUERY_DETAIL view:
SELECT query_id, file_location,
       end_time - start_time AS elapsed,
       total_partitions, qualified_partitions,
       scanned_files, returned_rows, returned_bytes
FROM SYS_EXTERNAL_QUERY_DETAIL
ORDER BY query_id, start_time DESC;
Load JSON Data
SUPER DATA TYPE
The SUPER data type in Amazon Redshift allows you to store semi-structured or nested data in a single column. This data type is designed to handle complex, hierarchical data structures such as JSON, Avro, ORC, and Parquet. The SUPER data type provides flexibility and scalability to manage and query semi-structured data without the need to pre-define a schema.
Create a table to store this data without worrying about the schema or underlying structure:
drop table if exists transaction;
create table transaction (
data_json SUPER
);
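Besides COPY, individual JSON documents can be inserted into a SUPER column with the JSON_PARSE function. A quick sketch (the sample document below is made up for illustration):

```sql
-- JSON_PARSE converts a JSON string into a SUPER value,
-- so the nested structure can be navigated later with PartiQL.
insert into transaction
values (json_parse('{"id": 1, "items": [{"sku": "A1", "qty": 2}]}'));
```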
Now, load the data using the COPY statement. Notice the 'noshred' option, which causes each JSON document to be loaded into a single SUPER column. We will see later how to load JSON data with shredding.
copy transaction
from 's3://redshift-downloads/semistructured/tpch-nested/data/json/customer_orders_lineitem/customer_order_lineitem.json'
iam_role default region 'us-east-1'
json 'noshred';
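After the load completes, you can query attributes of the SUPER column directly with dot notation. The attribute names below (c_custkey, c_name) assume the TPC-H nested sample dataset referenced in the COPY command; adjust them if your data differs:

```sql
-- Dot notation navigates into the SUPER column without unnesting.
select data_json.c_custkey, data_json.c_name
from transaction
limit 5;
```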
Unnest the Data:
To unnest data in Amazon Redshift, especially when dealing with the SUPER data type and nested JSON structures, you can use PartiQL syntax. PartiQL lets you navigate nested structures with dot and bracket notation and iterate over arrays directly in the FROM clause.
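A sketch of unnesting the data loaded above. The attribute names (c_name, c_orders, o_orderkey, o_totalprice) come from the TPC-H nested sample dataset referenced in the COPY command; adjust them if your data differs:

```sql
-- Iterating over the nested c_orders array in the FROM clause
-- produces one output row per array element (per order).
select t.data_json.c_name,
       o.o_orderkey,
       o.o_totalprice
from transaction t, t.data_json.c_orders o
limit 10;
```

Each alias in the FROM clause (here o) ranges over the elements of the array it is joined from, which is how PartiQL flattens nested collections into rows.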