Source - Notes from AWS re:invent 2019....
Data storage
PostgreSQL at its core. MPP, shared-nothing, fully columnar; scales out horizontally to 128 compute nodes and 8.2 PB of storage, with added OLAP functions (linear regression, window functions, approximate functions, geospatial support). Everything wrapped in AWS. Launched on Valentine's Day 2013.
Can connect to the leader node using JDBC/ODBC.
Amazon Redshift Spectrum - exposes files on S3 as external tables that can be queried directly, as if they were regular tables.
AQUA - Advanced Query Accelerator
Recently added RA3 node type (compute with Redshift managed storage); earlier only DC2 (dense compute, SSD) & DS2 (dense storage, magnetic disks) were supported.
Columnar architecture improves performance because:
Data scanned for any filter predicate is far lower than in a row-based architecture.
Compression is applied to each column independently, achieving better ratios than compressing whole rows.
Recently added AZ64 encoding has an improved compression ratio and performance; it supports SMALLINT, INT, BIGINT, TIMESTAMP, TIMESTAMPTZ, DATE, and NUMERIC.
For VARCHAR and CHAR, LZO/ZSTD is recommended.
Running the ANALYZE COMPRESSION utility gives the smallest table. Afterwards we can verify which encoding is applied to which column using:
SELECT "column", type, encoding FROM pg_table_def WHERE tablename = 'deep_dive';
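A minimal sketch of the analysis step itself, assuming the same deep_dive table:
-- Samples the table and reports the recommended encoding per column
ANALYZE COMPRESSION deep_dive;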
-Blocks - 1 MB, immutable.
-Zone Maps - min/max values per block; every block gets a zone map automatically.
-Data Sorting - SORTKEY (col1, col2). Sorting makes zone maps more effective. Lower-cardinality columns are preferred over higher-cardinality columns when choosing a sort key; keep the number of sort key columns small, 1-3 is recommended.
-Materialized Columns - Redshift is columnar, hence denormalization is recommended; materializing calculated values also helps.
-Slices - Parallelism is achieved through data distribution, in four ways: KEY, ALL, EVEN, AUTO.
Best practices: Table design summary (a DDL sketch follows this list)
Add compression to columns
Add sort keys on the columns that are frequently filtered on
Materialize often-filtered columns from dimension tables into fact tables
Materialize often-calculated values into tables
Co-locate large tables using DISTSTYLE KEY if the columns do not cause skew
Avoid distribution keys on temporal columns
Keep data types only as wide as necessary (no wider than needed).
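A hedged DDL sketch pulling these practices together; table and column names are illustrative only (aid is borrowed from the upsert example later in these notes):
CREATE TABLE deep_dive (
  aid    BIGINT        ENCODE az64,   -- AZ64 for integer types
  loc    CHAR(3)       ENCODE zstd,   -- LZO/ZSTD recommended for CHAR/VARCHAR
  dt     DATE          ENCODE az64,
  amount NUMERIC(12,2) ENCODE az64
)
DISTSTYLE KEY
DISTKEY (aid)      -- co-locates joins on aid; not a temporal column
SORTKEY (dt);      -- frequently filtered column; keep the sort key to 1-3 columns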
-----
Data ingestion -
Redundancy - two-phase commit (local then global commit). Data is backed up to S3 asynchronously. Temporary tables skip this replication/backup and hence are roughly twice as fast to write as permanent tables.
Transactions - wrap your workflow (ETL) in an explicit transaction; otherwise each statement becomes its own transaction, and commits are expensive.
COPY statement - break input files into a multiple of the number of available slices; use delimited ('|', ',') files sized between 1 MB and 1 GB after gzip compression; use the NULL character (\N) for nulls.
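A hedged COPY sketch following those guidelines; the bucket, prefix, and IAM role are placeholders:
-- Files under the prefix are '|'-delimited, gzip-compressed, and split into a multiple of the slice count
COPY deep_dive
FROM 's3://<bucket>/deep_dive/part'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'
DELIMITER '|' GZIP NULL AS '\N';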
To see how many slices your cluster has:
SELECT count(slice) FROM stv_slices;
Amazon Redshift Spectrum - use INSERT INTO ... SELECT FROM external Amazon S3 tables to ingest data.
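A minimal sketch, assuming an external schema named spectrum has already been defined over the S3 files (schema and table names are made up):
INSERT INTO deep_dive
SELECT *
FROM spectrum.deep_dive_ext
WHERE dt >= '2019-01-01';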
Redshift is designed for large writes: it is a batch processing system optimized for processing massive amounts of data, so a small write (1-10 rows) costs about the same as a larger one (~100k rows). Because blocks are immutable, UPDATE and DELETE internally mark rows as deleted and rewrite the affected blocks. Auto vacuum removes these ghost rows from the table.
Data ingestion: deduplication / upsert
BEGIN;
CREATE TEMP TABLE staging(LIKE deep_dive);
COPY staging FROM 's3://bucket/dd.csv' CREDENTIALS 'creds' COMPUPDATE OFF;
DELETE FROM deep_dive USING staging s WHERE deep_dive.aid = s.aid;
INSERT INTO deep_dive SELECT * FROM staging;
DROP TABLE staging;
COMMIT;
------
Best practices: ELT
Wrap workflow/statements in an explicit transaction
Consider using DROP TABLE or TRUNCATE instead of DELETE
Staging tables:
Use temporary tables, or permanent tables with the "BACKUP NO" option
If possible, use the same DISTSTYLE KEY on both the staging table and the production table to speed up the INSERT INTO ... SELECT statement
COPY statement:
Turn off automatic compression - COMPUPDATE OFF
Copy compression settings from the production table (using the LIKE keyword) or manually apply compression in the CREATE TABLE DDL (from ANALYZE COMPRESSION output)
For copying a large number of rows (hundreds of millions or more), consider using ALTER TABLE APPEND instead of INSERT INTO ... SELECT, as sketched below.
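A minimal sketch of the append variant, reusing the staging/deep_dive tables from the upsert example:
-- Moves the rows (no copy) from staging into deep_dive and leaves staging empty;
-- note: ALTER TABLE APPEND cannot run inside a transaction block
ALTER TABLE deep_dive APPEND FROM staging;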
------------
(Auto) VACUUM - runs either manually or automatically in the background:
It removes rows that are marked as deleted
It globally sorts tables. (for tables with a sort key, ingestion operations will locally sort new data and write it into the unsorted region).
Best practices-
Should be run only as necessary
For the majority of workloads, AUTO VACUUM DELETE will reclaim space and AUTO TABLE SORT will sort the needed portions of the table
VACUUM BOOST should be run at off-peak times (see the examples below).
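A couple of hedged examples of manual runs, using the deep_dive table from these notes:
-- Reclaim space from deleted (ghost) rows only
VACUUM DELETE ONLY deep_dive;
-- Full vacuum with the BOOST option (uses extra resources; schedule off-peak)
VACUUM FULL deep_dive BOOST;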
-----
(Auto) Analyze - It collects table statistics for optimal query planning.
It can be run periodically after ingestion, on just the columns that WHERE predicates filter on.
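A minimal sketch that limits statistics collection to predicate columns (same deep_dive table assumed):
-- Collects stats only for columns actually used in predicates (filters, joins, group by)
ANALYZE deep_dive PREDICATE COLUMNS;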
-----------------
Best practices: WLM (Workload Management) & QMR (Query Monitoring Rules)
Use Auto WLM - if you aren't sure how to set up WLM, your workload is highly unpredictable, or you are using the old default WLM
Use manual WLM - if you understand your workload patterns or need to throttle certain types of queries depending on the time of day
Keep the number of WLM queues to a minimum, typically just three queues, to avoid having unused queues
Use WLM to limit ingestion/ELT concurrency to two to three
To maximize query throughput, use WLM to throttle the number of concurrent queries to 15 or less
Use QMR to log long-running queries (see the query sketch after this list)
Save the superuser queue for administration tasks and cancelling queries.
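One way to review what QMR has flagged is the rule-action system log; a hedged sketch, assuming rules with a 'log' action are configured:
SELECT query, service_class, rule, action, recordtime
FROM stl_wlm_rule_action
WHERE action = 'log'
ORDER BY recordtime DESC;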
Cluster sizing and resizing.
Sizing an Amazon Redshift cluster for production:
Estimate the uncompressed size of the incoming data; assume 3x compression (actual can be > 4x).
Target 30-40% free space (resize to add/remove storage as needed); disk utilization should be at least 15% and less than 80%.
Based on performance requirements, pick SSD or HDD.
If required, additional nodes can be added for increased performance.
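A rough worked example with made-up numbers: 20 TB of uncompressed incoming data / 3x compression ≈ 6.7 TB on disk; keeping 30-40% free space means provisioning roughly 10-11 TB of storage, then adding nodes only if performance requires it.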
Resizing Amazon Redshift cluster - classic & elastic.
Classic resize - provisions a new cluster and copies the data over; allows changing node type; takes longer.
Elastic resize - adds or removes nodes on the existing cluster within minutes.
Amazon Redshift Advisor - provides automated recommendations to improve cluster performance and reduce cost.
Additional resources -