AWS-Redshift Best practices
Conducting a proof of concept for Amazon Redshift
Data load time – using the COPY command is a common way to test how long it takes to load data. For more information, see Amazon Redshift best practices for loading data.
Throughput of the cluster – measuring queries per hour is a common way to determine throughput. To do so,
Data security – you can easily encrypt data at rest and in transit with Amazon Redshift. Amazon Redshift also supports single sign-on (SSO) integration. Amazon Redshift pricing includes built-in security, data compression, backup storage, and data transfer.
Third-party tools integration – you can use either a JDBC or ODBC connection to integrate with business intelligence and other external tools.
Interoperability with other AWS services – Amazon Redshift integrates with other AWS services, such as Amazon EMR, Amazon QuickSight, AWS Glue, Amazon S3, and Amazon Kinesis.
Resizing – your evaluation should include increasing the number or types of Amazon Redshift nodes. Evaluate that the workload throughput before and after a resize meets any variability of the volume of your workload. For more information, see Resizing clusters in Amazon Redshift in the Amazon Redshift Cluster Management Guide.
Concurrency scaling – this feature helps you handle variability of traffic volume in your data warehouse. With concurrency scaling, you can support virtually unlimited concurrent users and concurrent queries, with consistently fast query performance. For more information, see Working with concurrency scaling.
Automatic workload management (WLM) – prioritize your business critical queries over other queries by using automatic WLM. Try setting up queues based on your workloads (for example, a queue for ETL and a queue for reporting). Then enable automatic WLM to allocate the concurrency and memory resources dynamically. For more information, see Implementing automatic WLM.
Amazon Redshift Advisor – the Advisor develops customized recommendations to increase performance and optimize costs by analyzing your workload and usage metrics for your cluster.
Table design – Amazon Redshift provides great performance out of the box for most workloads. When you create a table, the default sort key and distribution key is AUTO. For more information, see Working with automatic table optimization.
Lake house integration – with built-in integration, try using the out-of-box Amazon Redshift Spectrum feature. With Redshift Spectrum, you can extend the data warehouse into your data lake and run queries against petabytes of data in Amazon S3 using your existing cluster. For more information, see Querying external data using Amazon Redshift Spectrum.
-----------------------------------------------------------------------------------------------------------------------------------------
Which in turn affects query performance by reducing the number of I/O operations and minimizing the memory required to process queries.
Query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. Designate both the dimension table's primary key and the fact table's corresponding foreign key as the DISTKEY.
Types of Distribution Styles
KEY Distribution Style :Distributes table rows based on the values in one specified column, the distribution key (DISTKEY).
CREATE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE,
amount DECIMAL(10, 2)
)
DISTKEY (customer_id);er_id);
EVEN Distribution Style :Distributes table rows evenly across all nodes in the cluster.
CREATE TABLE table_name (
column1 data_type,
column2 data_type,
...
)
DISTSTYLE EVEN;
ALL Distribution Style :Distributes a full copy of the table to every node.
Dimension tables to use ALL distribution.
DISTSTYLE ALL is a table distribution style that stores a full copy of the entire table on each node in the Redshift cluster.
When to Use DISTSTYLE ALL
Small Lookup Tables: Ideal for small dimension tables that are frequently joined with fact tables. Since the entire table is copied to all nodes, joins can be performed locally without data shuffling across nodes.
Read-Heavy Workloads: Useful when read performance is critical, and the overhead of storing multiple copies is manageable.
Frequent Joins: When the table is frequently joined with other large tables, this distribution style can help minimize the need for redistribution of data during join operations.
Disadvantages
Storage Overhead: Consumes more storage space since each node holds a full copy of the table. This can be prohibitive for large tables.
Insert and Update Performance: Insert and update operations can be slower since every node needs to be updated.
Advantages
Improved Join Performance: Reduces the need for data redistribution across nodes during joins, which can significantly speed up query performance.
Simplicity: Easy to manage for small tables without worrying about choosing the right distribution keys.
Here’s an example of creating a small dimension table with DISTSTYLE ALL:
CREATE TABLE products (
product_id INT,
product_name VARCHAR(100),
category VARCHAR(50)
) DISTSTYLE ALL;
Amazon Redshift choose the appropriate distribution style, specify AUTO for the distribution style.
A SORTKEY (sort key) is a feature used to optimize the performance of queries by defining how the data in a table is sorted. Amazon Redshift stores your data on disk in sorted order according to the sort key, which can significantly speed up query execution, especially for large tables.
There are two types of sort keys: compound and interleaved.
Compound Sort Key
Definition: Consists of one or more columns that are specified in a particular order.
Usage: Redshift physically stores the data in sorted order based on the first column, then the second column, and so on.
Performance: Particularly useful for queries that filter or join on the leading columns in the sort key. Queries benefit from range-restricted scans, leading to faster query performance.
Interleaved Sort Key
Definition: Consists of one or more columns, but treats them with equal importance by distributing the sort across all specified columns.
Usage: Suitable for tables that are queried using a variety of different columns, not just the leading columns.
Performance: Offers more balanced performance for queries that filter on any of the sort key columns. It is less efficient than a compound sort key for queries that consistently filter on the leading columns.
Compound Sort Key Example
CREATE TABLE sales (
sale_id INT,
sale_date DATE,
customer_id INT,
amount DECIMAL(10, 2)
) COMPOUND SORTKEY (sale_date, customer_id);
Interleaved Sort Key Example
CREATE TABLE sales (
sale_id INT,
sale_date DATE,
customer_id INT,
amount DECIMAL(10, 2)
) INTERLEAVED SORTKEY (sale_date, customer_id);
Best Practices
Analyze Query Patterns: Choose the sort key type and columns based on the most common query patterns. Use compound sort keys for consistent filtering on specific columns and interleaved sort keys for diverse query patterns.
Data Distribution: Ensure that the data distribution and sort keys work together to optimize performance. Poor choices can lead to inefficient queries and increased storage requirements.
Maintenance: Regularly review and adjust sort keys as query patterns and data distribution evolve over time.
Let COPY choose compression encodings
Define primary key and foreign key constraints-->Do not define primary key and foreign key constraints unless your application enforces the constraints. Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints.
Use the smallest possible column size--> Don't make it a practice to use the maximum column size for convenience.
CREATE TABLE region (
R_REGIONKEY bigint NOT NULL,
R_NAME varchar(25),
R_COMMENT varchar(152))
diststyle all;
CREATE TABLE nation (
N_NATIONKEY bigint NOT NULL,
N_NAME varchar(25),
N_REGIONKEY bigint,
N_COMMENT varchar(152))
diststyle all;
create table customer (
C_CUSTKEY bigint NOT NULL,
C_NAME varchar(25),
C_ADDRESS varchar(40),
C_NATIONKEY bigint,
C_PHONE varchar(15),
C_ACCTBAL decimal(18,4),
C_MKTSEGMENT varchar(10),
C_COMMENT varchar(117))
diststyle all;
create table orders (
O_ORDERKEY bigint NOT NULL,
O_CUSTKEY bigint,
O_ORDERSTATUS varchar(1),
O_TOTALPRICE decimal(18,4),
O_ORDERDATE Date,
O_ORDERPRIORITY varchar(15),
O_CLERK varchar(15),
O_SHIPPRIORITY Integer,
O_COMMENT varchar(79))
distkey (O_ORDERKEY)
sortkey (O_ORDERDATE);
create table part (
P_PARTKEY bigint NOT NULL,
P_NAME varchar(55),
P_MFGR varchar(25),
P_BRAND varchar(10),
P_TYPE varchar(25),
P_SIZE integer,
P_CONTAINER varchar(10),
P_RETAILPRICE decimal(18,4),
P_COMMENT varchar(23))
diststyle all;
create table supplier (
S_SUPPKEY bigint NOT NULL,
S_NAME varchar(25),
S_ADDRESS varchar(40),
S_NATIONKEY bigint,
S_PHONE varchar(15),
S_ACCTBAL decimal(18,4),
S_COMMENT varchar(101))
diststyle all;
create table lineitem (
L_ORDERKEY bigint NOT NULL,
L_PARTKEY bigint,
L_SUPPKEY bigint,
L_LINENUMBER integer NOT NULL,
L_QUANTITY decimal(18,4),
L_EXTENDEDPRICE decimal(18,4),
L_DISCOUNT decimal(18,4),
L_TAX decimal(18,4),
L_RETURNFLAG varchar(1),
L_LINESTATUS varchar(1),
L_SHIPDATE date,
L_COMMITDATE date,
L_RECEIPTDATE date,
L_SHIPINSTRUCT varchar(25),
L_SHIPMODE varchar(10),
L_COMMENT varchar(44))
distkey (L_ORDERKEY)
sortkey (L_RECEIPTDATE);
create table partsupp (
PS_PARTKEY bigint NOT NULL,
PS_SUPPKEY bigint NOT NULL,
PS_AVAILQTY integer,
PS_SUPPLYCOST decimal(18,4),
PS_COMMENT varchar(199))
diststyle even;
Loading Data
COPY region FROM 's3://redshift-immersionday-labs/data/region/region.tbl.lzo'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;
Note: A few key takeaways from the above COPY statements.
COMPUPDATE PRESET ON will assign compression using the Amazon Redshift best practices related to the data type of the column but without analyzing the data in the table.
COPY for the REGION table points to a specific file (region.tbl.lzo) while COPY for other tables point to a prefix to multiple files (lineitem.tbl.).
COPY for the SUPPLIER table points a manifest file (supplier.json). A manifest file is a JSON file that lists the files to be loaded and their locations.