Amazon Redshift is a fully managed, fast, and powerful petabyte-scale data warehouse service.
Redshift automatically
sets up, operates, and scales the data warehouse, from provisioning the infrastructure capacity
patches and backs up the data warehouse, storing the backups for a user-defined retention period
monitors the nodes and drives to help recovery from failures
significantly lowers the cost of a data warehouse while making it easy to analyze large amounts of data very quickly
provides fast querying capabilities over structured data using familiar SQL-based clients and business intelligence (BI) tools via standard ODBC and JDBC connections
uses replication and continuous backups to enhance availability and improve data durability, and can automatically recover from node and component failures
scales up or down with a few clicks in the AWS Management Console or with a single API call
distributes & parallelizes queries across multiple physical resources
supports VPC, SSL, AES-256 encryption and Hardware Security Modules (HSMs) to protect the data in transit and at rest.
Redshift supports only Single-AZ deployments; the nodes of a cluster are located within the same AZ, provided the AZ supports Redshift clusters.
Redshift provides monitoring using CloudWatch; metrics for compute utilization, storage utilization, and read/write traffic to the cluster are available, with the ability to add user-defined custom metrics.
Redshift provides Audit logging and AWS CloudTrail integration
Redshift can easily be enabled for a second region for disaster recovery (using cross-region snapshot copy).
Amazon Redshift Enhanced VPC Routing forces all COPY and UNLOAD traffic between the cluster and the data repositories through the VPC.
Redshift will not be able to use S3 VPC endpoints unless Enhanced VPC Routing is enabled.
You can configure Amazon Redshift to copy snapshots for a cluster to another AWS Region. To configure cross-Region snapshot copy, you need to enable this copy feature for each cluster and configure where to copy snapshots and how long to keep copied automated or manual snapshots in the destination AWS Region.
When cross-Region copy is enabled for a cluster, all new manual and automated snapshots are copied to the specified AWS Region. Copied snapshot names are prefixed with copy:.
Amazon Redshift uses a hierarchy of encryption keys to encrypt the database. You can use either AWS Key Management Service (AWS KMS) or a hardware security module (HSM) to manage the top-level encryption keys in this hierarchy. The process that Amazon Redshift uses for encryption differs depending on how you manage keys.
Massively Parallel Processing (MPP)
automatically distributes data and query load across all nodes.
makes it easy to add nodes to the data warehouse and enables fast query performance as the data warehouse grows.
Columnar Data Storage
organizes the data by column, as column-based systems are ideal for data warehousing and analytics, where queries often involve aggregates performed over large data sets
columnar data is stored sequentially on the storage media and requires far fewer I/Os, greatly improving query performance
Advanced Compression
Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk.
employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores.
doesn’t require indexes or materialized views and so uses less space than traditional relational database systems.
automatically samples the data and selects the most appropriate compression scheme when the data is loaded into an empty table
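As a minimal sketch (the bucket, role ARN, and table name are hypothetical), automatic compression during a load and the ANALYZE COMPRESSION command look like this:

    -- COPY into an empty table with no encodings defined applies
    -- automatic compression; COMPUPDATE ON is shown explicitly here.
    COPY sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    COMPUPDATE ON;

    -- Report the compression encodings Redshift would recommend
    -- for the data already in the table.
    ANALYZE COMPRESSION sales;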
Single Node
single node configuration enables getting started quickly and cost-effectively, and scaling up to a multi-node configuration as the needs grow
Multi-Node
Multi-node configuration requires a leader node that manages client connections and receives queries, and two or more compute nodes that store data and perform queries and computations.
Leader node
provisioned automatically, and not charged for
receives queries from client applications, parses the queries, and develops execution plans, which are an ordered set of steps to process these queries
coordinates the parallel execution of these plans with the compute nodes, aggregates the intermediate results from these nodes, and finally returns the results back to the client applications
Compute node
a cluster can contain from 1 to 128 compute nodes, depending on the node type
compute nodes execute the steps specified in the execution plans and transmit data among themselves to serve these queries
intermediate results are sent back to the leader node for aggregation before being sent back to the client applications.
Supports Dense Storage (DS) and Dense Compute (DC) node instance types
Dense Storage (DS): allow creation of very large data warehouses using hard disk drives (HDDs) at a very low price point.
Dense Compute (DC): allow creation of very high performance data warehouses using fast CPUs, large amounts of RAM and solid-state disks (SSDs).
direct access to compute nodes is not allowed
Redshift replicates the data within the data warehouse cluster and continuously backs up the data to S3 (11 9’s durability)
Redshift mirrors each drive’s data to other nodes within the cluster.
Redshift will automatically detect and replace a failed drive or node.
If a drive fails, Redshift
cluster will remain available in the event of a drive failure
the queries will continue with a slight latency increase while Redshift rebuilds the drive from the replica of the data on that drive, which is stored on other drives within that node
single node clusters do not support data replication and the cluster needs to be restored from a snapshot on S3
In case of node failure(s), Redshift
automatically provisions new node(s) and begins restoring data from other drives within the cluster or from S3
prioritizes restoring the most frequently queried data so the most frequently executed queries will become performant quickly
cluster will be unavailable for queries and updates until a replacement node is provisioned and added to the cluster
In case the Redshift cluster's AZ goes down, Redshift
cluster is unavailable until power and network access to the AZ are restored
cluster’s data is preserved and can be used once AZ becomes available
cluster can be restored from any existing snapshots to a new AZ within the same region
Redshift replicates all the data within the data warehouse cluster when it is loaded and also continuously backs up the data to S3.
Redshift always attempts to maintain at least three copies of the data.
Redshift enables automated backups of the data warehouse cluster with a 1-day retention period by default, which can be extended to a maximum of 35 days.
Automated backups can be turned off by setting the retention period to 0.
Redshift can also asynchronously replicate the snapshots to S3 in another region for disaster recovery
Redshift allows scaling of the cluster either by
increasing the node instance type (Vertical scaling)
increasing the number of nodes (Horizontal scaling)
Redshift scaling changes are usually applied during the maintenance window or can be applied immediately
Redshift scaling process
The existing cluster remains available for read operations only while a new data warehouse cluster is created during scaling operations.
Data from the compute nodes in the existing data warehouse cluster is moved in parallel to the compute nodes in the new cluster
When the new data warehouse cluster is ready, the existing cluster will be temporarily unavailable while the canonical name record of the existing cluster is flipped to point to the new data warehouse cluster.
RDS is ideal for
structured data and running traditional relational databases while offloading database administration
for online-transaction processing (OLTP) and for reporting and analysis
Redshift is ideal for
large volumes of structured data that needs to be persisted and queried using standard SQL and existing BI tools
analytic and reporting workloads against very large data sets by harnessing the scale and resources of multiple nodes and using a variety of optimizations to provide improvements over RDS.
preventing reporting and analytic processing from interfering with the performance of the OLTP workload.
EMR is ideal for
processing and transforming unstructured or semi-structured data to bring in to Amazon Redshift and for data sets that are relatively transitory, not stored for long-term use.
Table distribution style determines how data is distributed across compute nodes and helps minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.
Redshift supports four distribution styles: AUTO, EVEN, KEY, and ALL.
KEY distribution
A single column acts as distribution key (DISTKEY) and helps place matching values on the same node slice.
As a rule of thumb you should choose a column that:
Is uniformly distributed – otherwise, skewed data will cause imbalances in the volume of data stored on each compute node.
Is a JOIN column – for tables related to dimension tables (star schema), it is better to choose as the DISTKEY the field that acts as the JOIN field with the largest dimension table, so that matching values from the common columns are physically stored together, reducing the amount of data that needs to be broadcast through the network.
Even Distribution
distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.
Choose EVEN distribution
when the table does not participate in joins
when there is not a clear choice between KEY and ALL distribution.
All Distribution
whole table is replicated in every compute node.
ensures that every row is collocated for every join that the table participates in
ideal for relatively slow-moving tables, i.e. tables that are not updated frequently or extensively
Small dimension tables DO NOT benefit significantly from ALL distribution, because the cost of redistribution is low.
Auto Distribution
Redshift assigns an optimal distribution style based on the size of the table data, e.g. applying ALL distribution for a small table and changing it to EVEN distribution as the table grows
Amazon Redshift applies AUTO distribution by default.
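A minimal DDL sketch of the four styles, with hypothetical table and column names:

    -- KEY: rows with the same customer_id are placed on the same node slice.
    CREATE TABLE sales (
        sale_id     BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(10,2)
    ) DISTSTYLE KEY DISTKEY (customer_id);

    -- EVEN: round-robin distribution, for tables that do not participate in joins.
    CREATE TABLE audit_log (event_id BIGINT, message VARCHAR(256)) DISTSTYLE EVEN;

    -- ALL: the whole table is replicated on every compute node.
    CREATE TABLE country_dim (country_code CHAR(2), country_name VARCHAR(64)) DISTSTYLE ALL;

    -- AUTO (the default): Redshift picks and adjusts the style itself.
    CREATE TABLE staging_events (event_id BIGINT, payload VARCHAR(512)) DISTSTYLE AUTO;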
Sort keys define the order in which the data will be stored.
Only one sort key per table can be defined, but it can be composed of one or more columns.
Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a query uses a range-restricted predicate, the query processor can use the min and max values to rapidly skip over large numbers of blocks during table scans.
There are two kinds of sort keys in Redshift: Compound and Interleaved.
Compound Keys
A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed.
A compound sort key is more efficient when query predicates use a prefix of the sort key, i.e. when conditions such as filters and joins apply to a leading subset of the sort key columns, in order.
Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY and ORDER BY.
Interleaved Sort Keys
An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
An interleaved sort key is more efficient when multiple queries use different columns for filters
Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
Use cases involve performing ad-hoc multi-dimensional analytics, which often requires pivoting, filtering and grouping data using different columns as query dimensions.
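A DDL sketch contrasting the two kinds (table and column names are illustrative):

    -- Compound: efficient when predicates filter on sale_date, or on
    -- sale_date and customer_id together (a prefix of the sort key).
    CREATE TABLE sales_compound (
        sale_date   DATE,
        customer_id BIGINT,
        amount      DECIMAL(10,2)
    ) COMPOUND SORTKEY (sale_date, customer_id);

    -- Interleaved: equal weight to each column, so ad-hoc filters on
    -- customer_id or product_id alone can still skip blocks; per the
    -- advice above, monotonically increasing columns such as dates are avoided.
    CREATE TABLE sales_interleaved (
        customer_id BIGINT,
        product_id  BIGINT,
        amount      DECIMAL(10,2)
    ) INTERLEAVED SORTKEY (customer_id, product_id);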
Constraints
Redshift supports UNIQUE, PRIMARY KEY, and FOREIGN KEY constraints; however, they are informational only.
Redshift does not perform integrity checks for these constraints; they are used by the query planner, as hints, to optimize executions.
Redshift does enforce NOT NULL column constraints.
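A short sketch of this behavior, on hypothetical tables:

    -- Declared, but not enforced: the planner can use them as hints.
    CREATE TABLE customers (
        customer_id BIGINT NOT NULL PRIMARY KEY,
        name        VARCHAR(128)
    );

    CREATE TABLE orders (
        order_id    BIGINT NOT NULL PRIMARY KEY,
        customer_id BIGINT REFERENCES customers (customer_id)
    );

    -- Accepted despite the duplicate key: uniqueness is informational only.
    INSERT INTO customers VALUES (1, 'a'), (1, 'b');

    -- Rejected: NOT NULL is enforced.
    INSERT INTO customers VALUES (NULL, 'c');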
Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries.
Redshift provides query queues, in order to manage concurrency and resource planning. Each queue can be configured with the following parameters:
Slots: number of concurrent queries that can be executed in this queue.
Working memory: percentage of memory assigned to this queue.
Max. Execution Time: the amount of time a query is allowed to run before it is terminated.
Queries can be routed to different queues using Query Groups and User Groups. As a rule of thumb, it is considered a best practice to have separate queues for long-running, resource-intensive queries and for fast queries that don't require large amounts of memory and CPU.
By default, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one. A maximum of eight queues can be defined, with each queue configured with a maximum concurrency level of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.
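For example, assuming a WLM configuration that maps a query group named 'fast' to one of the queues (the group name here is hypothetical), a session can route its queries with SET query_group:

    -- Route subsequent queries in this session to the queue
    -- associated with the 'fast' query group.
    SET query_group TO 'fast';

    SELECT COUNT(*) FROM sales;

    -- Return to default queue routing.
    RESET query_group;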
Sort Key Selection
Redshift stores the data on disk in sorted order according to the sort key, which helps query optimizer to determine optimal query plans.
If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.
Queries are more efficient because they can skip entire blocks that fall outside the time range.
If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
Amazon Redshift can skip reading entire blocks of data for that column. It can do so because it tracks the minimum and maximum column values stored on each block and can skip blocks that don’t apply to the predicate range.
If you frequently join a table, specify the join column as both the sort key and the distribution key.
Doing this enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
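A sketch of a fact and dimension table declared this way (names are illustrative):

    -- Both tables distributed AND sorted on the join column, enabling
    -- a sort merge join that can skip the sort phase.
    CREATE TABLE orders_fact (
        customer_id BIGINT,
        order_total DECIMAL(12,2)
    ) DISTKEY (customer_id) SORTKEY (customer_id);

    CREATE TABLE customer_dim (
        customer_id BIGINT,
        segment     VARCHAR(32)
    ) DISTKEY (customer_id) SORTKEY (customer_id);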
Distribution Style Selection
Distribute the fact table and one dimension table on their common columns.
Your fact table can have only one distribution key. Any tables that join on another key aren’t collocated with the fact table.
Choose one dimension to collocate based on how frequently it is joined and the size of the joining rows.
Designate both the dimension table’s primary key and the fact table’s corresponding foreign key as the DISTKEY.
Choose the largest dimension based on the size of the filtered dataset.
Only the rows that are used in the join need to be distributed, so consider the size of the dataset after filtering, not the size of the table.
Choose a column with high cardinality in the filtered result set.
If you distribute a sales table on a date column, for example, you should probably get fairly even data distribution, unless most of your sales are seasonal.
Change some dimension tables to use ALL distribution.
If a dimension table cannot be collocated with the fact table or other important joining tables, query performance can be improved significantly by distributing the entire table to all of the nodes.
Using ALL distribution multiplies storage space requirements and increases load times and maintenance operations.
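On recent Redshift versions, the distribution style of an existing dimension table can be switched in place; a minimal sketch (the table name is hypothetical):

    -- Replicate a small, slow-moving dimension table to every node.
    ALTER TABLE country_dim ALTER DISTSTYLE ALL;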
Other Practices
Automatic compression produces the best results
COPY command analyzes the data and applies compression encodings to an empty table automatically as part of the load operation
Define primary key and foreign key constraints between tables wherever appropriate. Even though they are informational only, the query optimizer uses those constraints to generate more efficient query plans.
Don’t use the maximum column size for convenience.
Loading Data
You can load data into the tables using the following methods:
Using Multi-Row INSERT
Using Bulk INSERT
Using COPY command
Using staging tables
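A minimal multi-row INSERT sketch (the table and values are hypothetical); the COPY command is covered next:

    -- One statement carrying several rows is far cheaper than
    -- issuing a separate single-row INSERT per record.
    INSERT INTO sales (sale_id, customer_id, amount) VALUES
        (1, 100, 19.99),
        (2, 101, 5.49),
        (3, 102, 42.00);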
Copy Command
COPY command loads data in parallel from S3, EMR, DynamoDB, or multiple data sources on remote hosts.
COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well.
Use a Single COPY Command to Load from Multiple Files
DON’T use multiple concurrent COPY commands to load one table from multiple files as Redshift is forced to perform a serialized load, which is much slower.
Split the Load Data into Multiple Files
divide the data into multiple files of roughly equal size (between 1 MB and 1 GB)
making the number of files a multiple of the number of slices in the cluster helps to distribute the workload uniformly across the cluster
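A sketch of a single COPY loading all split files that share a key prefix (the bucket, prefix, and role ARN are hypothetical):

    -- Every object whose key starts with 'venue/part' is loaded,
    -- in parallel, across the cluster's slices.
    COPY venue
    FROM 's3://my-bucket/venue/part'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    GZIP
    DELIMITER '|';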
Use a Manifest File
S3 provides eventual consistency for some operations, so it is possible that new data will not be available immediately after the upload, which could result in an incomplete data load or loading stale data.
Data consistency can be managed using a manifest file to load data.
A manifest file helps specify different S3 locations in a more efficient way than with the use of S3 prefixes.
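A sketch of a manifest-driven load (the file contents and paths are hypothetical):

    -- manifest.json, stored in S3, lists the exact objects to load:
    -- {
    --   "entries": [
    --     {"url": "s3://my-bucket/venue/part-0000.gz", "mandatory": true},
    --     {"url": "s3://my-bucket/venue/part-0001.gz", "mandatory": true}
    --   ]
    -- }
    COPY venue
    FROM 's3://my-bucket/venue/manifest.json'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    MANIFEST
    GZIP
    DELIMITER '|';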
Compress Your Data Files
Individually compress the load files using gzip, lzop, bzip2, or Zstandard for large datasets
Avoid using compression if you have a small amount of data, because the benefit of compression would be outweighed by the processing cost of decompression.
If the priority is to reduce the time spent by COPY commands, use LZO compression. On the other hand, if the priority is to reduce the size of the files in S3 and the network bandwidth, use bzip2 compression.
Load Data in Sort Key Order
Load your data in sort key order to avoid needing to vacuum.
Presorting rows is not needed in each load because COPY sorts each batch of incoming data as it loads.
Load Data using an IAM Role
use an IAM role with the COPY command instead of embedding access key credentials, as temporary role-based credentials are more secure
Designing Queries
Avoid using select *. Include only the columns you specifically need.
Use a CASE expression to perform complex aggregations instead of selecting from the same table multiple times (see the sketch after this list).
Don’t use cross-joins unless absolutely necessary
Use subqueries in cases where one table in the query is used only for predicate conditions and the subquery returns a small number of rows (less than about 200).
Use predicates to restrict the dataset as much as possible.
In the predicate, use the least expensive operators that you can.
Avoid using functions in query predicates.
If possible, use a WHERE clause to restrict the dataset.
Add predicates to filter tables that participate in joins, even if the predicates apply the same filters.
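A sketch tying a few of these together (tables and columns are hypothetical): one scan with CASE expressions instead of one query per status, specific columns instead of select *, and a WHERE clause restricting the dataset:

    SELECT d.segment,
           SUM(CASE WHEN f.status = 'open'   THEN f.order_total ELSE 0 END) AS open_total,
           SUM(CASE WHEN f.status = 'closed' THEN f.order_total ELSE 0 END) AS closed_total
    FROM   orders_fact f
    JOIN   customer_dim d ON f.customer_id = d.customer_id
    WHERE  f.order_date >= '2023-01-01'   -- restrict the dataset early
    GROUP  BY d.segment;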
By default, any cluster that you create is closed to everyone.
IAM credentials only control access to the Amazon Redshift API-related resources: the Amazon Redshift console, Command Line Interface (CLI), API, and SDK.
To enable access to the cluster from SQL client tools via JDBC or ODBC, you use security groups:
If you are using the EC2-Classic platform for your Amazon Redshift cluster, you must use Amazon Redshift security groups.
If you are using the EC2-VPC platform for your Amazon Redshift cluster, you must use VPC security groups.