AWS - Redshift

What is Amazon Redshift?

Amazon Redshift is a fast, fully managed data warehouse.It allows you to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.

Amazon Redshift supports Amazon Virtual Private Cloud (Amazon VPC), SSL, AES-256 encryption and Hardware Security Modules (HSMs) to protect your data in transit and at rest.

Data Warehouse System Architecture

What is Redshift Spectrum?

Redshift that enables you to run queries against exabytes of unstructured data in Amazon S3, with no loading or ETL required

How does the performance of Amazon Redshift compare to most traditional databases for data warehousing and analytics?

variety of innovations to achieve up to ten times higher performance than traditional databases 

Columnar Data Storage (allow for increased speed when it comes to accessing large amounts of data)

Advanced Compression:

Massively Parallel Processing (MPP) ( MPP is a distributed design approach in which several processors apply a “divide and conquer” strategy to large data jobs .A large processing job is organized into smaller jobs which are then distributed among a cluster of processors (compute nodes). ) 

Sort and Dist keys (Sort keys and Distribution keys decide how data is stored and indexed across all Redshift nodes.)

Redshift Spectrum

What does a leader node do? What does a compute node do?

A leader node receives queries from client applications, parses the queries and develops execution plans

Compute nodes execute the steps specified in the execution plans and transmit data among themselves to serve these queries.

How do I load data into my Amazon Redshift data warehouse?

You can load data into Amazon Redshift from a range of data sources including Amazon S3, Amazon DynamoDB, Amazon EMR, AWS Glue, AWS Data Pipeline and or any SSH-enabled host on Amazon EC2 or on-premises.

Yes, clients can connect to Amazon Redshift using ODBC or JDBC and issue 'insert' SQL commands to insert the data

Building Multi-AZ or Multi-Region Amazon Redshift Clusters

Amazon Redshift offers different node types

RA3 or DC2 depending on the required performance, data size, and expected data growth.

RA3 nodes with managed storage to optimize data warehouse by scaling and paying for compute and managed storage independently. storage uses large, high-performance SSDs in each RA3 node for fast local storage and Amazon S3 for longer-term durable storage

Choose the number of nodes based on your performance requirements and only pay for the managed storage use. 

RA3 cluster based on the amount of data you process daily. You launch clusters that use the RA3 node types in a virtual private cloud (VPC).

You can't launch RA3 clusters in EC2-Classic

 DS2 nodes enable you to create large data warehouses using hard disk drives (HDDs)

Choose the number of nodes on data size and performance requirements.

DC2 nodes store your data locally for high performance, and as the data size grows, you can add more compute nodes to increase the storage capacity of the cluster.

 You can't launch DC2 clusters in EC2-Classic

Row vs Column Oriented Databases

Row oriented databases are databases that organize data by record, keeping all of the data associated with a record next to each other in memory. 

Postgres

MySQL

Column oriented databases are databases that organize data by field, keeping all of the data associated with a field next to each other in memory. Columnar databases have grown in popularity and provide performance advantages to querying data. They are optimized for reading and computing on columns efficiently.

Common column-oriented databases:

·       Redshift

·       BigQuery

·       Snowflake