Snowflake
A Snowflake data warehouse is created from Snowflake's own site. On creation, you can choose the cloud platform to run it on: Azure, AWS, or GCP.
One key feature of Snowflake is the decoupling of storage and compute.
Architecture
The architecture can be broken into 3 areas:
1. Database storage: structured and semi-structured data in an internally optimized columnar format.
Snowflake uses micro-partitions at the lowest level to store data.
This is a bit like HDFS: small blocks (50-500 MB of uncompressed data) are used as the basic storage units.
The blocks can sit on Azure Blob Storage, AWS S3, etc.
When new data comes in, it is divided and mapped into micro-partitions using the order in which it is loaded, then compressed. Metadata about the new data is generated and stored.
The benefit of micro-partitions is finer query granularity: the metadata helps target only the required blocks (pruning) instead of scanning the whole storage.
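For example (a hedged sketch; table and column names are made up), a filter on a date column lets Snowflake skip micro-partitions whose per-column min/max metadata rules them out:

select count(*)
from sales                        //hypothetical table
where sale_date = '2023-01-01'    //pruning skips partitions whose min/max excludes this date

select system$clustering_information('sales', '(sale_date)')    //inspect partition / clustering metadata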
2. Query processing: queries run on compute resources provided by virtual warehouses.
A virtual warehouse (compute and memory) is needed to run queries (DDL / DML / data loading). Query results are cached for future reuse as well.
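A minimal sketch of creating and using a warehouse (the name and size are illustrative):

create warehouse my_wh with warehouse_size = 'XSMALL'
use warehouse my_wh    //subsequent queries run on this warehouse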
3. Cloud services: supporting services that coordinate activities across the platform and tie everything together.
This layer runs infrastructure management (e.g. cloud resources and storage blocks), metadata management (data attributes and statistics), query parsing and optimization (execution plans), and authentication & access control.
Some notes
Snowflake itself is a data warehouse / data platform; it has no ETL tool built in.
The same data can be queried from different virtual warehouses for different purposes.
Snowflake can read data straight from a data lake.
In Snowflake, you can use databases / storage shared by someone else (usually read-only). You don't own or pay for the storage, but you use your own compute to query the database. When sharing (inbound or outbound), you choose the database and the specific tables and views to be shared.
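On the provider side, an outbound share looks roughly like this (object and account names are made up):

create share my_share
grant usage on database MyDb to share my_share
grant usage on schema MyDb.dbo to share my_share
grant select on table MyDb.dbo.sales to share my_share
alter share my_share add accounts = consumer_account    //the consumer's account identifier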
Snowflake can spawn multiple clusters (not just multiple nodes) to service unexpected workloads. A warehouse can also auto-suspend after being idle for a set period, and auto-resume on the next query. This is handier than Azure SQL, which needs an explicit pause/resume.
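The relevant warehouse settings look roughly like this (values are illustrative; multi-cluster warehouses require the Enterprise edition):

alter warehouse my_wh set
  min_cluster_count = 1
  max_cluster_count = 3    //scale out to up to 3 clusters under load
  auto_suspend = 60        //seconds idle before suspending
  auto_resume = true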
Some limited chart types (line, scatter, bar, scorecard) are supported in the query results. Those charts can be shared with other Snowflake users as well.
In Snowflake, the Marketplace allows people to share / sell datasets and databases.
Create+
Once a database and schema are selected, the top right shows a Create button, which can create heaps of objects for the schema, e.g.
table, view, stage, storage integration, file format, sequence, pipe, stream, task, function, procedure, dynamic table.
Table
Simply a database table, created via DDL, from external storage, from a file, or from a select statement.
E.g. from external storage, it can point to an Azure Blob Storage directory.
A file format must be specified; even the same format like CSV can have different delimiters, quoting, etc.
Under the hood, creating a table from external storage is essentially a COPY statement, i.e. copy into the table from the external storage.
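i.e. something like (stage and file format names are made up):

copy into my_table
from @my_azure_stage/some_folder/
file_format = (format_name = 'my_csv_format')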
View
Can be a standard or a materialized view.
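For example (names made up; materialized views are an Enterprise-edition feature):

create view v_sales as
  select region, amount from sales

create materialized view mv_sales_by_region as
  select region, sum(amount) as total
  from sales
  group by region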
Stage
An external table is a Snowflake feature allowing you to query data stored in an external stage as if the data were inside a table in Snowflake.
Other tables require an internal stage (storage inside Snowflake).
When creating a stage (select the database and schema, then top right -> Create -> Stage), it can be a Snowflake-managed stage or an external stage on S3, Azure, or GCP.
Stages are listed under the database/schema, next to tables.
To map an Azure Blob container as a stage, first go to the container in Azure -> Shared access tokens,
generate an account key, select all the permissions (read, list, write, etc.), and click Generate SAS token and URL.
Copy the URL back into the 'create stage' UI; note it replaces https:// with azure://, and make sure the token is included as well. Once created, the stage UI in Snowflake lists the files and folders in the container.
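The equivalent SQL is roughly (account, container, and token are placeholders):

create stage my_azure_stage
  url = 'azure://myaccount.blob.core.windows.net/mycontainer'
  credentials = (azure_sas_token = '?sv=...')

list @my_azure_stage    //shows the files / folders in the container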
Storage integration
Simply saves the token, authentication, etc. for connecting to external cloud storage (Azure, AWS, ...).
It can be used when creating a stage, so you don't need to copy the token and URL every time.
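A hedged sketch (tenant ID and locations are placeholders):

create storage integration azure_int
  type = external_stage
  storage_provider = 'AZURE'
  enabled = true
  azure_tenant_id = '<tenant-id>'
  storage_allowed_locations = ('azure://myaccount.blob.core.windows.net/mycontainer/')

//a stage can then reference the integration instead of embedding a token
create stage my_stage
  url = 'azure://myaccount.blob.core.windows.net/mycontainer/'
  storage_integration = azure_int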
File Format
type = { CSV | JSON | AVRO | ORC | PARQUET | XML }
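e.g. a CSV format with a custom delimiter (the name is made up):

create file format my_csv_format
  type = csv
  field_delimiter = '|'
  skip_header = 1
  field_optionally_enclosed_by = '"'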
Sequence
It's like range() in Python, with a start, an increment, and a comment.
I guess it's for creating a sequence of values for joining to other data, e.g. years 1900 to 2000.
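For example (name made up):

create sequence my_seq start = 1900 increment = 1 comment = 'years'

select my_seq.nextval    //1900, then 1901 on the next call, etc.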
Pipe
Simply a COPY statement for copying data files from a stage into a table.
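A minimal sketch, reusing the hypothetical stage and file format from above:

create pipe my_pipe auto_ingest = true as
  copy into my_table
  from @my_azure_stage
  file_format = (format_name = 'my_csv_format')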
Stream
Not really a data stream; it records DML (data manipulation) changes made to a table.
It's more like CDC in SQL Server. A stream itself does not contain any table data; it only stores an offset for the source object and returns CDC records by leveraging the versioning history of the source object. One example of a consumer of CDC records is a data pipeline, in which only the data in staging tables that has changed since the last extraction is transformed and copied into other tables.
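A minimal sketch of that staging-table pattern (table names made up):

create stream my_stream on table staging_sales

//reading from the stream in a DML statement advances its offset
insert into sales
  select id, amount from my_stream
  where metadata$action = 'INSERT'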
Task
Simply a scheduler that runs SQL / a stored procedure.
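For example (names made up; tasks are created suspended and must be resumed):

create task my_task
  warehouse = compute_wh
  schedule = '60 MINUTE'
as
  insert into sales select * from staging_sales

alter task my_task resume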
Function/procedure
A database function / stored procedure.
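A small sketch of each (names and bodies made up):

create function area(radius float)
  returns float
  as 'pi() * radius * radius'

create procedure clear_staging()
  returns string
  language sql
as
$$
begin
  truncate table staging_sales;
  return 'done';
end;
$$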
Dynamic table
Similar to a materialized view: it runs a SQL query and surfaces the result like a table.
However, a materialized view supports only a single source table, while a dynamic table supports complex joins and unions of different tables.
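A hedged sketch (names made up; TARGET_LAG controls how fresh the result is kept):

create dynamic table dt_sales_enriched
  target_lag = '5 minutes'
  warehouse = compute_wh
as
  select s.id, s.amount, c.region
  from sales s
  join customers c on s.customer_id = c.id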
Warehouse
Compute resources are associated with warehouses. When running a worksheet, you need to specify the warehouse to run it with.
The data / database, however, lives independently of the compute / warehouse.
Use statements
use role sysadmin //the user role
use warehouse compute_wh //select the compute resource by name
use database abc //db
use schema dbo //schema
Select JSON data / semi-structured data
Say a JSON table stores raw JSON text strings; the column data type is VARIANT.
Snowflake can select JSON elements in the select statement, e.g.
select
col_name:weather[0].name::string as name,
col_name:wind.speed::float as speed
from json_table
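To expand a JSON array into rows, LATERAL FLATTEN can be used, e.g. (a sketch against the same hypothetical table):

select f.value:name::string as name
from json_table,
  lateral flatten(input => col_name:weather) f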
Access control
This is done through roles.
1. create a role first
create role junior_user
2. assign role to a user name
grant role junior_user to user johnsmith
3. give access to the role
grant usage on database MyDb to role junior_user
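Note that usage on the database alone is usually not enough to query it; the role also needs usage on the schema and privileges on the tables, e.g.

grant usage on schema MyDb.dbo to role junior_user
grant select on all tables in schema MyDb.dbo to role junior_user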
Share data, marketplace
A good thing about Snowflake is that there are so many datasets in the Marketplace, free or paid.
This makes some data analysis easier, as many open or domain-specific datasets are already available.
You can share your own databases / tables outbound as well, for free or to make money.