Snowflake
A Snowflake data warehouse is created from Snowflake's own site. On creation, you can choose the cloud platform to run it on: Azure, AWS, or GCP.
One key feature of Snowflake is the decoupling of storage and compute.
Architecture
The architecture can be broken into 3 areas:
1. Database storage: structured and semi-structured data in an internally optimized columnar format.
Snowflake uses micro-partitions at the lowest level to store data.
This is a bit like HDFS: small blocks (50-500 MB of uncompressed data) are used as the basic storage units.
The blocks can sit on Azure Blob Storage, AWS S3, etc.
When new data comes in, it is divided and mapped into micro-partitions using the order in which it is loaded, then compressed. Metadata about the new data is generated and stored.
The benefit of micro-partitions is finer query granularity: the metadata helps target only the required blocks (pruning) instead of scanning the whole storage.
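For example (a hedged sketch; table and column names are made up), a filter on a date column lets Snowflake skip micro-partitions whose per-column min/max metadata rules them out:

select count(*)
from sales                        //hypothetical table
where sale_date = '2023-01-01'    //pruning skips partitions whose min/max excludes this date

select system$clustering_information('sales', '(sale_date)')    //inspect partition / clustering metadata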
2. Query processing: queries run on compute resources provided by virtual warehouses.
A virtual warehouse (compute and memory) is needed to run queries (DDL / DML / data loading). Query results are cached for future reuse as well.
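A minimal sketch of creating and using a warehouse (the name and size are illustrative):

create warehouse my_wh with warehouse_size = 'XSMALL'
use warehouse my_wh    //subsequent queries run on this warehouse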
3. Cloud services: supporting services that coordinate activities across the platform and tie everything together.
This layer runs infrastructure management (e.g. cloud resources and storage blocks), metadata management (data attributes and statistics), query parsing and optimization (execution plans), and authentication & access control.
Some notes
Snowflake itself is a data warehouse / data platform; it has no ETL tool built in.
The same data can be queried from different virtual warehouses for different purposes.
Snowflake can read data straight from a data lake.
In Snowflake, you can use databases / storage shared by someone else (usually read-only). You don't own or pay for the storage, but you use your own compute to query the database. When sharing (inbound or outbound), you choose the database and the specific tables and views to be shared.
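On the provider side, an outbound share looks roughly like this (object and account names are made up):

create share my_share
grant usage on database MyDb to share my_share
grant usage on schema MyDb.dbo to share my_share
grant select on table MyDb.dbo.sales to share my_share
alter share my_share add accounts = consumer_account    //the consumer's account identifier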
Snowflake can spawn multiple clusters (not just multiple nodes) to service unexpected workloads. A warehouse can also auto-suspend after being idle for a set period, and auto-resume on the next query. This is handier than Azure SQL, which needs an explicit pause/resume.
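The relevant warehouse settings look roughly like this (values are illustrative; multi-cluster warehouses require the Enterprise edition):

alter warehouse my_wh set
  min_cluster_count = 1
  max_cluster_count = 3    //scale out to up to 3 clusters under load
  auto_suspend = 60        //seconds idle before suspending
  auto_resume = true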
Some limited chart types (line, scatter, bar, scorecard) are supported in the query results. Those charts can be shared with other Snowflake users as well.
In Snowflake, the Marketplace allows people to share / sell datasets and databases.
Create+
Once a database and schema are selected, the top right shows a Create button, which can create heaps of objects for the schema, e.g.
table, view, stage, storage integration, file format, sequence, pipe, stream, task, function, procedure, dynamic table.
Table
Simply a database table, created via DDL, from external storage, from a file, or from a select statement.
E.g. from external storage, it can point to an Azure Blob Storage directory.
A file format must be specified; even the same format like CSV can have different delimiters, quoting, etc.
Under the hood, creating a table from external storage is essentially a COPY statement, i.e. copy into the table from the external storage.
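i.e. something like (stage and file format names are made up):

copy into my_table
from @my_azure_stage/some_folder/
file_format = (format_name = 'my_csv_format')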
View
Can be a standard or a materialized view.
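For example (names made up; materialized views are an Enterprise-edition feature):

create view v_sales as
  select region, amount from sales

create materialized view mv_sales_by_region as
  select region, sum(amount) as total
  from sales
  group by region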
Stage
An external table is a Snowflake feature allowing you to query data stored in an external stage as if the data were inside a table in Snowflake.
Other tables require an internal stage (storage inside Snowflake).
When creating a stage (select the database and schema, then top right -> Create -> Stage), it can be a Snowflake-managed stage or an external stage on S3, Azure, or GCP.
Stages are listed under the database/schema, next to tables.
To map an Azure Blob container as a stage, first go to the container in Azure -> Shared access tokens,
generate an account key, select all the permissions (read, list, write, etc.), and click Generate SAS token and URL.
Copy the URL back into the 'create stage' UI; note it replaces https:// with azure://, and make sure the token is included as well. Once created, the stage UI in Snowflake lists the files and folders in the container.
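The equivalent SQL is roughly (account, container, and token are placeholders):

create stage my_azure_stage
  url = 'azure://myaccount.blob.core.windows.net/mycontainer'
  credentials = (azure_sas_token = '?sv=...')

list @my_azure_stage    //shows the files / folders in the container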
Storage integration
Simply saves the token, authentication, etc. for connecting to external cloud storage (Azure, AWS, ...).
It can be used when creating a stage, so you don't need to copy the token and URL every time.
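A hedged sketch (tenant ID and locations are placeholders):

create storage integration azure_int
  type = external_stage
  storage_provider = 'AZURE'
  enabled = true
  azure_tenant_id = '<tenant-id>'
  storage_allowed_locations = ('azure://myaccount.blob.core.windows.net/mycontainer/')

//a stage can then reference the integration instead of embedding a token
create stage my_stage
  url = 'azure://myaccount.blob.core.windows.net/mycontainer/'
  storage_integration = azure_int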
File Format
type = { CSV | JSON | AVRO | ORC | PARQUET | XML }
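e.g. a CSV format with a custom delimiter (the name is made up):

create file format my_csv_format
  type = csv
  field_delimiter = '|'
  skip_header = 1
  field_optionally_enclosed_by = '"'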
Sequence
It's like range() in Python, with a start, an increment, and a comment.
I guess it's for creating a sequence of values for joining to other data, e.g. years 1900 to 2000.
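For example (name made up):

create sequence my_seq start = 1900 increment = 1 comment = 'years'

select my_seq.nextval    //1900, then 1901 on the next call, etc.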
Pipe
Simply a COPY statement for copying data files from a stage into a table.
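A minimal sketch, reusing the hypothetical stage and file format from above:

create pipe my_pipe auto_ingest = true as
  copy into my_table
  from @my_azure_stage
  file_format = (format_name = 'my_csv_format')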
Stream
Not really a data stream; it records DML (data manipulation) changes made to a table.
It's more like CDC in SQL Server. A stream itself does not contain any table data; it only stores an offset for the source object and returns CDC records by leveraging the versioning history of the source object. One example of a consumer of CDC records is a data pipeline, in which only the data in staging tables that has changed since the last extraction is transformed and copied into other tables.
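A minimal sketch of that staging-table pattern (table names made up):

create stream my_stream on table staging_sales

//reading from the stream in a DML statement advances its offset
insert into sales
  select id, amount from my_stream
  where metadata$action = 'INSERT'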
Task
Simply a scheduler that runs SQL / a stored procedure.
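For example (names made up; tasks are created suspended and must be resumed):

create task my_task
  warehouse = compute_wh
  schedule = '60 MINUTE'
as
  insert into sales select * from staging_sales

alter task my_task resume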
Function/procedure
A database function / stored procedure.
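A small sketch of each (names and bodies made up):

create function area(radius float)
  returns float
  as 'pi() * radius * radius'

create procedure clear_staging()
  returns string
  language sql
as
$$
begin
  truncate table staging_sales;
  return 'done';
end;
$$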
Dynamic table
Similar to a materialized view: it runs a SQL query and surfaces the result like a table.
However, a materialized view supports only a single source table, while a dynamic table supports complex joins and unions of different tables.
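A hedged sketch (names made up; TARGET_LAG controls how fresh the result is kept):

create dynamic table dt_sales_enriched
  target_lag = '5 minutes'
  warehouse = compute_wh
as
  select s.id, s.amount, c.region
  from sales s
  join customers c on s.customer_id = c.id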
Warehouse
Compute resources are associated with warehouses. When running a worksheet, you need to specify the warehouse to run it with.
The data / database, however, lives independently of the compute / warehouse.
Use statements
use role sysadmin //the user role
use warehouse compute_wh //select the compute resource by name
use database abc //db
use schema dbo //schema
Select JSON data / semi-structured data
Say a JSON table stores raw JSON text strings; the column data type is VARIANT.
Snowflake can select JSON elements in the select statement, e.g.
select
col_name:weather[0].name::string as name,
col_name:wind.speed::float as speed
from json_table
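To expand a JSON array into rows, LATERAL FLATTEN can be used, e.g. (a sketch against the same hypothetical table):

select f.value:name::string as name
from json_table,
  lateral flatten(input => col_name:weather) f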
Access control
This is done through roles.
1. create a role first
create role junior_user
2. assign role to a user name
grant role junior_user to user johnsmith
3. give access to the role
grant usage on database MyDb to role junior_user
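Note that usage on the database alone is usually not enough to query it; the role also needs usage on the schema and privileges on the tables, e.g.

grant usage on schema MyDb.dbo to role junior_user
grant select on all tables in schema MyDb.dbo to role junior_user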
Share data, marketplace
A good thing about Snowflake is that there are so many datasets in the Marketplace, free or paid.
This makes some data analysis easier, as many open or domain-specific datasets are already available.
You can share your own databases / tables outbound as well, for free or to make money.