This blog post is a brief summary of my learnings from the series of videos on “Snowflake” by Adam Morton.

What is Snowflake?

  • Removes a lot of constraints
  • Performance
    • Multi-petabyte scale with up to 200x faster performance at 1/10th of the cost
  • Concurrency
    • Multiple groups access data simultaneously with no performance degradation
  • Simplicity
  • Key features
    • Simplicity
      • Fully managed with no knobs or tuning required
      • No indexes, distribution keys, partitioning, vacuuming
      • Zero infrastructure costs
      • Zero admin costs
    • Broad ecosystem of tools
      • Tableau
      • PowerBI
      • Intuitive dashboards
    • Apply your existing SQL skills
    • Recreate environments in seconds - Zero copy cloning
    • Bring all your data

Access Control in Snowflake

  • There are 5 pre-defined system roles
    • AccountAdmin
    • SysAdmin: This role can create warehouses and databases and all objects within a database
    • SecurityAdmin: This role is designed to cater for the administration of security
    • UserAdmin: This role is used for creating roles and users and managing the privileges assigned to them
    • Public
  • As a user, you could be a member of more than just one role, which is typical across a range of applications
  • Snowflake forces you to select a role before you run any queries
  • Hierarchy
    • AccountAdmin
      • SysAdmin
      • SecurityAdmin
        • UserAdmin
      • Public (automatically granted to every user and role)
  • Roles can also be assigned to other roles, resulting in a role hierarchy
  • Role-based access control (RBAC)
  • When a user creates an object in Snowflake, they effectively own that object and can grant access to it. This is called Discretionary Access Control
  • RBAC then comes into play as the owner grants privileges on the object to specific roles (see the SQL sketch below)
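
As a rough illustration of how discretionary access control and RBAC fit together, here is a minimal SQL sketch that creates a custom role, hangs it under SysAdmin, grants it privileges, and assigns it to a user. The role, database, schema, and user names are all made up for the example.

```sql
-- Create a custom role (done as UserAdmin, which has CREATE ROLE)
USE ROLE USERADMIN;
CREATE ROLE IF NOT EXISTS analyst_role;

-- Build the role hierarchy: hand the custom role up to SysAdmin
GRANT ROLE analyst_role TO ROLE SYSADMIN;

-- Grant privileges on objects to the role (RBAC), not to individual users
USE ROLE SYSADMIN;
GRANT USAGE ON DATABASE sales_db TO ROLE analyst_role;
GRANT USAGE ON SCHEMA sales_db.reporting TO ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.reporting TO ROLE analyst_role;

-- Assign the role to a user
USE ROLE USERADMIN;
GRANT ROLE analyst_role TO USER jane_doe;

-- The user then selects the role before running queries
USE ROLE analyst_role;
```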

How to copy data into Snowflake tables?

  • Stages
    • You need to consider where best you will ‘stage’ your data
    • There are two types of stage - external stages and internal stages
    • External stage
      • When you create an external stage in Snowflake, you can think of it as a pointer to a 3rd party cloud storage location (e.g. an S3 bucket)
      • Creating an external stage also creates an object within the selected schema
    • External tables
      • Reference data files located in your external stage
      • Hold metadata which tells Snowflake where to locate those data files
      • This approach allows data to sit outside of Snowflake while appearing to users as if it resides within Snowflake
    • Internal Stages focus on data within Snowflake
      • User stage - allocated to each user
      • Table stage - Each table has a table stage associated with it
      • Named internal stage - a database object that can hold data files for any suitably privileged user to load
    • File Formats
      • A file format provides information to Snowflake on how to interpret the incoming data files
      • It’s best to define a file format once and reuse it
    • The COPY INTO command supports basic, lightweight transformations (a sketch follows below)
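
Putting those pieces together, a minimal sketch might look like the following. The bucket URL, file format, stage, table, and column names are all made up for illustration, and the credentials/storage integration details are omitted.

```sql
-- Define a reusable file format once
CREATE FILE FORMAT IF NOT EXISTS csv_format
  TYPE = CSV
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;

-- External stage: a pointer to a third-party cloud storage location
-- (credentials / storage integration omitted for brevity)
CREATE STAGE IF NOT EXISTS orders_ext_stage
  URL = 's3://my-bucket/orders/'
  FILE_FORMAT = (FORMAT_NAME = 'csv_format');

-- Load the staged files; COPY INTO supports lightweight transformations
COPY INTO raw_orders (order_id, order_date, amount)
FROM (
  SELECT $1, TO_DATE($2, 'YYYY-MM-DD'), $3::NUMBER(10,2)
  FROM @orders_ext_stage
)
PATTERN = '.*[.]csv';
```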

Loading Data into Snowflake

  • You will want to take advantage of Snowflake’s ability to load data in parallel.
  • Snowflake recommends aiming to produce files of around 100-250 MB compressed
  • The number and capacity of servers in a virtual warehouse influences how many files you can load in parallel
  • Encode the file in ‘UTF-8’ format if possible
  • All semi-structured data in Snowflake will be held in a VARIANT column in the target table (a sketch follows after this list)
  • Snowflake will attempt to treat semi-structured data in the same way as relational data behind the scenes by compressing and storing it in a columnar format
  • It is common to have dedicated virtual warehouses (VWs) which supply the resources for your data pipelines and bulk data loads
  • Load metadata is tracked and stored for 64 days
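
As a sketch of loading semi-structured data into a VARIANT column, assuming JSON files and hypothetical table and stage names:

```sql
-- Target table with a VARIANT column to hold the semi-structured data
CREATE TABLE IF NOT EXISTS raw_events (
  payload VARIANT
);

-- Named internal stage plus a JSON file format
CREATE FILE FORMAT IF NOT EXISTS json_format TYPE = JSON;
CREATE STAGE IF NOT EXISTS events_stage
  FILE_FORMAT = (FORMAT_NAME = 'json_format');

-- Upload local files to the stage (run from SnowSQL), then load them
-- PUT file:///tmp/events/*.json @events_stage;
COPY INTO raw_events
FROM @events_stage
FILE_FORMAT = (FORMAT_NAME = 'json_format');

-- Query paths inside the VARIANT column using the colon notation
SELECT payload:event_type::STRING AS event_type, COUNT(*) AS cnt
FROM raw_events
GROUP BY 1;
```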

How to use Cloning in Snowflake?

  • Cloning supports the easy creation of development or test environments as part of an overall development lifecycle approach
  • At times you have a requirement for production data along with some additional columns derived from the data in the warehouse
  • Another use case is discovery analytics
  • Cloned objects are writable
  • You create copies of database schemas or tables without needing to move any data
  • The following objects can be cloned
    • Databases
    • Schemas
    • Tables
    • Streams
    • External named stages
    • File formats
    • Tasks
    • Pipes
  • What cannot be cloned?
    • Internal Snowflake stages
    • Internal Snowflake pipes
    • Views - You cannot clone a view directly
  • Cloning allows you to selectively create a duplicate object under a different name (see the sketch after this list)
  • Simple to use and super fast
  • It is cost-efficient and writable
  • You can use it for a number of use cases
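
A minimal sketch of zero-copy cloning, using hypothetical database and table names:

```sql
-- Clone an entire production database to create a dev environment
-- No data is physically copied; the clone references the same micro-partitions
CREATE DATABASE dev_db CLONE prod_db;

-- Clone a single table, optionally at a point in the past (Time Travel)
CREATE TABLE orders_backup CLONE orders;
CREATE TABLE orders_yesterday CLONE orders
  AT (OFFSET => -60*60*24);  -- state as of 24 hours ago

-- Clones are writable and only diverge from the source as data changes
ALTER TABLE dev_db.public.orders ADD COLUMN derived_flag BOOLEAN;
```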

Streams and Tasks in Snowflake

  • Setting up and ingesting data from continuous data pipelines creates new challenges
    • How to identify new and changed data?
    • Where best to carry out transforms by applying business logic?
    • How can you orchestrate the load and manage dependencies?
    • How can you ensure reliability in the cloud?
  • Streams and tasks are usually used in tandem
  • Let’s say you want to track changes on a source table in order to merge those incremental changes into a target table
  • A stream object tracks any DML operations against the source table
  • Change Data Capture isn’t new but has become far easier to implement today
  • The primary benefit of using CDC is to help streamline the movement of data
  • What happens when you create a stream?
    • A pair of hidden columns are added to the source table which begin to store change tracking metadata
    • A snapshot of the source table is logically taken to serve as a baseline; this point is referred to as the offset
  • The change table stores both the before and after records between two points in time
  • The change table mirrors the structure of the source table along with the addition of some very handy change tracking columns
  • Multiple queries can query the same changed records from a stream without changing the offset
  • Tasks
    • A sequence of steps you will need to apply in a specific order
    • A task executes a single SQL statement, giving you the ability to chain together a series of tasks and dependencies
    • A typical pattern would be to have a task running every 10 minutes which uses the stream to identify new and changed records and applies transformations using stored procedures
    • The data is then ready to be used in queries, dashboards and analytical models
    • Tasks need compute resources from a virtual warehouse in order to run
    • There is one parent task, called the root task
    • The root task can also have a schedule associated with it
    • Child tasks can then be chained together
  • Streams allow you to easily capture changes on a source table
  • Tasks allow you to link a series of SQL queries together
  • Combining streams and tasks is powerful and commonplace (see the sketch after this list)
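
A minimal sketch of the stream-plus-task pattern, assuming a hypothetical source table raw_orders, target tables dim_orders and order_totals, and a warehouse called etl_wh:

```sql
-- Stream tracks DML changes on the source table from its current offset
CREATE STREAM IF NOT EXISTS raw_orders_stream ON TABLE raw_orders;

-- Root task runs on a schedule and merges changes into the target table
CREATE TASK IF NOT EXISTS merge_orders_task
  WAREHOUSE = etl_wh
  SCHEDULE = '10 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('raw_orders_stream')
AS
  MERGE INTO dim_orders t
  USING raw_orders_stream s
    ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET t.amount = s.amount
  WHEN NOT MATCHED THEN INSERT (order_id, amount) VALUES (s.order_id, s.amount);

-- Child task chained to run after the root task completes
CREATE TASK IF NOT EXISTS refresh_aggregates_task
  WAREHOUSE = etl_wh
  AFTER merge_orders_task
AS
  INSERT OVERWRITE INTO order_totals
  SELECT order_id, SUM(amount) FROM dim_orders GROUP BY order_id;

-- Tasks are created suspended; resume the children first, then the root
ALTER TASK refresh_aggregates_task RESUME;
ALTER TASK merge_orders_task RESUME;
```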

Micro-partitioning and Clustering

  • Snowflake stores data in small blocks called micro-partitions
  • These are relatively small blocks of storage that sit on the underlying cloud provider’s data store
  • As data lands in Snowflake, a few key things happen which are completely transparent to the user. It is these operations which create the solid foundation that, in turn, enables lightning-fast query performance
  • You can break the operations into
    • Divide and map the incoming data using the ordering of data
    • Compress the data
    • Capture and store metadata
  • Benefits of micro-partitions
    • A table can consist of millions of micro-partitions
    • This granularity brings additional flexibility, allowing for finer-grained query processing
    • It is similar to looking for information in a book - you only open the pages you need rather than scanning every page
  • The same approach holds true for semi-structured data
  • Snowflake will attempt to convert the paths within the data into columns under the covers in an opportunistic fashion to support optimization
  • Micro-partitions are arranged using a clustering key, which by default is decided automatically by Snowflake
    • This concept is known as data clustering and it happens automatically as data is loaded into Snowflake
  • As your tables grow large, the natural clustering degrades over time, and ultimately so does query performance
  • Having an effective clustering key allows a query to scan fewer micro-partitions, return less data to the cache and reduce latency for the user or application
  • When the usage pattern against your data is different from the order in which the data was loaded, define a clustering key (see the sketch after this list)
  • Query pruning
    • The process of narrowing down a query to only read what is absolutely required to satisfy the query is referred to as ‘pruning’
    • The efficiency of pruning can be observed by comparing Partitions scanned and Partitions total statistics in the table scan operators within the query profile
    • The wider the gap between the number of partitions scanned and the total partitions, the better
  • The clustering depth of a table provides a value which tells you how many overlapping micro-partitions contain data for a given column in the table
    • The closer the number is to 1, the better the data is clustered
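
As a rough illustration, a clustering key can be defined and its effectiveness inspected like this (the table and column names are hypothetical):

```sql
-- Define a clustering key that matches the common query filter columns
ALTER TABLE sales CLUSTER BY (sale_date, region);

-- Inspect how well the table is clustered on those columns
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date, region)');

-- Average clustering depth for the table (closer to 1 is better)
SELECT SYSTEM$CLUSTERING_DEPTH('sales', '(sale_date, region)');
```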