Snowflake Tutorials - Adam Morton
This blog post is a brief summary of my learnings from the series of videos on “Snowflake” by Adam Morton
What is Snowflake?
- Removes a lot of constraints
- Performance
- Multi-petabyte scale with up to 200x faster performance at 1/10th of the cost
- Concurrency
- Multiple groups access data simultaneously with no performance degradation
- Simplicity
- Key features
- Simplicity
- Fully managed with no knobs or tuning required
- No indexes, distribution keys, partitioning, vacuuming
- Zero infrastructure costs
- Zero admin costs
- Broad ecosystem of tools
- Tableau
- PowerBI
- Intuitive dashboards
- Apply your existing SQL skills
- Recreate environments in seconds - Zero copy cloning
- Bring all your data
Access Control in Snowflake
- There are 5 pre-defined system roles
- AccountAdmin
- Sysadmin: This role can create warehouses and databases and all objects within a database
- Securityadmin: This role is designed to handle the administration of security, such as managing grants on objects and roles
- Useradmin: This role is used for creating roles and users and managing the privileges assigned to them
- Public
- As a user, you could be a member of more than just one role, which is typical across a range of applications
- Snowflake forces you to select a role before you run any queries
- Hierarchy
- AccountAdmin
- SysAdmin
- Securityadmin
- Useradmin
- Public
- Roles can also be assigned to other roles, resulting in a role hierarchy
- Role based access control
- When a user creates an object in Snowflake, they effectively own that object and can grant access to it. This is called Discretionary Access Control
- RBAC then comes into play as the owner grants privileges on the objects to specific roles (see the sketch below)
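To make this concrete, here is a minimal sketch of how roles, grants and the role hierarchy might be set up in SQL. The role, database and user names (analyst, sales_db, some_user) are hypothetical, not taken from the videos.

```sql
-- USERADMIN is responsible for creating roles and users.
USE ROLE USERADMIN;
CREATE ROLE analyst;

-- SECURITYADMIN (or the object owner) grants privileges on objects to the role.
USE ROLE SECURITYADMIN;
GRANT USAGE ON DATABASE sales_db TO ROLE analyst;
GRANT USAGE ON SCHEMA sales_db.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE analyst;

-- Roles can be granted to other roles, building the role hierarchy;
-- here SYSADMIN inherits everything granted to analyst.
GRANT ROLE analyst TO ROLE SYSADMIN;

-- Finally, assign the role to a user, who must select it (USE ROLE analyst)
-- before running queries.
GRANT ROLE analyst TO USER some_user;
```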
How to copy data into Snowflake tables?
- Stages
- You need to consider where best you will ‘stage’ your data
- There are two types of stage - external stages and internal stages
- External stage
- When you create an external stage in Snowflake you can think of it like a pointer to a 3rd party cloud storage location
- When you create an external stage in Snowflake this creates an object within the selected schema
- External tables
- Point to data files held in your external stage
- Hold metadata which tells Snowflake where to locate those data files
- This approach allows data to sit outside of Snowflake but appears to users like this data resides within Snowflake
- Internal Stages focus on data within Snowflake
- User stage - allocated to each user
- Table stage - Each table has a table stage associated with it
- A named internal stage - a database object that stores data files within Snowflake and can be accessed by multiple users
- File Formats
- A file format provides information to Snowflake on how to interpret the incoming file
- It’s best to define a file format once and reuse it across loads
- The COPY INTO command supports basic lightweight transformations (see the sketch below)
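As a minimal sketch, assuming hypothetical names and an S3 location (my_csv_format, my_ext_stage, my_table, s3://my-bucket/landing/), the flow of file format, external stage and COPY INTO might look like this; a real external stage over a private bucket would also need credentials or a storage integration:

```sql
-- Define the file format once so it can be reused across loads.
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = CSV
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;

-- An external stage is effectively a pointer to a cloud storage location.
CREATE OR REPLACE STAGE my_ext_stage
  URL = 's3://my-bucket/landing/'
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');

-- COPY INTO supports basic lightweight transformations, such as
-- selecting, reordering and lightly reshaping columns from the staged files.
COPY INTO my_table (id, customer_name)
  FROM (SELECT $1, UPPER($2) FROM @my_ext_stage);
```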
Loading Data into Snowflake
- You will want to take advantage of Snowflake’s ability to load data in parallel.
- Snowflake recommends you aim to produce files of around 100-250MB compressed
- The number and capacity of servers in a virtual warehouse influences how many files you can load in parallel
- Encode the file in ‘UTF-8’ format if possible
- All semi-structured data in Snowflake will be held in a VARIANT column in the target table
- Snowflake will attempt to treat semi-structured data in the same way as relational data behind the scenes by compressing and storing data in a columnar format (see the sketch after this list)
- It is common to have dedicated virtual warehouses which supply the resources for your data pipelines and bulk data loads
- Load metadata is tracked and stored for 64 days
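A minimal sketch of loading semi-structured JSON into a VARIANT column, assuming a hypothetical named internal stage my_json_stage and table raw_events:

```sql
-- The whole JSON document lands in a single VARIANT column.
CREATE OR REPLACE TABLE raw_events (payload VARIANT);

COPY INTO raw_events
  FROM @my_json_stage
  FILE_FORMAT = (TYPE = JSON);

-- Paths inside the VARIANT column can be queried directly;
-- Snowflake stores frequently used paths in columnar form behind the scenes.
SELECT payload:event_type::STRING AS event_type,
       payload:user_id::NUMBER    AS user_id
FROM raw_events;
```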
How to use Cloning in Snowflake?
- Cloning supports the easy creation of development or test environments as part of an overall development lifecycle approach
- At times you may have a requirement for production data along with some additional columns derived from the data in the warehouse
- Another use case is for discovery analytics
- Cloned objects are writable
- You create copies of database schemas or tables without needing to move any data
- The following objects can be cloned
- Databases
- Schemas
- Tables
- Streams
- External named stages
- File formats
- Tasks
- Pipes
- What cannot be cloned?
- Internal Snowflake stages
- Internal Snowflake pipes
- Views - You cannot clone a view directly
- Cloning allows you to selectively create a duplicate object under a different name (see the sketch after this list)
- Simple to use and super fast
- It is cost efficient and writable
- You can use it for a number of use cases
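A minimal sketch of zero-copy cloning, using hypothetical database and table names (prod_db, dev_db, prod_db.sales.orders):

```sql
-- Clone a whole database for a development environment;
-- no data is physically copied and the clone is writable.
CREATE DATABASE dev_db CLONE prod_db;

-- Clone a single table, for example to add derived columns for discovery analytics.
CREATE TABLE dev_db.sales.orders_clone CLONE prod_db.sales.orders;
```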
Streams and Tasks in Snowflake
- Setting up and ingesting data from continuous data pipelines creates new challenges
- How to identify new and changed data?
- Where best to carry out transforms by applying business logic?
- How can you orchestrate the load and manage dependencies?
- How can you ensure reliability in the cloud?
- Streams and tasks are usually used in tandem
- Let’s say you want to track changes on a source table, in order to merge those incremental changes into a target table
- A stream object tracks any DML operations against the source table
- Change Data Capture isn’t new but has become far easier to implement today
- The primary benefit of using CDC is to help streamline the movement of data
- What happens when you create a stream?
- A pair of hidden columns are added to the source table which begin to store change tracking metadata
- A snapshot of the source table is logically created that serves as a baseline. This is referred to as the offset
- The change table stores both the before and after record between two points in time
- The change table mirrors the structure of the source table along with the addition of some very handy change tracking columns
- Multiple queries can query the same changed records from a stream without changing the offset
- Tasks
- A sequence of steps you will need to apply in a specific order
- Tasks execute a single SQL statement, giving you the ability to chain together a series of tasks and dependencies
- A typical pattern would be to have a task running every 10 minutes which uses the stream to identify new and changed records and applies transformations using stored procedures
- The data is then ready to be used in queries, dashboards and analytical models
- You will need to call on resources from a virtual warehouse to run tasks
- There is one parent task, called the root task
- The root task can have a schedule associated with it
- Child tasks can then be chained together
- Streams allow you to easily capture changes on a source table
- Tasks allow you to link a series of SQL queries together
- Combining streams and tasks is powerful and commonplace (see the sketch below)
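A minimal sketch of a stream feeding a scheduled root task, with hypothetical object names (raw.orders, orders_stream, analytics.orders, transform_wh, merge_orders_task):

```sql
-- The stream records DML changes against the source table.
CREATE OR REPLACE STREAM orders_stream ON TABLE raw.orders;

-- A root task with a schedule; it only runs when the stream has data.
CREATE OR REPLACE TASK merge_orders_task
  WAREHOUSE = transform_wh
  SCHEDULE  = '10 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  MERGE INTO analytics.orders AS tgt
  USING (
    -- For updates, a stream emits a DELETE/INSERT pair; taking the INSERT
    -- side gives the latest version of each changed row.
    SELECT * FROM orders_stream WHERE METADATA$ACTION = 'INSERT'
  ) AS src
    ON tgt.order_id = src.order_id
  WHEN MATCHED THEN UPDATE SET tgt.status = src.status
  WHEN NOT MATCHED THEN INSERT (order_id, status) VALUES (src.order_id, src.status);

-- Tasks are created suspended; resume the task to start the schedule.
ALTER TASK merge_orders_task RESUME;
```

Consuming the stream in the MERGE advances its offset, so the next run only sees changes made after this load.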
Micro-partitioning and Clustering
- Snowflake stores data in small blocks of data called micro-partitions
- These are relatively small blocks of storage that sit on the underlying cloud provider’s data store
- As data lands in Snowflake a few key things happen which are completely transparent to the user. It is these operations which create the solid foundation that, in turn, enables lightning-fast query performance
- You can break these operations into
- Divide and map the incoming data using the ordering of data
- Compress the data
- Capture and store metadata
- Benefits of micro-partitions
- A large table can be made up of millions of micro-partitions
- This granularity brings additional flexibility, allowing for finer-grained query processing
- It is similar to looking for information in a book - rather than scanning every page, you can go straight to the relevant pages
- Same approach holds true for semi structured data
- Snowflake will attempt to convert the paths within the data into columns under the covers in an opportunistic fashion to support optimization
- Micro-partitions are arranged using a clustering key which is automatically decided by Snowflake
- This concept is known as data clustering and is applied automatically as data is loaded into Snowflake
- When your tables grow very large, this causes degradation of the natural clustering over time and ultimately of query performance
- Having an effective clustering key allows a query to scan fewer micro-partitions, return less data into the cache and reduce latency for the user or application
- When the usage pattern against your data is different to the order in which the data was loaded - use clustering
- Query pruning
- The process of narrowing down a query to only read what is absolutely required to satisfy the query is referred to as ‘pruning’
- The efficiency of pruning can be observed by comparing Partitions scanned and Partitions total statistics in the table scan operators within the query profile
- The wider the gap between the number of partitions scanned to the total partitions, the better
- The clustering depth of a table provides a value which tells you how many micro-partitions contain data for a given column in the table
- The closer the number is to 1, the better the data is clustered (see the sketch below)
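A minimal sketch of defining a clustering key and checking how well a table is clustered, with hypothetical names (sales, sale_date):

```sql
-- Define a clustering key when queries filter on a column that differs
-- from the order in which the data was loaded.
ALTER TABLE sales CLUSTER BY (sale_date);

-- Returns clustering statistics for the column, including the average
-- clustering depth - the closer to 1, the better the clustering.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date)');
```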