This blog post is a brief summary of my learnings from the series of videos on “Snowflake” by Adam Morton.

What is Snowflake?

  • Removes a lot of constraints
  • Performance
    • Multi-petabyte scale with up to 200x faster performance at 1/10th of the cost
  • Concurrency
    • Multiple groups access data simultaneously with no performance degradation
  • Simplicity
  • Key features
    • Simplicity
      • Fully managed with no knobs or tuning required
      • No indexes, distribution keys, partitioning, vacuuming
      • Zero infrastructure costs
      • Zero admin costs
    • Broad ecosystem of tools
      • Tableau
      • PowerBI
      • Intuitive dashboards
    • Apply your existing SQL skills
    • Recreate environments in seconds - Zero copy cloning
    • Bring all your data

Access Control in Snowflake

  • There are 5 pre-defined system roles
    • AccountAdmin
    • SysAdmin: This role can create warehouses and databases and all objects within a database
    • SecurityAdmin: This role is designed to cater for the administration of security
    • UserAdmin: This role is used for creating roles and users and managing the privileges assigned to them
    • Public
  • As a user, you could be a member of more than just one role, which is typical across a range of applications
  • Snowflake forces you to select a role before you run any queries
  • Hierarchy
    • AccountAdmin
      • SysAdmin
      • SecurityAdmin
        • UserAdmin
      • Public (automatically granted to every user and role)
  • Roles can also be assigned to other roles, resulting in a role hierarchy
  • Role-based access control (RBAC)
  • When a user creates an object in Snowflake, they effectively own that object and can grant access to it. This is called Discretionary Access Control
  • RBAC then comes into play as the owner grants privileges on the object to specific roles (see the SQL sketch below)
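
As a rough illustration of how discretionary access control and RBAC fit together, here is a minimal SQL sketch that creates a custom role, hangs it under SysAdmin, grants it privileges, and assigns it to a user. The role, database, schema, and user names are all made up for the example.

```sql
-- Create a custom role (done as UserAdmin, which has CREATE ROLE)
USE ROLE USERADMIN;
CREATE ROLE IF NOT EXISTS analyst_role;

-- Build the role hierarchy: hand the custom role up to SysAdmin
GRANT ROLE analyst_role TO ROLE SYSADMIN;

-- Grant privileges on objects to the role (RBAC), not to individual users
USE ROLE SYSADMIN;
GRANT USAGE ON DATABASE sales_db TO ROLE analyst_role;
GRANT USAGE ON SCHEMA sales_db.reporting TO ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.reporting TO ROLE analyst_role;

-- Assign the role to a user
USE ROLE USERADMIN;
GRANT ROLE analyst_role TO USER jane_doe;

-- The user then selects the role before running queries
USE ROLE analyst_role;
```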

How to copy data into Snowflake tables?

  • Stages
    • You need to consider where best you will ‘stage’ your data
    • There are two types of stage - external stages and internal stages
    • External stage
      • When you create an external stage in Snowflake, you can think of it as a pointer to a 3rd party cloud storage location (e.g. an S3 bucket)
      • Creating an external stage also creates an object within the selected schema
    • External tables
      • Reference data files located in your external stage
      • Hold metadata which tells Snowflake where to locate those data files
      • This approach allows data to sit outside of Snowflake while appearing to users as if it resides within Snowflake
    • Internal Stages focus on data within Snowflake
      • User stage - allocated to each user
      • Table stage - Each table has a table stage associated with it
      • Named internal stage - a database object that can hold data files for any suitably privileged user to load
    • File Formats
      • A file format provides information to Snowflake on how to interpret the incoming data files
      • It’s best to define a file format once and reuse it
    • The COPY INTO command supports basic, lightweight transformations (a sketch follows below)
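
Putting those pieces together, a minimal sketch might look like the following. The bucket URL, file format, stage, table, and column names are all made up for illustration, and the credentials/storage integration details are omitted.

```sql
-- Define a reusable file format once
CREATE FILE FORMAT IF NOT EXISTS csv_format
  TYPE = CSV
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;

-- External stage: a pointer to a third-party cloud storage location
-- (credentials / storage integration omitted for brevity)
CREATE STAGE IF NOT EXISTS orders_ext_stage
  URL = 's3://my-bucket/orders/'
  FILE_FORMAT = (FORMAT_NAME = 'csv_format');

-- Load the staged files; COPY INTO supports lightweight transformations
COPY INTO raw_orders (order_id, order_date, amount)
FROM (
  SELECT $1, TO_DATE($2, 'YYYY-MM-DD'), $3::NUMBER(10,2)
  FROM @orders_ext_stage
)
PATTERN = '.*[.]csv';
```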

Loading Data into Snowflake

  • You will want to take advantage of Snowflake’s ability to load data in parallel.
  • Snowflake recommends aiming to produce files of around 100-250 MB compressed
  • The number and capacity of servers in a virtual warehouse influences how many files you can load in parallel
  • Encode the file in ‘UTF-8’ format if possible
  • All semi-structured data in Snowflake will be held in a VARIANT column in the target table (a sketch follows after this list)
  • Snowflake will attempt to treat semi-structured data in the same way as relational data behind the scenes by compressing and storing it in a columnar format
  • It is common to have dedicated virtual warehouses (VWs) which supply the resources for your data pipelines and bulk data loads
  • Load metadata is tracked and stored for 64 days
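
As a sketch of loading semi-structured data into a VARIANT column, assuming JSON files and hypothetical table and stage names:

```sql
-- Target table with a VARIANT column to hold the semi-structured data
CREATE TABLE IF NOT EXISTS raw_events (
  payload VARIANT
);

-- Named internal stage plus a JSON file format
CREATE FILE FORMAT IF NOT EXISTS json_format TYPE = JSON;
CREATE STAGE IF NOT EXISTS events_stage
  FILE_FORMAT = (FORMAT_NAME = 'json_format');

-- Upload local files to the stage (run from SnowSQL), then load them
-- PUT file:///tmp/events/*.json @events_stage;
COPY INTO raw_events
FROM @events_stage
FILE_FORMAT = (FORMAT_NAME = 'json_format');

-- Query paths inside the VARIANT column using the colon notation
SELECT payload:event_type::STRING AS event_type, COUNT(*) AS cnt
FROM raw_events
GROUP BY 1;
```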

How to use Cloning in Snowflake?

  • Cloning supports the easy creation of development or test environments as part of an overall development lifecycle approach
  • At times you have a requirement for production data along with some additional columns derived from the data in the warehouse
  • Another use case is discovery analytics
  • Cloned objects are writable
  • You create copies of database schemas or tables without needing to move any data
  • The following objects can be cloned
    • Databases
    • Schemas
    • Tables
    • Streams
    • External named stages
    • File formats
    • Tasks
    • Pipes
  • What cannot be cloned?
    • Internal Snowflake stages
    • Internal Snowflake pipes
    • Views - You cannot clone a view directly
  • Cloning allows you to selectively create a duplicate object under a different name (see the sketch after this list)
  • Simple to use and super fast
  • It is cost-efficient and writable
  • You can use it for a number of use cases
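
A minimal sketch of zero-copy cloning, using hypothetical database and table names:

```sql
-- Clone an entire production database to create a dev environment
-- No data is physically copied; the clone references the same micro-partitions
CREATE DATABASE dev_db CLONE prod_db;

-- Clone a single table, optionally at a point in the past (Time Travel)
CREATE TABLE orders_backup CLONE orders;
CREATE TABLE orders_yesterday CLONE orders
  AT (OFFSET => -60*60*24);  -- state as of 24 hours ago

-- Clones are writable and only diverge from the source as data changes
ALTER TABLE dev_db.public.orders ADD COLUMN derived_flag BOOLEAN;
```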

Streams and Tasks in Snowflake

  • Setting up and ingesting data from continuous data pipelines creates new challenges
    • How to identify new and changed data?
    • Where best to carry out transforms by applying business logic?
    • How can you orchestrate the load and manage dependencies?
    • How can you ensure reliability in the cloud?
  • Streams and tasks are usually used in tandem
  • Let’s say you want to track changes on a source table in order to merge those incremental changes into a target table
  • A stream object tracks any DML operations against the source table
  • Change Data Capture isn’t new but has become far easier to implement today
  • The primary benefit of using CDC is to help streamline the movement of data
  • What happens when you create a stream?
    • A pair of hidden columns are added to the source table which begin to store change tracking metadata
    • A snapshot of the source table is logically taken to serve as a baseline; this point is referred to as the offset
  • The change table stores both the before and after records between two points in time
  • The change table mirrors the structure of the source table along with the addition of some very handy change tracking columns
  • Multiple queries can query the same changed records from a stream without changing the offset
  • Tasks
    • A sequence of steps you will need to apply in a specific order
    • A task executes a single SQL statement, giving you the ability to chain together a series of tasks and dependencies
    • A typical pattern would be to have a task running every 10 minutes which uses the stream to identify new and changed records and applies transformations using stored procedures
    • The data is then ready to be used in queries, dashboards and analytical models
    • Tasks need compute resources from a virtual warehouse in order to run
    • There is one parent task, called the root task
    • The root task can also have a schedule associated with it
    • Child tasks can then be chained together
  • Streams allow you to easily capture changes on a source table
  • Tasks allow you to link a series of SQL queries together
  • Combining streams and tasks is powerful and commonplace (see the sketch after this list)
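
A minimal sketch of the stream-plus-task pattern, assuming a hypothetical source table raw_orders, target tables dim_orders and order_totals, and a warehouse called etl_wh:

```sql
-- Stream tracks DML changes on the source table from its current offset
CREATE STREAM IF NOT EXISTS raw_orders_stream ON TABLE raw_orders;

-- Root task runs on a schedule and merges changes into the target table
CREATE TASK IF NOT EXISTS merge_orders_task
  WAREHOUSE = etl_wh
  SCHEDULE = '10 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('raw_orders_stream')
AS
  MERGE INTO dim_orders t
  USING raw_orders_stream s
    ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET t.amount = s.amount
  WHEN NOT MATCHED THEN INSERT (order_id, amount) VALUES (s.order_id, s.amount);

-- Child task chained to run after the root task completes
CREATE TASK IF NOT EXISTS refresh_aggregates_task
  WAREHOUSE = etl_wh
  AFTER merge_orders_task
AS
  INSERT OVERWRITE INTO order_totals
  SELECT order_id, SUM(amount) FROM dim_orders GROUP BY order_id;

-- Tasks are created suspended; resume the children first, then the root
ALTER TASK refresh_aggregates_task RESUME;
ALTER TASK merge_orders_task RESUME;
```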

Micro-partitioning and Clustering

  • Snowflake stores data in small blocks called micro-partitions
  • These are relatively small blocks of storage that sit on the underlying cloud provider’s data store
  • As data lands in Snowflake, a few key things happen which are completely transparent to the user. It is these operations which create the solid foundation that, in turn, enables lightning-fast query performance
  • You can break the operations into
    • Divide and map the incoming data using the ordering of data
    • Compress the data
    • Capture and store metadata
  • Benefits of micro-partitions
    • A table can consist of millions of micro-partitions
    • This granularity brings additional flexibility, allowing for finer-grained query processing
    • It is similar to looking for information in a book - you only open the pages you need rather than scanning every page
  • The same approach holds true for semi-structured data
  • Snowflake will attempt to convert the paths within the data into columns under the covers in an opportunistic fashion to support optimization
  • Micro-partitions are arranged using a clustering key, which by default is decided automatically by Snowflake
    • This concept is known as data clustering and it happens automatically as data is loaded into Snowflake
  • As your tables grow large, the natural clustering degrades over time, and ultimately so does query performance
  • Having an effective clustering key allows a query to scan fewer micro-partitions, return less data to the cache and reduce latency for the user or application
  • When the usage pattern against your data is different from the order in which the data was loaded, define a clustering key (see the sketch after this list)
  • Query pruning
    • The process of narrowing down a query to only read what is absolutely required to satisfy the query is referred to as ‘pruning’
    • The efficiency of pruning can be observed by comparing Partitions scanned and Partitions total statistics in the table scan operators within the query profile
    • The wider the gap between the number of partitions scanned and the total partitions, the better
  • The clustering depth of a table provides a value which tells you how many overlapping micro-partitions contain data for a given column in the table
    • The closer the number is to 1, the better the data is clustered
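
As a rough illustration, a clustering key can be defined and its effectiveness inspected like this (the table and column names are hypothetical):

```sql
-- Define a clustering key that matches the common query filter columns
ALTER TABLE sales CLUSTER BY (sale_date, region);

-- Inspect how well the table is clustered on those columns
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date, region)');

-- Average clustering depth for the table (closer to 1 is better)
SELECT SYSTEM$CLUSTERING_DEPTH('sales', '(sale_date, region)');
```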