This blogpost is a brief summary of my learnings from a few videos that talk about high-level architecture of “Snowflake”

Snowflake Architecture

Snowflake is a date warehouse that runs entirely on cloud. It is available in AWS, Azure and GCP
It is not a SQL database
There is no PK/FK
Snowflake SQL allows
- DDL
- SQL Functions
- UDF
- Views
Allows connection from a set of data integration tools
Self service BI tools
Big Data tools - Kafka
JDBC/OCBC drives
Native connections for Python , Go, Node.js
SQL Interface and Client
- Snowflake Web interface
- Snowflake CLI
- DBeaver
What makes snowflake different ?
- There are no multiple knobs for database performance
- Tunable pay per use
Separation of storage and compute
- Snowflake Database layer
  - Create Database
  - Create Schema
  - Create Table
  - Load Data in Tables
  - All tables are stored in S3
- You pay compute costs for creating database, schema and table structure
- Data Load job takes one hour - You pay for one hour
- If your queries take 3 hours per day - You pay 3 hours each day
Like VMs there are Virtual Warehouses in tshirt sizes
Creating VW does not cost anything
- You can create VW based on specific tasks that you want to do
Pay compute cost when a VW is running
Multiple concurrent queries can be submitted to a VW
Additional queries will be queued if the VW is busy
Multi-cluster VW for auto scaling
- If you start a query in VW, it can scale up or scale down on demand
Three layers
- Storage Layer
- Compute Layer
- Service Layer
  - Metadata
  - Security
  - Infra and DB
  - Optimizer
- Main cost is storage and compute
Separation of storage and compute is available in several products
Snowflake is not unique just because storage and compute
Snowflake uses S3 for database usage. S3 challenges
- I/O Latency
- CPU overhead - Only overwrite files - No updates
- Object storage
Features of S3
- High availability
- Durability
- API to read data parts
  - powerful feature if you have an index to read exactly what you need.
If you have a large table, then the table is broken in to micro partitions 50MB-500M of uncompressed data
- Reorganize the data so that the data is columnar
- Compress the data
- Header contains metadata such as unique values, range and column offsets
Metadata and stats about micro partitions in S3
Instead of reading a complete file, read only those values that your query wants to read
- Part of file using offset and length
Snowflake stores all the metadata about each partition
- It knows which partition the desired data is stored
Generic partitioning works
When you already know the commonly used column patterns, then you can specify the clustering keys

How is snowflake different from AWS ?

Compute and storage are complete and the storage cost is the same as storing the data on S3. AWS does it with Redshift Spectrum but it is not as seamless as with Snowflake
Snowflake is a data platform and data warehouse that supports the most common standardized version of SQL. This means that all of the most common operations are usable within Snowflake
Snowflake has advantages over NOSQL databases like Cassandra
- As snowflake loads semi-structured data, it records metadata which then used in query plans and query executions, providing optimal performance and allowing for the querying of semi-structured data using common SQL
Snowflake provides a data warehouse that is faster, easier to user, and far more flexible. Unique architecture
Built on the top of AWS. No hardware virtual or physical to select, install and configure
Each shape of snowflake is differently
Snowflake and ETF tools snowflake supports both ETL and ELT. Snowflake works with a wide range of data integration tools

Why everyone cares about snowflake?

Why is everyone competing with it or want to partner with it ?
In the dark ages, getting a data warehouse was time consuming
Enter the cloud - RedShift
- Very popular option
- Lot of complexity - Better than On prem solution
- Basic changes of data needed more steps
Hortonworks - Managed hadoop
Snowflake did a lot of things well when it came to market
- Offered virtual warehouse
- Storage and compute separate
- No need to resize the warehouse
- No need to spend time on migrating
Marketing and sales were amazing
Opened up the world of data to small companies
Cloud VW - easy to spin up
Average cost incurred by snowflake customer is 160k usd per year
Snowflake acquired streamlit for 800 million dollars in Mar 2022
Data platform providers becoming a data analytics company

Snowflake architecture

Separation of storage and compute
Compute layer is broken down in to virtual warehouses
Several warehouses can access the data
Easy to scale up or scale down the virtual warehouses
Cloud Services layer does all the work of data warehousing systems

Should you switch to snowflake ?

Big Query (GCP) - Synapse(Azure) - Redshift(AWS) - All are tied to a cloud provider
Snowflake is cloud agnostic
Hadoop needs an infrastructure
Snowflake needs a small team
Hybrid of Shared disk and Shared nothing architecture
Snowpipe - ELT
Filestorage on a existing cloud platform
Many ways to
Quality of life improvements in the data warehousing
Testing by switching the role
Monitoring and alerting
Time Travel
Task Scheduler
No SQL agent
Performance Tuning
Handles structured and semi-structured data
Environmental set up

Understanding Snowflake Data Platform for beginners

Historically storage was tightly couples. You couldn’t have one without the other
This leads to under or over provisioning
Decoupling storage and compute
Architecture can be broken down in to
- Database storage - Reorganizes structured and unstructured data into its internally optimized, compressed, columnar storage
- Query processing - Snowflake uses compute and memory resource provided y virtual warehouses to execute and process queries
- Cloud services - a collection of supporting services which coordinates activities across the platform, from users logging to query optimization
Think of architecture as a postal service
- Database Storage: Think of it as Mail room
  - Some parcels are large, some parcels are small
- Compute: Courier - Delivery man
  - It handles query
- Virtual warehouse
  - Delivery mechanism - sometimes you need a small car, sometimes you need a heavy vehicle
- Cloud Services: The boss who keeps the system running
Data is stored in micro partitions
- 50MB to 500 MB blocks of storage
- These blocks sit on the underlying cloud provider’s data store be on S3/GCS/Blog storage
- Divide and map the incoming data in to micro partitions
- compress the data
- metadata stored
Benefits of micro partitions
- There could be millions of micro partitions
- Granularity brings additional flexibility allowing for finer grain query pruning
- Metadata associated with the micro partitions allows snowflake to optimize the most expensive area of data processing
The Job of query processing services it to take queries and pass them to the Cloud services layer
VW
- Bundle of compute resources and memory
- You need a VW to perform pretty much anything
Caching feature of VW
Cloud Services
- Ties everything together
- Brains of the operation
- Authentication
- Infrastructure management
- Metadata management
- Query parsing and execution
- Access Control