Snowflake Overview - Learnings
This blog post is a brief summary of my learnings from a few videos that discuss the high-level architecture of Snowflake
Snowflake Architecture
- Snowflake is a data warehouse that runs entirely in the cloud. It is available on AWS, Azure, and GCP
- It is not a traditional transactional SQL database
- There is no PK/FK enforcement
- Snowflake SQL allows (sketched in the example below)
- DDL
- SQL Functions
- UDF
- Views
- Allows connections from a set of data integration tools
- Self-service BI tools
- Big Data tools - Kafka
- JDBC/ODBC drivers
- Native connectors for Python, Go, and Node.js
- SQL Interface and Client
- Snowflake Web interface
- Snowflake CLI
- DBeaver
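To make those SQL capabilities concrete, here is a minimal sketch of DDL, a UDF, and a view; all object names (orders, tax, order_totals) are made up for illustration:

```sql
-- DDL: plain CREATE TABLE works as in any SQL warehouse
CREATE TABLE orders (
    order_id NUMBER,
    amount   NUMBER(10, 2)
);

-- A simple SQL UDF (user-defined function)
CREATE FUNCTION tax(amount NUMBER(10, 2))
    RETURNS NUMBER(10, 2)
    AS 'amount * 0.08';

-- A view combining the table and the UDF
CREATE VIEW order_totals AS
    SELECT order_id, amount + tax(amount) AS total
    FROM orders;
```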
- What makes Snowflake different?
- There are not many knobs to tune for database performance
- Tunable pay-per-use pricing
- Separation of storage and compute
- Snowflake Database layer
- Create Database
- Create Schema
- Create Table
- Load Data in Tables
- All tables are stored in S3
- Creating databases, schemas, and tables costs nothing by itself; you pay compute costs only while compute runs
- If a data load job takes one hour, you pay for one hour
- If your queries take 3 hours per day, you pay for 3 hours each day
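As a sketch of that workflow, assuming files have already been uploaded to a stage named my_stage (all names here are hypothetical):

```sql
CREATE DATABASE sales_db;
CREATE SCHEMA sales_db.raw;

-- A single VARIANT column is a common landing pattern for JSON
CREATE TABLE sales_db.raw.events (payload VARIANT);

-- The load itself is what consumes (and bills) warehouse compute
COPY INTO sales_db.raw.events
FROM @my_stage
FILE_FORMAT = (TYPE = 'JSON');
```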
- Snowflake Compute layer
- Like VMs, there are Virtual Warehouses (VWs) in T-shirt sizes
- Creating a VW does not cost anything
- You can create a VW for each specific task that you want to do
- Pay compute cost when a VW is running
- Multiple concurrent queries can be submitted to a VW
- Additional queries will be queued if the VW is busy
- Multi-cluster VWs support auto-scaling
- As query load rises and falls, a multi-cluster VW adds or removes clusters on demand
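A minimal sketch of creating such warehouses; the names and settings here are illustrative:

```sql
-- Creating a warehouse is free; you pay only while it runs
CREATE WAREHOUSE etl_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND   = 60      -- suspend after 60 s idle, stopping the meter
    AUTO_RESUME    = TRUE;   -- wake up automatically when a query arrives

-- A multi-cluster warehouse adds or removes clusters as load changes
CREATE WAREHOUSE bi_wh
    WAREHOUSE_SIZE    = 'MEDIUM'
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4;
```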
- Three layers
- Storage Layer
- Compute Layer
- Service Layer
- Metadata
- Security
- Infra and DB
- Optimizer
- Main cost is storage and compute
- Separation of storage and compute is available in several products
- Snowflake is not unique just because of storage/compute separation
- Snowflake uses S3 for database storage. S3 challenges:
- I/O latency
- CPU overhead
- Files can only be overwritten, not updated in place
- Object storage
- Features of S3
- High availability
- Durability
- API to read parts of an object (byte-range reads)
- A powerful feature if you have an index, so you can read exactly what you need
- If you have a large table, it is broken into micro-partitions of 50 MB-500 MB of uncompressed data
- Reorganize the data so that it is columnar
- Compress the data
- Header contains metadata such as unique values, range and column offsets
- Metadata and stats about the micro-partitions in S3
- Instead of reading a complete file, read only those values that your query wants to read
- Part of the file is read using an offset and a length
- Snowflake stores all the metadata about each partition
- It knows in which partitions the desired data is stored
- Generic partitioning works out of the box
- When you already know the commonly used query patterns, you can specify clustering keys yourself
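A minimal sketch of declaring a clustering key, assuming a hypothetical sales table that is usually filtered by sale_date:

```sql
-- Cluster on the column most queries filter by
ALTER TABLE sales CLUSTER BY (sale_date);

-- A filter on the clustering key lets Snowflake prune: it reads only
-- the micro-partitions whose min/max metadata can match the predicate
SELECT COUNT(*)
FROM sales
WHERE sale_date = '2022-01-15';
```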
How is Snowflake different from AWS?
- Compute and storage are completely separate, and the storage cost is the same as storing the data on S3. AWS offers this with Redshift Spectrum, but it is not as seamless as with Snowflake
- Snowflake is a data platform and data warehouse that supports the most common standardized version of SQL. This means that all of the most common operations are usable within Snowflake
- Snowflake has advantages over NoSQL databases like Cassandra
- As Snowflake loads semi-structured data, it records metadata that is then used in query planning and execution, providing optimal performance and allowing semi-structured data to be queried with common SQL (see the sketch after this list)
- Snowflake provides a data warehouse that is faster, easier to use, and far more flexible, thanks to its unique architecture
- Built on top of AWS. No hardware, virtual or physical, to select, install, and configure
- Each snowflake is shaped differently
- Snowflake and ETL tools: Snowflake supports both ETL and ELT, and works with a wide range of data integration tools
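As a sketch of querying semi-structured data with common SQL (the raw_events table and the JSON shape are assumptions for illustration):

```sql
CREATE TABLE raw_events (v VARIANT);

-- Path notation pulls fields straight out of the loaded JSON,
-- using the metadata Snowflake recorded at load time
SELECT
    v:user.id::NUMBER    AS user_id,
    v:event_type::STRING AS event_type
FROM raw_events
WHERE v:event_type::STRING = 'click';
```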
Why does everyone care about Snowflake?
- Why is everyone competing with it or wanting to partner with it?
- In the dark ages, getting a data warehouse was time-consuming
- Enter the cloud - Redshift
- Very popular option
- Lots of complexity, but better than on-prem solutions
- Basic changes to the data needed more steps
- Hortonworks - managed Hadoop
- Snowflake did a lot of things well when it came to market
- Offered virtual warehouse
- Storage and compute separate
- No need to resize the warehouse
- No need to spend time on migrating
- Marketing and sales were amazing
- Opened up the world of data to small companies
- Cloud VW - easy to spin up
- The average Snowflake customer spends about 160k USD per year
- Snowflake acquired Streamlit for 800 million dollars in March 2022
- Data platform providers becoming data analytics companies
Snowflake architecture
- Separation of storage and compute
- The compute layer is broken down into virtual warehouses
- Several warehouses can access the data
- Easy to scale up or scale down the virtual warehouses
- The Cloud Services layer does all the behind-the-scenes work of a data warehousing system
Should you switch to Snowflake?
- BigQuery (GCP), Synapse (Azure), Redshift (AWS) - all are tied to a single cloud provider
- Snowflake is cloud agnostic
- Hadoop needs an infrastructure
- Snowflake needs a small team
- Hybrid of shared-disk and shared-nothing architectures
- Snowpipe - continuous data ingestion (ELT); see the sketch after this list
- File storage on an existing cloud platform
- Many ways to load data
- Quality-of-life improvements in data warehousing
- Testing by switching roles
- Monitoring and alerting
- Time Travel
- Task Scheduler
- No separate SQL agent needed
- Performance Tuning
- Handles structured and semi-structured data
- Environment setup
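A minimal sketch of a few of these quality-of-life features; all object names (events, my_stage, etl_wh, nightly_rollup, daily_counts) are hypothetical:

```sql
-- Snowpipe: continuously load files as they land in a stage (ELT)
CREATE PIPE events_pipe AUTO_INGEST = TRUE AS
    COPY INTO events FROM @my_stage FILE_FORMAT = (TYPE = 'JSON');

-- Time Travel: query the table as it looked one hour ago
SELECT * FROM events AT (OFFSET => -3600);

-- Task scheduler: cron-style scheduling with no external SQL agent
CREATE TASK nightly_rollup
    WAREHOUSE = etl_wh
    SCHEDULE  = 'USING CRON 0 2 * * * UTC'
AS
    INSERT INTO daily_counts
    SELECT event_time::DATE, COUNT(*) FROM events GROUP BY 1;
```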
Understanding Snowflake Data Platform for beginners
- Historically, storage and compute were tightly coupled. You couldn't have one without the other
- This leads to under- or over-provisioning
- Decoupling storage and compute
- The architecture can be broken down into
- Database storage - reorganizes structured and semi-structured data into an internally optimized, compressed, columnar format
- Query processing - Snowflake uses compute and memory resources provided by virtual warehouses to execute and process queries
- Cloud services - a collection of supporting services that coordinate activities across the platform, from user login to query optimization
- Think of the architecture as a postal service
- Database Storage: Think of it as Mail room
- Some parcels are large, some parcels are small
- Compute: Courier - Delivery man
- It handles queries
- Virtual warehouse
- Delivery mechanism - sometimes you need a small car, sometimes you need a heavy vehicle
- Cloud Services: The boss who keeps the system running
- Database Storage: Think of it as Mail room
- Data is stored in micro-partitions
- 50 MB to 500 MB blocks of storage
- These blocks sit on the underlying cloud provider's data store, be it S3/GCS/Blob storage
- Incoming data is divided and mapped into micro-partitions
- compress the data
- metadata stored
- Benefits of micro-partitions
- There could be millions of micro-partitions
- The granularity brings additional flexibility, allowing for finer-grained query pruning
- The metadata associated with the micro-partitions allows Snowflake to optimize the most expensive parts of data processing
- The job of the query processing layer is to execute the queries parsed and planned by the Cloud Services layer
- VW
- Bundle of compute resources and memory
- You need a VW to perform pretty much anything
- Caching in the VW: data read from storage is cached locally, speeding up repeated queries
- Cloud Services
- Ties everything together
- Brains of the operation
- Authentication
- Infrastructure management
- Metadata management
- Query parsing and execution
- Access Control
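A minimal sketch of the access-control side of the services layer, which also shows the role-switching testing trick mentioned earlier; role and object names are made up, and this assumes the role has been granted to your user:

```sql
CREATE ROLE analyst;
GRANT USAGE  ON DATABASE sales_db     TO ROLE analyst;
GRANT USAGE  ON SCHEMA   sales_db.raw TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.raw TO ROLE analyst;

-- Switch roles in the session to test what that role can see
USE ROLE analyst;
SELECT * FROM sales_db.raw.events LIMIT 10;
```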