The Rise of the Data Cloud - Book Review
The book “The Rise of the Data Cloud” tells the story behind Snowflake, a cloud-based data warehousing company.
It was almost a year ago that I read this book, and my experience was rather shallow as I had never worked on the Snowflake platform. In the past few months, I have started using Snowflake for a project and hence decided to reread the book. Rereading it after knowing how to work with Snowflake is a completely different experience. The terms explained and the historical significance of various architectural components of the platform were far more interesting this time around. In this brief blogpost, I will list a few points from various chapters of the book
The Big Idea
- The seeds of the company were developed and planted by Mike Speiser, a managing director at Sutter Hill Ventures
- Mike was the founding investor of Pure Storage, a company that pioneered replacing disk storage devices in enterprise data centers with faster, more reliable, and more compact solid-state storage devices called flash storage
- Mike thought there was a massive opportunity in getting traditional databases to run on flash
- Mike met Benoit Dageville, who said the idea was incremental and would not do anything dramatic about the current infrastructure problems
- Benoit was of the view that the right problems to solve were the cost and complexity of computing
- Teradata made a ton of money selling data warehouses that could handle large volumes of varied data types
- A startup might be able to double I/O performance in the database, but it might not be able to improve computational performance by a hundredfold
- Benoit was frustrated with his work at Oracle as most of his time was spent on fixing existing products
- Snowflake started in 2012; for the first six months, the founders locked themselves in a room and worked
- Two important architectural decisions
- Separate storage from compute. A company would maintain just one copy of all the data it collected, and any number of clusters of computers could be directed to access the same data at the same time (see the SQL sketch at the end of this list)
- charge for data in motion
- stored data would be highly compressed and inexpensive
- In the conventional scenario, many software applications and database queries tap into a single cluster of computers. It is done that way so that all of the users can access computing power concurrently, but the problem comes when too many users are online at the same time
- With Snowflake, it is possible to assign a big computing job exclusively to an entire cluster of computers, or even to more than one cluster. No resource sharing is required
- Named the company Snowflake because the founders enjoyed skiing and snow comes from the clouds
- Other important architectural decisions
- vectorized processing - fetching and processing multiple records at a time rather than one at a time
- automatic indexing: traditional databases require users to manually and explicitly specify indexing and partitioning strategies. Snowflake's micro-partitioning technique automatically partitions the data into smaller chunks that can be efficiently targeted
- simplicity
- No knobs to adjust
- Amazon was placing traditional database technology in the cloud, rather than re-architect for the cloud like Snowflake
- Sharing thoughts and feedback equates to collective noise cancellation
- Metadata is stored in FoundationDB, an open-source database
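To make the two big architectural decisions concrete, here is a minimal sketch in Snowflake SQL (the warehouse, database, and table names are hypothetical): two independently sized virtual warehouses query the same single copy of the data, and each suspends itself when idle, so storage stays cheap while compute is paid for only in motion.

```sql
-- Two independent virtual warehouses reading the same stored data.
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

CREATE WAREHOUSE IF NOT EXISTS etl_wh
  WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

-- A dashboard on reporting_wh and a load job on etl_wh can hit the
-- same table concurrently without contending for compute.
USE WAREHOUSE reporting_wh;
SELECT order_date, SUM(amount) FROM sales.public.orders GROUP BY order_date;
```

Likewise for automatic indexing: there is no index DDL to write, since Snowflake micro-partitions tables on its own. At most you can declare a clustering key as a hint and inspect how well it holds:

```sql
-- Optional hint about how micro-partitions should be organized.
ALTER TABLE sales.public.orders CLUSTER BY (order_date);

-- Report how well the table is clustered on that column.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales.public.orders', '(order_date)');
```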
Building the Company
- Smart move of charging for storage and compute separately: buy storage and compute wholesale and pass the costs along to end customers, wrapped in Snowflake's own pricing package
- Abstract pricing away from the number of physical nodes used to run a given task. Snowflake would sell computing credits up front that customers would use like money; the credit structure also enabled Snowflake to discount pricing creatively (see the credit example after this list)
- Snowflake can muster an enormous amount of computing power to get things done much faster
- Storage separate from compute, and virtual warehouses that spin up and down - that's how customers saw Snowflake
- Snowflake came out of stealth mode in 2014 and made its first version available in 2015
- In the initial years, it sold to large corporations such as Goldman Sachs and Capital One
- Data sharing and Data Exchange - key features
- Customers can join the marketplace and will not be charged storage fees
- Snowflake realized the power of partnerships - launching on Microsoft Azure in 2018 and on Google Cloud Platform in 2019
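A rough illustration of the credit model, going by Snowflake's commonly documented rates (treat the numbers as indicative): an X-Small warehouse consumes 1 credit per hour and each size up doubles the rate, so a Large warehouse at 8 credits per hour running for 30 minutes consumes about 4 credits, regardless of which physical nodes did the work. Actual consumption can be read back from the account usage views:

```sql
-- Credits consumed per warehouse over the last 7 days.
-- (ACCOUNT_USAGE views lag real time by up to a few hours.)
SELECT warehouse_name, SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits DESC;
```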
Snowflake Today
- Three stages of a company: embryonic - formative - scale
- A professional services team was formed at Snowflake in 2017
- Snowpipe (see the sketch at the end of this list)
- ETL and ELT
- All kinds of structured and semi-structured data ingestion are possible
- Hundreds of analytics companies have built products on top of Snowflake
- Snowflake marketplace
- enables you to discover, evaluate and purchase data
- there are listings for various products
- prioritized product listings
- sample data available from a listing's details
- sample queries to explore the data
- built on Snowgrid
- direct access to live, ready-to-query data, data services, and applications
- shared data is a pointer to the original data, not a copy
- consumption-based pricing
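A minimal Snowpipe sketch, assuming an external stage @raw_stage already points at cloud storage with event notifications configured (the object names are hypothetical):

```sql
-- Land semi-structured JSON into a single VARIANT column.
CREATE TABLE IF NOT EXISTS raw_events (payload VARIANT);

-- The pipe re-runs the COPY automatically as new files land in the stage.
CREATE PIPE IF NOT EXISTS events_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw_events
  FROM @raw_stage
  FILE_FORMAT = (TYPE = 'JSON');
```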
How Snowflake Uses Its Own Technology
- The Snowflake cloud data platform has become the tech foundation upon which the company runs its own business. Internally, it is called “Snowhouse”
- Snowhouse is used by finance, marketing, engineering, and many other divisions at Snowflake
The Data Economy and the Digital Enterprise
- Just as oil drove the economic progress in the twentieth century, data is driving progress today. On one hand, that’s because there is so much of it. Just as important, though, technological advances are driving progress, enabling people to transform data into insights and insights into actions that help businesses to thrive
- The combination of so much data and so many tools gives rise to the Data Economy
- The so-called information age got its start in 1959 with the invention of the MOS transistor at Bell Labs and flourished with the emergence of the personal computer
- IBM engineers designed the first commercial hard disk drive, released in 1956; it was nearly the size of a Volkswagen Beetle and had a storage capacity of 5 MB
- Each human genome contains 3 billion DNA base pairs and stores about 700 MB of data
- New way to think about data
- The first data circle runs along the boundaries of your enterprise. It's the data you collect from sales, inventories, customer interactions, and tracking of operations. Typically these are data silos
- The second circle contains data that is owned by business partners and can be shared
- The third ring is the data available from third party data vendors
- The fourth ring is still emerging. It’s data that has not been collected yet
- Centralized data
- copied datasets come with many problems
- they cannot be integrated readily
- storage and compute cannot be scaled independently
- There has been a move by enterprises to store all their data in “data lakes”. But a data lake is too much like a real lake: murky. It is hard to see what is in it and hard to hook the pieces of data you want. Think of the cloud data warehouse as a very large aquarium: it does the job of a data lake, storing all types of data, but in a way that makes the data more easily accessible
- Everyone can do what the big players of the Data Economy have been doing
- Fewer DBAs are required
- Democratization of data analysis - we are not talking about data scientists but about anyone who can fire off SQL queries (a large group in any enterprise)
- How do we make data insights credible?
- Data analytics in the Big Data era is like a mountain covered in trees that gets a lot of snow in winter. It’s an ideal spot for a ski resort. The resort developer is responsible for transforming a wilderness into a winter playground. It cuts down trees to make slopes and paths, builds lift systems, installs snow-making equipment, installs lights, and grooms the slopes after a snowfall. In any enterprise, data is the snow, employees are the skiers, and executives are the resort managers charged with getting people safely to the bottom of the slopes.
- Democratization of data
- Measure everything
The Power of Data Network Effects
- Starschema, a technology services company, picked up data from the Johns Hopkins dataset and made the entire collection available on Snowflake's public data exchange as its COVID-19 incidence dataset. Within days, hundreds of organizations were accessing the data
- Data network effects are the compounding benefits achieved by sharing data - combining diverse types of data from a variety of sources and making it available to others - one-to-one, one-to-many, or many-to-many
- The Data Economy is powered by data network effects
- The FTP protocol has been in use since 1971, and its transfers are frustratingly slow. In addition, after the recipients get the data they still have to transform it into usable formats to get any value out of it
- API calls also have a problem: they are limited in how much data you can transfer
- In both the FTP and API scenarios, copies are made of the data. That means the data is stored twice or many times, which has a cost. It also means the copies will likely become out of date and out of sync with the original data, and data governance rules might get compromised
- In data sharing, the data always resides in the owner's account. The consumer points their virtual warehouse at it and queries the data; there is no need to replicate it (see the SQL sketch at the end of this list)
- Data sharers can push a button and share with anyone, anywhere in the world
- It is like a million people being able to borrow the same book from the library at the same time
- Use cases for data sharing
- Silo buster - the connected enterprise
- Data as a service
- Data super-consumers
- Rather than shopping externally for data and importing it into their databases the old-fashioned way, they set up private cloud data exchanges and invite their data providers to interact with them there. Hedge funds are asking data providers to interact via a cloud data exchange
- Coalition
- SAAS sharing
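Here is a minimal sketch of the sharing mechanics described above, with hypothetical account and object names:

```sql
-- Provider side: publish a share; no data leaves the provider's account.
CREATE SHARE covid_share;
GRANT USAGE ON DATABASE covid TO SHARE covid_share;
GRANT USAGE ON SCHEMA covid.public TO SHARE covid_share;
GRANT SELECT ON TABLE covid.public.cases TO SHARE covid_share;
ALTER SHARE covid_share ADD ACCOUNTS = consumer_org;

-- Consumer side: mount the share as a read-only database and query it
-- with your own virtual warehouse. The data is a pointer, not a copy.
CREATE DATABASE covid_live FROM SHARE provider_org.covid_share;
SELECT COUNT(*) FROM covid_live.public.cases;
```

Because the consumer queries the provider's single copy in place, the “million people borrowing the same library book” effect falls out of the architecture rather than out of any copying machinery.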
Data Exchanges
- Coatue Management, a hedge fund based in New York, uses the Snowflake data exchange
- The killer app is the data exchange. The exchange is where all the capabilities come together, with the potential to help companies become truly data-driven and to grease the skids for commerce and collaboration between groups of companies
- Two types of exchanges
- Public exchange
- Game changer for data providers
- Make their products available to all or specific customers
- No copying or sending files
- All datasets are up to date
- An app store for data
- A private exchange enables organizations to create their own branded private version of a data marketplace
- Coatue uses this version
- The “owners” of these exchanges have complete control over which assets are listed, who has access to them, and when. A private exchange provides access to third-party data sources chosen by its owner, and enables the owner to access data from business partners and other sources that are not broadly available
- A private exchange has a catalog, too. But instead of showing every participant the entire universe of datasets, organizations see just the ones that are available to them
- You get a shopping catalogue containing only the data that fits you
- Provides a view into all its business activities, internal and external
- data supply chain
- Analogy of dial tone
- A key feature of telephony that helped early telephony scale was dial tone. That’s the signal sent by a public telephone exchange or private branch exchange to a telephone indicating that the exchange is working and is ready for them to call someone. At first, telephone exchanges were handled manually by operators who physically connected one terminal device to another on a network by moving plugs on a board. Later the exchanges became automated. They managed the dial tone and made the connections without human intervention
- Emerging trend - “data tone”
- The data exchanges, public and private, are the Bell Telephones and PBXs of the data universe. Before the exchange, parties sharing data had to manually connect to one another so that data could flow. The data exchanges enable not just one-to-one connections but “data conference calls”
- Snowflake breaks down traditional data boundaries by eliminating the need for data to change hands via APIs or more primitive forms of transfer
- The impulse to create a marketplace is as old as human civilization. Technology has played a key role in making modern marketplaces work.
- The idea now is to create new kinds of marketplaces for information to facilitate the trading of dollars for data.
- Data marketplace is a very big idea that has the potential to transform the global economy
Marketing
- Four deadly sins of marketing
- Not recognizing customers when you meet them
- Pitching customers products they bought weeks ago
- Not knowing how individuals like to interact
- Spam
- Use cases
- 360-degree view of a customer
- Attribution
- clean room technologies
- Campaign data feedback
- Braze - shares marketing data
- helps marketers orchestrate consumer experiences and conversations across all channels
- shares the data across channels
- real-time ingestion of data via Snowpipe
- uses 10,000 server computers per week
- Electronic Arts, a global leader in digital interactive entertainment
- uses cloud platform to update the content of games if marketers spot opportunities to improve customer enjoyment
Media, Advertising and Entertainment
- Use cases
- Understand your content and audience
- Respond to changing consumer demands
- Protect personal privacy
- Data is the currency of advertising
- Wunderman Thompson
- Ability to manage cloud stewardship
Retail and Consumer Packaged Goods
- J.Sainsbury PLC using Snowflake
- Identified three types of internal users - data scientists, professional analysts, and “citizen analysts”
- Social networks and always-on personal technologies have shifted the power almost entirely to the consumer
- Scaling compute
- You get a massive Ferrari of a system when you need it, but it goes away when you don't
- Office Depot's move to Snowflake
- Retailer-supplier private data exchanges
Financial Services
- Eno, Capital One's fintech app
- Financial services companies are effectively data and analytics companies
- Segments using Snowflake
- Retail and commercial banking - identify branches and regions that are underperforming
- Investments and asset management - Managing internal and external data
- Insurance - Improve their ability to spot fraud
- Lending - incorporating diverse data sources for credit scoring
- Fintech - customer facing apps on top of Snowflake
- Capital One's shift to Snowflake
- Chime, a fintech, uses Snowflake
- Coatue, an asset management company, uses a private marketplace to manage its data
Government and Education
- Government agencies face a conundrum. They routinely collect and store massive quantities of information, yet they typically do not possess the most up-to-date technologies for data management
- Why Snowflake?
- There are no limits
- Easier to share data with sister agencies
- Simple and cost effective
- Cloud is faster
- Government agencies follow a rigorous multistep process before the keys to the cloud can be handed over
- Fraud, waste, and abuse plague governments
- State agencies
- using IoT for automated highway toll tracking and billing systems.
- attaching network sensors to devices and physical structures to monitor activities where it is unfeasible for humans to do the work
- Opportunities to spot and quickly react to safety hazards
- Municipalities
- Smart city initiatives
- Education
- Improve services for students, staff and alumni
Healthcare and Life Sciences
- Use cases for Snowflake
- 360 degree view of the patient
- Precision medicine - treat individuals in a personalized way
- Bringing new drugs to market
- Some of the research initiatives will reside on the private exchange platform
- Healthcare is in the early stages of a data revolution, one that will transform the diagnosis and treatment of patients, the discovery of drugs, and the marketing of drugs to the people who will benefit from them
The Democratization of Data
- 20,000 people are registered as members of the Snowflake community
- The company rewards individuals who voluntarily contribute to knowledge-sharing forums and community blogs
Data for Good
- Apollo Agriculture use case
- Creditworthiness scoring using ML
- Kiva use case
- crowdfunding loans
- personalize the lender experience
- Kiva Capital
Digital Transformation
- Two major pathways for innovation
- Automation
- Providing more powerful tools to help people extract insights from data
- Advances will come that put data science capabilities in the hands of analysts and even business unit managers - the so-called citizen data scientist
- Matillion is innovating with a different approach - ELT. In this scenario, the data is reformatted after it arrives in its more permanent home (see the sketch at the end of this list).
- Matillion sees three trends
- key focus remains on increasing the sophistication of the tools aimed at the company’s current users, the data engineers and data scientists
- making it easy for customers to see and manage their data even when it is stored and running in more than one public cloud
- making the products even easier to use so that people within their customers' organizations who aren't computer scientists can curate data
- The datasets that DataRobot’s technology uses to train the models are stored and managed in Snowflake’s cloud data platform
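To contrast ELT with traditional ETL in Snowflake terms, here is a hedged sketch (generic Snowflake SQL, not Matillion's actual product; the stage and column names are hypothetical). The raw data lands first, and the transformation runs afterwards inside the warehouse:

```sql
-- E + L: land the raw JSON as-is, untransformed.
CREATE TABLE IF NOT EXISTS raw_orders (payload VARIANT);
COPY INTO raw_orders FROM @landing_stage FILE_FORMAT = (TYPE = 'JSON');

-- T: transform after loading, using the warehouse's own elastic compute.
CREATE OR REPLACE TABLE orders AS
SELECT payload:order_id::NUMBER   AS order_id,
       payload:customer::STRING   AS customer,
       payload:amount::FLOAT      AS amount,
       payload:ts::TIMESTAMP_NTZ  AS order_ts
FROM raw_orders;
```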
Data is Power
- Volume plus velocity is competitive advantage
- There is a parallel in the data world to the time value of money: the time value of data. Information is more valuable when it arrives rapidly and can be acted upon at once. The longer it takes for data to arrive and be processed, the less valuable it becomes
- Data sharing fuels the data economy