AWS DynamoDB - Misadventures
This post contains a brief note on my misadventures with AWS DynamoDB.
A few days ago, I had to pump 150 million documents into a NoSQL database. For some reason, I thought that AWS DynamoDB was the best choice on AWS. There were several roadblocks that I had to face, as I had to learn the various restrictions on a corporate cloud.
IAM
In my previous experience, I never had to deal much with policy and role restrictions at all. In the new environment, there are a ton of restrictions that I had to work with. This meant that for every service that I intended to use, I had to create an AWS policy, structure the relevant policy document and then create a relevant role or attach it to an existing role. This in itself was a big learning experience. Navigating the policy documents and assigning them to various resources was a very tedious task, and I had no choice but to go through this process. The learning involved going through AWS documentation, looking at various policy documents, understanding their structure and finally getting it to work in the internal environment.
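For illustration, the flow of creating a scoped policy and attaching it to a role via boto3 looked roughly like the sketch below. The policy name, role name, account ID, actions and table ARN are all placeholders, not the actual internal values.

```python
import json

import boto3

iam = boto3.client("iam")

# Hypothetical policy granting just enough DynamoDB access for a bulk load.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:DescribeTable",
                "dynamodb:PutItem",
                "dynamodb:BatchWriteItem",
            ],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/documents",
        }
    ],
}

response = iam.create_policy(
    PolicyName="documents-ingest-policy",   # naming convention applies here too
    PolicyDocument=json.dumps(policy_document),
)

# Attach the new policy to an existing role (role name is a placeholder).
iam.attach_role_policy(
    RoleName="data-ingestion-role",
    PolicyArn=response["Policy"]["Arn"],
)
```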
Naming Convention
The internal corporate environment has specific naming conventions that need to be followed in creating and managing various AWS resources. Also, every resource that is created has to be tagged. This means that one cannot be sloppy with using any AWS resource. I guess this was tedious to follow, but in the overall scheme of things, it is better to be systematic than sloppy, especially when the environment is going to be used by many.
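As a small example, tagging an existing DynamoDB table via boto3 is only a few lines; the ARN and the tag keys/values below are placeholders for whatever the internal convention mandates.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Tag an existing table so it complies with the internal tagging policy.
# The ARN and tag keys/values are placeholders.
dynamodb.tag_resource(
    ResourceArn="arn:aws:dynamodb:us-east-1:123456789012:table/documents",
    Tags=[
        {"Key": "team", "Value": "analytics"},
        {"Key": "cost-center", "Value": "12345"},
        {"Key": "environment", "Value": "dev"},
    ],
)
```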
DynamoDB
I have had very limited experience with DynamoDB before and hence I had to go through the usual route of understanding the CRUD operations via the CLI and Python. There are several aspects of DynamoDB that are very specific to this NoSQL database. Unless one spends time understanding these basics, there is always a chance of messing up. Inherently, messing up is good, as one tends to learn a lot. However, there are obvious things that one must learn so that glaring mistakes can be avoided in the first place.
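For reference, the basic CRUD round trip via boto3's resource layer is only a few lines; the table, key and attribute names below are purely illustrative.

```python
import boto3

# The resource layer marshals plain Python types into DynamoDB's wire format.
table = boto3.resource("dynamodb").Table("documents")   # hypothetical table

# Create / update an item.
table.put_item(Item={"doc_id": "abc-123", "created_at": "2020-01-01", "title": "hello"})

# Read it back using the full primary key (partition key + sort key).
response = table.get_item(Key={"doc_id": "abc-123", "created_at": "2020-01-01"})
print(response.get("Item"))

# Delete it.
table.delete_item(Key={"doc_id": "abc-123", "created_at": "2020-01-01"})
```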
Terminology
Understood the terminology used (a minimal table-creation sketch follows this list):
- Primary Key/ Partition Key
- Sort Key
- Local Secondary Index
- Global Secondary Index
- Read Capacity Unit
- Write Capacity Unit
- Auto-scaling - Pros and Cons
- Items
- Throttling and Exponential back-off
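To make these terms concrete, here is a minimal table-creation sketch using boto3, assuming provisioned capacity mode; the table name, key names, index and capacity numbers are all made up for illustration.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Illustrative table: "doc_id" is the partition key, "created_at" the sort key,
# and a global secondary index allows querying by "source".
dynamodb.create_table(
    TableName="documents",
    AttributeDefinitions=[
        {"AttributeName": "doc_id", "AttributeType": "S"},
        {"AttributeName": "created_at", "AttributeType": "S"},
        {"AttributeName": "source", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "doc_id", "KeyType": "HASH"},       # partition key
        {"AttributeName": "created_at", "KeyType": "RANGE"},  # sort key
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "source-index",
            "KeySchema": [{"AttributeName": "source", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 100},
        }
    ],
    # RCU/WCU numbers are placeholders; they matter a lot for bulk ingestion.
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 100},
)
```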
AWS Services
The following summarizes some of the learnings while experimenting with AWS services:
- Understood the CLI commands and Python's boto3 library for DynamoDB - the CLI is good enough for experimentation, but one cannot ingest millions of items using the AWS CLI
- Tried my hand at learning boto3 and soon realized that one needs to carefully pick the provisioned capacity of the table so that it works well with batch operations. There is no point in sending a batch request when there is meager provisioned capacity
- Understood the various settings for capacity such as RCU, WCU and auto-scaling
- Batch operations via boto3 seemed slow and hence I was curious to look at AWS Data Pipeline. AWS Data Pipeline had another sort of learning curve. First came the IAM permissions. Had to work through the right policy statements to get the relevant privileges to create the resource. Once the data pipeline was created, it became apparent that the usual JSON structure of the documents cannot be used. I had to convert millions of documents into DynamoDB-compatible JSON (a minimal conversion sketch appears after this list). This obviously meant that I had to use AWS Glue, and that meant that I had to brush up on my PySpark syntax.
- Brushing up on PySpark syntax meant that I needed to have a machine in the first place. Thanks to my recent machine crash, my Spark environment had evaporated. Hence I had to spend some time recreating a Spark environment on my machine. That led to a dead end again, as the configuration was not good enough to test out the JSON conversion
- The next place to look was AWS EMR. I have had experience with EMR previously, but the IAM permissions were something that held me back from quick prototyping. Created the relevant policies and roles and managed to spin up a cluster for prototyping on a Zeppelin notebook
- Once I had Zeppelin set up for Spark, it was easy to revisit some basic data munging syntax. However, Zeppelin is not as user friendly as a Jupyter notebook. More specifically, the keyboard shortcuts are pretty painful to get used to.
- Managed to create a script that does the JSON conversion
- Next stop was AWS Glue. Same story here. Most of the time was spent on setting up IAM permissions. Managed to use Glue and convert all the relevant documents
- Tried using Data Pipeline for a small sample and for a reasonably large dataset. Realized that it wasn't cut out for my job. It was more like a black box and I had no clue why some of the jobs were failing
- I turned to AWS Glue, which seems to have a new sink connector to DynamoDB
- In order to quickly prototype, I had to create a notebook instance where I could work with the native Glue library from AWS. Ended up creating a SageMaker endpoint and a SageMaker notebook. However, this again led to a dead end.
- Finally, all the options of using existing AWS services had closed, i.e. Data Pipeline did not work out for me and AWS Glue did not work out for my case.
- In the process of working with many AWS services that require data to be in a specific format, in a specific folder, etc., there were a ton of S3 bucket operations that I had to do in order to get the data into the right shape. These tasks, of course, were accomplished with a combination of boto3 and Glue
- Was left with no choice but to go with the traditional boto3 library and scale up the WCU for the table. For this I had to spin up an EC2 instance and write a simple batch script that ingests documents (a sketch of such a script appears after this list). These documents were of course not the raw documents, but processed in such a way that they are amenable for use with the boto3 library
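For reference, DynamoDB-compatible JSON wraps every attribute value in a type descriptor (S, N, BOOL, M, L and so on). The actual conversion ran as a PySpark job, but the idea is captured by this plain-Python sketch; the helper and the sample document are purely illustrative.

```python
# Minimal sketch of converting a regular JSON document into
# DynamoDB-compatible JSON (attribute values wrapped in type descriptors).
# boto3 ships a similar helper: boto3.dynamodb.types.TypeSerializer
# (which expects Decimal instead of float).

def to_dynamodb_json(value):
    """Recursively wrap a Python value in DynamoDB's type descriptors."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):          # check bool before int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}         # numbers are transmitted as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, list):
        return {"L": [to_dynamodb_json(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_dynamodb_json(v) for k, v in value.items()}}
    raise TypeError(f"Unsupported type: {type(value)}")


doc = {"doc_id": "abc-123", "score": 4.2, "tags": ["nosql", "aws"], "archived": False}
item = {k: to_dynamodb_json(v) for k, v in doc.items()}
# item == {"doc_id": {"S": "abc-123"}, "score": {"N": "4.2"},
#          "tags": {"L": [{"S": "nosql"}, {"S": "aws"}]}, "archived": {"BOOL": False}}
```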
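The eventual boto3-based ingestion looked roughly like the sketch below: scale up the WCU, then stream documents through a batch writer. This is a minimal sketch assuming provisioned capacity mode; the table name, file name, key name and capacity numbers are placeholders, not the actual values used.

```python
import json
from decimal import Decimal

import boto3

TABLE_NAME = "documents"            # hypothetical table name
client = boto3.client("dynamodb")
table = boto3.resource("dynamodb").Table(TABLE_NAME)

# Scale up write capacity before the bulk load (numbers are illustrative;
# assumes the table uses provisioned billing mode).
client.update_table(
    TableName=TABLE_NAME,
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 3000},
)
# The table_exists waiter polls until the table is ACTIVE again.
client.get_waiter("table_exists").wait(TableName=TABLE_NAME)

# batch_writer() buffers 25 items per request and automatically resubmits
# unprocessed items; overwrite_by_pkeys de-duplicates by key within a buffer.
with table.batch_writer(overwrite_by_pkeys=["doc_id"]) as batch:
    with open("documents.jsonl") as fh:      # one JSON document per line
        for line in fh:
            # The resource layer rejects floats, so parse numbers as Decimal.
            item = json.loads(line, parse_float=Decimal)
            batch.put_item(Item=item)
```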
What happened finally?
Given all the data wrangling and the AWS service hopping that I had done, a happy conclusion would have been a successful ingestion of 150 million documents. Sadly, this adventure turned into a misadventure. There was a size restriction of 40KB per item in DynamoDB that I had overlooked. Somehow I was under the impression that 40KB was good enough, until I really checked the dataset. Realized that there are a sizable number of documents that don't fit the 40KB item size requirement. I also had a secondary index. That meant that I was actually allowed less than 20KB per item.
Ultimately I had to abandon my intention of loading millions of documents into DynamoDB. A ton of lessons were learned in the last few days. This post will serve as a reminder to me of all the things that didn't work out for my use case.