This post contains a brief note on my misadventures with AWS DynamoDB.

A few days ago, I had to pump 150 million documents into a NoSQL database. For some reason, I thought that AWS DynamoDB was the best choice available on AWS. There were several roadblocks along the way, as I had to learn the various restrictions of a corporate cloud environment.

IAM

In my previous experience, I had never had to deal much with policy and role restrictions. In the new environment, there are a ton of restrictions that I had to work with. This meant that for every service that I intended to use, I had to create an AWS policy, structure the relevant policy document and then create a new role or attach the policy to an existing role. This in itself was a big learning experience. Navigating the policy documents and assigning them to various resources was a tedious task, and I had no choice but to go through this process. The learning involved going through AWS documentation, looking at various policy documents, understanding their structure and finally getting it to work in the internal environment.
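
As a rough illustration of what that involved, here is a minimal sketch of creating a policy and attaching it to a role with boto3. The policy name, role name and table ARN below are placeholders, not the ones used in the internal environment.

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical policy document granting basic DynamoDB access to one table
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:PutItem",
                "dynamodb:GetItem",
                "dynamodb:BatchWriteItem",
                "dynamodb:DescribeTable",
            ],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/docs",
        }
    ],
}

# Create the policy and attach it to an existing role (names are placeholders)
response = iam.create_policy(
    PolicyName="dynamodb-ingest-policy",
    PolicyDocument=json.dumps(policy_document),
)
iam.attach_role_policy(
    RoleName="ingest-role",
    PolicyArn=response["Policy"]["Arn"],
)
```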

Naming Convention

The internal corporate environment has specific naming conventions that need to be followed when creating and managing various AWS resources. Also, every resource that is created has to be tagged. This means that one cannot be sloppy with using any AWS resource. I guess this was tedious to follow, but in the overall scheme of things it is better to be systematic than sloppy, especially when the environment is going to be used by many.
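
For instance, tagging an existing table with boto3 looks roughly like this; the ARN and the tag keys and values are placeholders, not the actual internal convention.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Tag an existing table (ARN and tag keys/values are placeholders)
dynamodb.tag_resource(
    ResourceArn="arn:aws:dynamodb:us-east-1:123456789012:table/docs",
    Tags=[
        {"Key": "project", "Value": "doc-ingest"},
        {"Key": "owner", "Value": "data-team"},
    ],
)
```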

DynamoDB

I had very limited experience with DynamoDB before, and hence I had to go through the usual route of understanding the CRUD operations via the CLI and Python. There are several aspects of DynamoDB that are very specific to this NoSQL database. Unless one spends time understanding these basics, there is always a chance of messing up. Inherently, messing up is good, as one tends to learn a lot. However, there are obvious things that one must learn so that glaring mistakes can be avoided in the first place.
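
A minimal sketch of the kind of CRUD calls I experimented with, using boto3's resource interface. The table name and attributes are made up for illustration, assuming a table with a simple partition key `doc_id`.

```python
import boto3

# Illustrative table with a single partition key "doc_id"
table = boto3.resource("dynamodb").Table("scratch")

# Create / overwrite an item
table.put_item(Item={"doc_id": "doc-1", "source": "crawler", "body": "..."})

# Read it back
item = table.get_item(Key={"doc_id": "doc-1"}).get("Item")

# Update a single attribute
table.update_item(
    Key={"doc_id": "doc-1"},
    UpdateExpression="SET #s = :s",
    ExpressionAttributeNames={"#s": "source"},
    ExpressionAttributeValues={":s": "api"},
)

# Delete it
table.delete_item(Key={"doc_id": "doc-1"})
```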

Terminology

Understood the terminology used (a sketch of how most of these show up in a table definition follows the list):

  • Primary Key/ Partition Key
  • Sort Key
  • Local Secondary Index
  • Global Secondary Index
  • Read Capacity Unit
  • Write Capacity Unit
  • Auto-scaling - Pros and Cons
  • Items
  • Throttling and Exponential back-off
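
Most of these terms show up directly in a table definition. Here is a sketch of creating a table with a partition key, a sort key, a global secondary index and explicit RCU/WCU, assuming an illustrative schema rather than the one I actually used.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Illustrative schema: partition key + sort key, one GSI, provisioned capacity
dynamodb.create_table(
    TableName="docs",
    AttributeDefinitions=[
        {"AttributeName": "doc_id", "AttributeType": "S"},
        {"AttributeName": "created_at", "AttributeType": "S"},
        {"AttributeName": "source", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "doc_id", "KeyType": "HASH"},       # partition key
        {"AttributeName": "created_at", "KeyType": "RANGE"},  # sort key
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "source-index",
            "KeySchema": [{"AttributeName": "source", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        }
    ],
    # RCU/WCU for the base table; auto-scaling is configured separately
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```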

AWS Services

The following summarizes some of the learnings from experimenting with AWS services:

  • Understood the CLI commands and Python’s boto3 library for DynamoDB
  • The CLI is good enough for experimentation, but one cannot ingest millions of items using the AWS CLI
  • Tried my hand at boto3 and soon realized that one needs to carefully pick the provisioned capacity of the table so that it works well with batch operations. There is no point in sending a batch request when there is only meager provisioned capacity
  • Understood the various settings for capacity such as RCU, WCU and auto-scaling
  • Batch operations via boto3 seemed slow, and hence I was curious to look at AWS Data Pipeline.
  • AWS Data Pipeline had another sort of learning curve. First came the IAM permissions. Had to work through the right policy statements to get the relevant privileges to create the resource. Once the data pipeline was created, it became apparent that the usual JSON structure of the documents could not be used. I had to convert millions of documents into DynamoDB-compatible JSON. This obviously meant that I had to use AWS Glue, and that meant that I had to brush up on my PySpark syntax.
  • Brushing up on PySpark syntax meant that I needed to have a machine in the first place. Thanks to my recent machine crash, my Spark environment had evaporated. Hence I had to spend some time recreating a Spark environment on my machine. That led to a dead end again, as the configuration was not good enough to test out the JSON conversion
  • The next place to look was AWS EMR. I have had experience with using EMR previously, but the IAM permissions were something that held me back from quick prototyping. Created the relevant policies and roles and managed to spin up a cluster for prototyping on a Zeppelin notebook
  • Once I had Zeppelin set up for Spark, it was easy to revisit some basic data munging syntax. However, Zeppelin is not as user-friendly as a Jupyter notebook. More specifically, the keyboard shortcuts are pretty painful to get used to.
  • Managed to create a script that does the JSON conversion (a sketch along those lines appears after this list)
  • Next stop was AWS Glue. Same story here. Most of the time was spent on setting up IAM permissions. Managed to use Glue and convert all the relevant documents
  • Tried using Data Pipeline on a small sample and on a reasonably large dataset. Realized that it wasn't cut out for my job. It was more like a black box and I had no clue why some of the jobs were failing
  • I turned to AWS Glue, which seems to have a new sink connector for DynamoDB
  • In order to prototype quickly, I had to create a notebook instance where I could work with the native Glue library from AWS. Ended up creating a SageMaker endpoint and a SageMaker notebook. However, this again led to a dead end.
  • Finally, all the options of using existing AWS services had closed: Data Pipeline did not work out for me, and AWS Glue did not work out for my case.
  • In the process of working with many of the AWS services that require data to be in a specific format, in a specific folder, etc., there were a ton of S3 bucket operations that I had to do in order to get the data into the right shape. These tasks were, of course, accomplished with a combination of boto3 and Glue
  • Was left with no choice but to go with the traditional boto3 library and scale up the WCU for the table. For this I had to spin up an EC2 instance and write a simple batch script that ingests documents (a sketch of such a script appears after this list). These documents were of course not the raw documents, but were processed in such a way that they are amenable for use with the boto3 library
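
The conversion idea itself was simple once a Spark environment was available. Below is a rough PySpark sketch of it, assuming the raw documents are JSON lines on S3 and the target layout wraps each attribute in a DynamoDB-style type descriptor; the S3 paths are placeholders, only strings and numbers are handled, and the exact on-disk format expected by Data Pipeline is not reproduced here.

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("to-dynamodb-json").getOrCreate()

def to_dynamodb_json(doc):
    """Wrap each top-level field in a DynamoDB-style type descriptor.
    Only strings and numbers are handled; real documents need more cases."""
    item = {}
    for key, value in doc.items():
        if isinstance(value, (int, float)):
            item[key] = {"n": str(value)}
        else:
            item[key] = {"s": str(value)}
    return json.dumps(item)

# Paths are placeholders for the actual S3 locations
raw = spark.sparkContext.textFile("s3://my-bucket/raw-docs/")
converted = raw.map(json.loads).map(to_dynamodb_json)
converted.saveAsTextFile("s3://my-bucket/dynamodb-json/")
```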
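
The final approach, roughly: bump the WCU on the table and run a plain boto3 batch loader from an EC2 instance. A minimal sketch of such a script follows; the table name, file path and capacity figures are placeholders, and the input file is assumed to hold the already-processed documents as plain JSON lines.

```python
import json
import boto3

# Table name, file path and capacity figures below are placeholders
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("docs")

# Raise the write capacity before the bulk load (assumes auto-scaling is not
# managing this table)
dynamodb.meta.client.update_table(
    TableName="docs",
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 1000},
)

# batch_writer groups puts into 25-item BatchWriteItem calls and re-queues
# any unprocessed items
with table.batch_writer() as batch:
    with open("converted_docs.jsonl") as f:
        for line in f:
            batch.put_item(Item=json.loads(line))
```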

What happened finally?

Given all the data wrangling and the AWS service hopping that I had done, a happy conclusion would have been a successful ingestion of 150 million documents. Sadly, this adventure turned into a misadventure. There was a size restriction of 40KB per item in DynamoDB that I had overlooked. Somehow I was under the impression that 40KB was good enough, until I really checked the dataset. Realized that there is a sizable number of documents that don't fit within the 40KB item size limit. I had a secondary index too, which meant that what I was actually allowed was less than 20KB per item.
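
With hindsight, a quick size check on the raw documents would have surfaced the problem on day one. A rough sketch of such a check, using the serialized JSON length as a crude proxy for DynamoDB's item size accounting; the file path and the per-item budget are assumptions.

```python
import json

# Per-item size budget assumed here; adjust to the limit that applies to your table
MAX_ITEM_BYTES = 40 * 1024

oversized = 0
total = 0
with open("raw_docs.jsonl") as f:
    for line in f:
        total += 1
        # Serialized JSON length is only a rough proxy for DynamoDB's
        # attribute-name-plus-value accounting, but it is enough to flag outliers
        if len(json.dumps(json.loads(line)).encode("utf-8")) > MAX_ITEM_BYTES:
            oversized += 1

print(f"{oversized} of {total} documents exceed the {MAX_ITEM_BYTES} byte budget")
```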

Ultimately I had to abandon my intention of loading millions of documents into DynamoDB. A ton of lessons learned in the last few days. This post will serve as a reminder to me of all the things that didn't work out for my use case.