Machine Learning for Business Using Amazon SageMaker and Jupyter
The following are my learnings from the book.
Context
I had purchased a course on SageMaker from Udemy on Aug. 29, 2019. I knew that I had to work on SageMaker at some point because the SVM demo that I had created on tick data was not scaling well; it took an enormous amount of time to run. I was ambitious to think that I would be able to understand SageMaker without understanding the nuts and bolts of AWS. It took me 4 months to realize that I had only a cursory knowledge of AWS components and that it was in my best interest to go through them in detail. Since I was going through AWS in detail anyway, I took up the challenge of attempting the Solutions Architect certification. So, for close to 2 weeks, I immersed myself in AWS and took at least 10 practice tests before attempting the AWS exam and clearing it. With that in the past, my desire to understand SageMaker made me pick this book. As usual, I went through a high-level course on AWS SageMaker and realized that I was completely clueless in certain areas such as:
- Docker
- Kubernetes
- SageMaker basic workings
- How to deploy an algo to serve predictions
I knew I had to pick an introductory book on SageMaker, and then I stumbled onto this book. I am glad that I picked it up and within a week worked through all the examples. After typing in each of the examples in the book, I have developed some basic understanding of SageMaker. I will always remember this book as the one that gave me my initial start into the SageMaker journey.
How machine learning applies to your business
This chapter gives a high-level overview of machine learning. The most important takeaways for someone who is new to SageMaker are:
- All algorithms are hosted as containers, and each region has a separate image URI for accessing the algo (see the sketch after this list)
- One has to have a good understanding of the basic theory of containerization to appreciate the engineering magic that's happening in the background
- One has to understand the various hyperparameter knobs available via SageMaker for the various algos
- With machine learning, it is possible to bridge the gap between end-to-end systems and best-of-breed systems; ML can enhance the capabilities of end-to-end solutions.
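As a concrete illustration of the first point, here is a minimal sketch of looking up that region-specific container URI, assuming the current (v2) SageMaker Python SDK; the book itself may rely on the SDK's older helper, and the framework name and version below are just placeholders.

```python
from sagemaker import image_uris

# Resolve the region-specific ECR image URI for a built-in algorithm.
# "xgboost" and "1.5-1" are illustrative choices, not the book's exact ones.
region = "us-east-1"
container = image_uris.retrieve(framework="xgboost", region=region, version="1.5-1")
print(container)  # an ECR URI that is specific to us-east-1
```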
Should you send a purchase order to a technical approver?
This chapter deals with the application of the XGBoost algorithm on a dataset with categorical features. The labeled dataset comprises a binary outcome, i.e. whether the request has to be sent through a technical approval or not, and a bunch of categorical features. A basic data munging operation is done to convert the categorical features into numerical features. Subsequently, the training, validation and test data are stored in S3 buckets. The stage is then set for model tuning and deployment, which is done in the following way (a rough sketch follows the list):
- Create a SageMaker session
- Create a container object that points to the relevant XGBoost image
- Instantiate an estimator object by passing container, role, instances, SageMaker session and other relevant details
- Set the hyperparameters on the estimator
- Train the estimator
- Deploy the estimator at an endpoint
- Make predictions using the endpoint
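Here is a minimal end-to-end sketch of those steps, assuming the current (v2) SageMaker Python SDK; bucket names, prefixes, instance types and hyperparameter values are placeholders, and the book's own code may use the older SDK v1 calls.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer

sess = sagemaker.Session()                       # 1. SageMaker session
role = sagemaker.get_execution_role()            # IAM role of the notebook instance
region = sess.boto_region_name

# 2. Region-specific container image for the built-in XGBoost algorithm
container = image_uris.retrieve("xgboost", region, version="1.5-1")

# 3. Estimator: container, role, instance details, session
estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://my-bucket/orders/output",  # placeholder bucket/prefix
    sagemaker_session=sess,
)

# 4. Hyperparameters for a binary classification objective
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# 5. Train on the CSV files previously uploaded to S3
estimator.fit({
    "train": TrainingInput("s3://my-bucket/orders/train.csv", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/orders/validation.csv", content_type="text/csv"),
})

# 6. Deploy to a real-time endpoint, then 7. make predictions against it
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    serializer=CSVSerializer(),
)
result = predictor.predict("0,1,0,1,0")          # one CSV row of encoded features
```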
Came to know about the s3fs library, which makes it easy to manage data on S3. S3Fs is a Pythonic file interface to S3 that builds on top of botocore.
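A quick sketch of how s3fs is typically used; the bucket and key names here are made up.

```python
import pandas as pd
import s3fs

# s3fs lets you treat S3 objects much like local files
fs = s3fs.S3FileSystem()
print(fs.ls("my-bucket/orders"))                 # list objects under a prefix

with fs.open("my-bucket/orders/train.csv", "rb") as f:
    df = pd.read_csv(f)

# pandas can also use s3fs implicitly via an s3:// URL
df.to_csv("s3://my-bucket/orders/train_copy.csv", index=False)
```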
Should you call a customer because they are at risk of churning?
This chapter is about using XGBoost for a logistic regression task. The dataset is a labeled dataset containing numerical feature vectors for customers who belong to the churn and no-churn classes. The dataset provided already comes with all the feature transformations done, hence the setup is straightforward: split the file into training, validation and test data, then store the data in S3 buckets. Once the data is in S3, a SageMaker session is created and the XGBoost container is specified as part of the estimator. This estimator is then used to train an XGBoost binary logistic model. After knowing how to invoke gradient boosting, the most important thing to consider is the various hyperparameters. Unless one is comfortable with the basic idea of the algo, it becomes difficult to gauge the relative importance of the various tuning knobs. There are a ton of learning materials listed on the XGBoost site that one can go through to learn the algo's details.
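For reference, these are the kinds of tuning knobs one ends up setting for the churn model, reusing an estimator built the same way as in the previous chapter's sketch; the values below are illustrative, not the book's tuned ones.

```python
# Illustrative hyperparameters for the built-in XGBoost algorithm
# (placeholder values, not tuned for the churn dataset)
estimator.set_hyperparameters(
    objective="binary:logistic",  # binary classification, outputs a probability
    num_round=150,                # number of boosting rounds
    max_depth=5,                  # depth of each tree
    eta=0.2,                      # learning rate
    subsample=0.8,                # row sampling per tree
    colsample_bytree=0.8,         # column sampling per tree
    eval_metric="auc",            # metric reported on the validation channel
)
```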
Should an incident be escalated to your support team?
This chapter deals with tweet data that comes with labels, i.e. each tweet is labeled as 1 or 0 based on whether the tweet content needs to be escalated or not. The ML algo used in this case is the BlazingText algo from SageMaker. The algo needs the data in a specific format, with the label followed by the relevant tokens in each line of a text file. The example uses nltk to tokenize the text; one can use any of the available libraries to do the same.
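A small sketch of what that preprocessing might look like; BlazingText's supervised mode expects each line to start with a `__label__` prefix, and the nltk tokenizer simply follows the book's choice (the tweet text is invented).

```python
import nltk

nltk.download("punkt")  # tokenizer models used by word_tokenize

def to_blazingtext_line(tweet, label):
    """Format one example as '__label__<label> tok1 tok2 ...'."""
    tokens = nltk.word_tokenize(tweet.lower())
    return f"__label__{label} " + " ".join(tokens)

print(to_blazingtext_line("My internet is down again, please help!", 1))
# __label__1 my internet is down again , please help !
```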
The hyperparameters in the algo are:
- mode - whether the algo is used in supervised or unsupervised (word2vec) mode
- epochs - how many complete passes over the training data to make
- vector_dim - the dimension of the word vectors
- patience - the number of epochs to wait for improvement before early stopping
- min_epochs - the minimum number of epochs before the early-stopping logic kicks in
- word_ngrams - whether to consider uni/bi/tri-grams
I think the key points to keep in mind are the hyperparameters for the algo; a sketch of setting them follows.
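A hedged sketch of setting them on a BlazingText estimator; `bt_estimator` is assumed to be built the same way as the earlier estimators (with the BlazingText image URI), and the values are placeholders.

```python
# Placeholder values on a hypothetical BlazingText estimator
bt_estimator.set_hyperparameters(
    mode="supervised",     # text classification rather than word2vec
    epochs=10,
    vector_dim=10,
    early_stopping=True,
    patience=4,            # epochs to wait for improvement before stopping
    min_epochs=5,          # don't stop before this many epochs
    word_ngrams=2,         # use uni- and bi-grams
)
```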
Should you question an invoice sent by a supplier?
This chapter was interesting, as I learnt about the Random Cut Forest algorithm. I had never come across this algo in the usual ML texts; I subsequently realized that it was created by AWS and is used in several AWS services. Random Cut Forest can't handle text data, and hence the categorical data needs to be encoded as dummy variables. Once the data is converted into numbers, the invocation of Random Cut Forest is no different from any other SageMaker algo in terms of basic steps. In fact, there are very few hyperparameters for Random Cut Forest, i.e. the number of samples per tree and the number of trees. The authors also touch upon model evaluation metrics such as precision, recall and F1-score.
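The SageMaker Python SDK has a first-class RandomCutForest estimator, so a sketch of the invocation might look like the following; the role, bucket names, instance type and the two hyperparameter values are placeholders.

```python
from sagemaker import RandomCutForest

rcf = RandomCutForest(
    role=role,                         # IAM role from the SageMaker session
    instance_count=1,
    instance_type="ml.m5.large",
    num_samples_per_tree=512,          # samples drawn per tree
    num_trees=50,                      # number of trees in the forest
    data_location="s3://my-bucket/invoices/train",
    output_path="s3://my-bucket/invoices/output",
)

# record_set() converts a numpy float32 matrix of encoded features into the
# RecordIO-protobuf format that the built-in algorithm expects
rcf.fit(rcf.record_set(train_matrix.astype("float32")))
```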
Forecasting your company’s monthly power usage
This chapter deals with using DeepAR to forecast a multivariate time series dataset. The dataset comprises power usage time series across 48 sites. The key idea is to create train and test datasets in a format that is compatible with SageMaker.
I also learnt that the test dataset should span the entire dataset and not just the observations outside the training time window. Why is this the case? I need to dig deeper into DeepAR and understand the algo. I found a bug in this code where the authors failed to take into consideration missing values at the start of the dataset. In the book's snippet, they have assumed the start date for each of the time series to be the same, which is definitely not the case, especially after trimming the missing values at the front. Maybe the author will release an errata and correct this part of the code.
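A sketch of how a per-series start date could be derived after trimming the leading missing values, assuming the site readings live in a pandas DataFrame with one column per site; the function, file and variable names are mine, not the book's.

```python
import json
import pandas as pd

def series_to_deepar_record(col: pd.Series) -> dict:
    """Trim leading NaNs and keep this series' own start timestamp."""
    trimmed = col.loc[col.first_valid_index():]   # drop the leading missing values
    return {
        "start": str(trimmed.index[0]),           # per-series start date, not a shared one
        "target": ["NaN" if pd.isna(v) else v for v in trimmed],  # interior gaps as "NaN"
    }

# usage_df: one column per site, indexed by timestamp
# with open("train.json", "w") as f:
#     for site in usage_df.columns:
#         f.write(json.dumps(series_to_deepar_record(usage_df[site])) + "\n")
```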
Improving your company’s monthly power usage forecast
This chapter deals with using DeepAR to forecast a time series dataset that carries other covariate information. There are two types of additional series that can be incorporated into the time series prediction: the first is categorical series and the second is dynamic series. It is important to keep in mind that DeepAR allows missing values among covariates in the training data time range but not in the prediction data time range.
The specific structure of the dataset in JSON format looks as follows:
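The book's own listing is not reproduced in these notes, but the documented JSON Lines layout with both kinds of covariates can be sketched in Python like this; the dates, category indices and values are invented for illustration.

```python
import json

# One line per time series in the training file. "cat" carries static
# categorical features (e.g. a site identifier) and "dynamic_feat" carries
# time-varying covariates aligned with the target.
series = {
    "start": "2018-01-01 00:00:00",
    "target": [112.0, 118.0, 131.0, 125.0],
    "cat": [3],                                   # category index for this series
    "dynamic_feat": [[0, 0, 1, 0]],               # e.g. a holiday flag per timestep
}
print(json.dumps(series))                         # one JSON Lines record
```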
Came to know that SageMaker provides a DeepARPredictor class that can be used to review the results of a DeepAR model as a pandas DataFrame rather than a JSON object.
I found the MAPE calculation over the 28-day forecast horizon pretty weird. The calculation involves summing up all the forecasts and then computing a MAPE. Why should one sum the forecasts across all 28 days? Somehow it did not make sense to me.
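To make my confusion concrete, this is the distinction I was puzzling over, if I'm reading the book's calculation right: a single percentage error on the summed horizon versus the usual per-step MAPE averaged over the horizon (toy numbers below).

```python
import numpy as np

actual   = np.array([100.0, 120.0, 90.0, 110.0])   # toy daily usage
forecast = np.array([ 95.0, 130.0, 85.0, 120.0])

# MAPE on the summed horizon: one percentage error on the period totals
mape_on_totals = abs(forecast.sum() - actual.sum()) / actual.sum()

# Conventional MAPE: average of the per-day percentage errors
mape_per_step = np.mean(np.abs(forecast - actual) / actual)

print(mape_on_totals, mape_per_step)
```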
Serving predictions over the web
This chapter talks about ways to make a SageMaker inference endpoint accessible via a REST API from any public network. I have learnt about Chalice, a framework that makes deploying web apps on AWS a hassle-free experience. It comes with a preconfigured "hello world" application that you can tweak so that the API invokes the SageMaker inference endpoint via a Lambda function. I think I will revisit this chapter once I deploy a deep learning model and use it for doing real-time inference on tick data.
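A minimal sketch of what such a Chalice app might look like; the app name, route and endpoint name are made up, and the chapter's real code handles the request payload in more detail.

```python
import boto3
from chalice import Chalice

app = Chalice(app_name="order-approval")
runtime = boto3.client("sagemaker-runtime")

ENDPOINT_NAME = "order-approval-endpoint"   # placeholder endpoint name

@app.route("/approve", methods=["POST"])
def approve():
    # Forward the CSV payload from the API Gateway request to the endpoint
    body = app.current_request.raw_body.decode("utf-8")
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=body,
    )
    score = float(response["Body"].read().decode("utf-8"))
    return {"approve": score > 0.5, "score": score}
```

Running `chalice deploy` then stands up the API Gateway route and the backing Lambda function in one step.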
Case studies
The last chapter of the book focuses on WorkPac and Faethm, two companies that have incorporated machine learning into their workflow.
Takeaway
This book turned out to be perfect for me. By immersing myself in it for 5 days, I have understood the basic workings of SageMaker. By coding the examples mentioned in the book, I have understood the various components of AWS SageMaker algos. I would recommend this book to anyone who wants to get a taste of SageMaker so that they are equipped to dive deep into the SageMaker details.