Hands on Gradient Boosting - Book Review
This blog post summarizes the book "Hands-On Gradient Boosting with XGBoost and scikit-learn" by Corey Wade.
Context
Every machine learning engineer/data scientist wants to quickly build a first version of a model before jumping to more sophisticated Deep Learning based models. Among the classic ML algos available out there, XGBoost is a star. It has been used by many Kagglers and is widely supported by cloud platforms; there are readily available containers from AWS that one can use in SageMaker to get the job done. If you are short on time and simply want to use the library through its interface, then this is the right book. It is perfect for someone who wants to use the XGBoost algorithm quickly and check the results. There is not too much math in the book, as it is aimed at a broader audience who might not be too keen on understanding all the gory details behind the algo.
In this post, I will try to briefly summarize the contents of the book:
ML Landscape
The first chapter of the book gives an overview of ML and walks the reader through one dataset each for a classification and a regression task. Basic preprocessing steps are illustrated via the pandas library. Subsequently, a LinearRegression model's output is compared with an XGBRegressor model's output on one of the datasets. Similarly, a LogisticRegression model's output is compared with an XGBClassifier model's output. Models are compared after cross validation: various sklearn modules are put to use to crank out performance metrics for the various models. The reader immediately sees that the XGBoost based models seem to give higher performance than the plain vanilla models. I think these preliminary examples give enough motivation for a reader to move on to the next chapters of the book, which promise to explain the nuts and bolts of the extreme gradient boosting algorithm.
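Below is a minimal sketch of the kind of comparison this chapter runs, using sklearn's built-in diabetes data as a stand-in for the book's datasets; the model settings are illustrative, not the book's.

```python
# Compare a plain linear model with XGBoost under the same cross-validation setup.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X, y = load_diabetes(return_X_y=True)

for name, model in [("LinearRegression", LinearRegression()),
                    ("XGBRegressor", XGBRegressor(n_estimators=100))]:
    # sklearn reports negative MSE; flip the sign and take the root.
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    rmse = np.sqrt(-scores)
    print(f"{name}: mean RMSE = {rmse.mean():.2f}")
```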
Decision Trees in Depth
XGBoost is an ensemble method, i.e. it combines a set of weak learners. A weak learner can be any ML model, but traditionally it has always been a decision tree. Hence it is apt to have a good understanding of decision trees so that one can use XGBoost effectively. The chapter makes use of the cleaned dataset from the earlier chapters and showcases the various steps of building a decision tree regressor. In the process of building a decision tree, there are many hyperparameter settings that one might have to tune in order to get the best performance on the validation set. The following are some of the hyperparameters that one can tweak:
- max_depth: maximum depth of the regression tree
- min_samples_leaf: specifies the restriction on the number of samples that a leaf may have
- max_leaf_nodes
- max_features: specifies the selection criterion on the number of features to be included for the decision tree modeling
- min_samples_split: limits the number of samples required before a split can take place
- splitter: specifies whether to use the 'random' or 'best' splitter strategy
- criterion: how should the splits be made?
- min_impurity_split
- min_impurity_decrease
- min_weight_fraction_leaf
- ccp_alpha
The author suggests that, out of all the parameters, tweaking the following should more or less do the job:
- max_depth
- max_features
- min_samples_leaf
- max_leaf_nodes
- min_samples_split
- min_impurity_decrease
The chapter walks through the hyperparameter selection task using a dataset available at the UCI ML Repository.
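As a rough illustration of that workflow (not the book's exact code), one could grid-search a few of the hyperparameters above on a built-in sklearn dataset; the grid values here are illustrative:

```python
# Grid-search a handful of decision tree hyperparameters with cross-validation.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 5, 10],
    "max_features": [None, "sqrt", 0.8],
}

search = GridSearchCV(DecisionTreeRegressor(random_state=2),
                      param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```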
Random Forests
This chapter introduces random forests, which are considered alternatives to gradient boosting. Random forests use bagging to cut down the variance that arises from fitting individual decision trees. XGBoost, as the name suggests, uses boosting to build successive learning models.
The chapter uses the bike rentals data and the census data to apply RandomForestClassifier and RandomForestRegressor models. These models are fit on the datasets to show that they perform better than plain decision tree models. Since random forests have a tendency to overfit too, one typically uses cross-validation to report the accuracy score.
The following are the hyperparameters of RandomForest:
- oob_score
- n_estimators
- warm_start
- max_depth
- max_features
- min_samples_split
- min_impurity_decrease
- min_samples_leaf
- min_weight_fraction_leaf
The chapter also uses RandomizedSearchCV to shortlist the best parameters for the datasets. A series of trials and errors brings the error down considerably for the bike rentals dataset. If the dataset is small and you want a quick model that cuts down the variance of individual decision trees, then random forests, which are based on bootstrap aggregation, are a good fit. One of the limitations of a random forest is that it is limited by its individual trees: if all the trees make the same mistake, the random forest makes that mistake. This is where gradient boosting shines, as each successive model improves upon the errors of the previous ones.
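A minimal sketch of this kind of randomized search, assuming X and y (e.g. the bike rentals features and target) have already been prepared; the parameter ranges are illustrative:

```python
# Narrow down random forest hyperparameters with a randomized search.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [50, 100, 300, 500],
    "max_depth": [None, 8, 12, 20],
    "min_samples_leaf": [1, 2, 4, 8],
    "max_features": [None, "sqrt", 0.5],
}

search = RandomizedSearchCV(RandomForestRegressor(random_state=2),
                            param_distributions, n_iter=20, cv=5,
                            scoring="neg_mean_squared_error", random_state=2)
search.fit(X, y)  # X, y assumed to exist
print(search.best_params_, search.best_score_)
```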
From Gradient Boosting to XGBoost
The chapter starts off by performing gradient boosting from scratch on the bike rentals dataset. It fits three decision trees sequentially, each on the residuals of the previous model. One can see that this process yields the same results as using an out-of-the-box implementation of gradient boosting. Of course, one can tweak the parameters to get better predictions on the dataset. The following are some of the hyperparameters relevant to the GradientBoosting algo:
- learning_rate: known as shrinkage; it shrinks the contribution of individual trees so that no tree has too much influence when building the model. If an entire ensemble is built from the errors of one base learner, without careful adjustment of hyperparameters the early trees in the model can have too much influence on subsequent development
- n_estimators: number of boosting rounds
- max_depth: depth of the weak learner
- subsample: percentage of samples to be selected for each boosting round
These hyperparameters should not be tuned individually, as the knobs are interdependent. One obvious linkage to spot is between learning_rate and n_estimators: the former should go down as the latter increases.
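The from-scratch procedure described at the start of this chapter can be sketched roughly as follows (my own paraphrase, not the book's code); X_train, y_train and X_test are assumed to exist:

```python
# Fit trees sequentially on the residuals of the running prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X_train, y_train, X_test, n_rounds=3, learning_rate=1.0, max_depth=2):
    prediction_train = np.zeros(len(y_train))
    prediction_test = np.zeros(len(X_test))
    for _ in range(n_rounds):
        residuals = y_train - prediction_train           # errors left to explain
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=2)
        tree.fit(X_train, residuals)                     # weak learner on residuals
        prediction_train += learning_rate * tree.predict(X_train)
        prediction_test += learning_rate * tree.predict(X_test)
    return prediction_test
```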
A few rounds of manual as well as randomized checks give rise to a set of hyperparameters that bring the training error down to a level that is better than random forests. The chapter uses GradientBoostingRegressor from sklearn and XGBRegressor for fitting the data; the two models are different ways to perform gradient boosting. Even though the results are similar for small datasets, it is on big datasets that XGBoost shines. To illustrate this point, the chapter fits a GradientBoostingRegressor and an XGBRegressor to a large dataset. On my laptop, GradientBoostingRegressor took ~5 minutes whereas XGBRegressor took ~30 seconds. Seeing is believing. This example will motivate anybody to look at XGBoost, as it is clearly far superior in accuracy and speed compared to GradientBoosting from sklearn.
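A rough sketch of such a timing comparison, using a synthetic dataset in place of the book's large dataset; absolute timings will of course vary by machine and library version:

```python
# Time sklearn's gradient boosting against XGBoost on a large synthetic dataset.
import time
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

X, y = make_regression(n_samples=200_000, n_features=20, random_state=2)

for name, model in [("GradientBoostingRegressor", GradientBoostingRegressor()),
                    ("XGBRegressor", XGBRegressor())]:
    start = time.time()
    model.fit(X, y)
    print(f"{name}: {time.time() - start:.1f} s")
```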
XGBoost Unveiled
A bit of history behind the algo: Tianqi Chen from the University of Washington worked on enhancing the capabilities of gradient boosting and authored a paper titled "XGBoost: A Scalable Tree Boosting System". What are the key features of XGBoost that make it computationally faster than previous algos?
Handling missing values
XGBoost is able to handle missing feature values and has inbuilt rules for splitting the decision trees when values are missing.
Gaining speed
- Approximate split-finding algorithm: uses a weighted quantile sketch based algorithm to determine candidate splits for the decision trees
- Sparsity-aware split finding: when the majority of the entries are 0 or null, the algo stores the data as sparse matrices, which saves valuable space
- Parallel computing: boosting is not ideal for parallel computing since each tree depends on the results of the previous tree. However, there are places where parallelization can be done, such as the split-finding algorithm
- Cache-aware access: XGBoost uses cache-aware prefetching. It allocates an internal buffer, fetches the gradient statistics, and performs accumulation with mini-batches
- Block compression and sharding: these techniques help with computationally expensive disk reading by compressing columns. Block sharding decreases read times by sharding the data across multiple disks
Accuracy gains
Regularization is built into the XGBoost objective, and hence it tends to be more accurate than regular gradient boosting procedures.
Math
A brief derivation of the XGBoost math is given in the book. I think it is better to spend time reading the original XGBoost paper and understanding the math behind it, rather than relying on a few formulas that suddenly appear out of context.
This chapter gives out two templates, one for classification and one for regression tasks. It also uses a reasonably large dataset (Higgs Boson) to perform a classification task using the native XGBoost Python API, rather than the scikit-learn wrapper that takes care of creating DMatrices behind the scenes.
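For reference, a minimal sketch of the native API pattern, assuming X_train, y_train, X_test, y_test already exist and a binary classification objective:

```python
# Native XGBoost API: wrap the data in DMatrix objects and call xgb.train.
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {"objective": "binary:logistic", "max_depth": 6, "eta": 0.1,
          "eval_metric": "auc"}

booster = xgb.train(params, dtrain, num_boost_round=200,
                    evals=[(dtest, "eval")], early_stopping_rounds=10)
preds = booster.predict(dtest)   # probabilities for the positive class
```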
XGBoost Hyperparameters
This chapter is all about tuning hyperparameters. A toy dataset is used to illustrate the various tuning steps. The first aspect to keep in mind is that cross_val_score and GridSearchCV do not use the same k folds for reporting the scores. Hence it is better to create a StratifiedKFold object to provide train/test indices that can be used across all functions that perform cross-validation or hyperparameter tuning. The list of knobs that one can turn is quite long. Some of the prominent ones are:
- n_estimators
- learning_rate
- max_depth
- gamma
- min_child_weight
- subsample
- colsample_bylevel
- colsample_bytree
- colsample_bynode
- max_delta_step
- lambda
- alpha
- missing
- scale_pos_weight
The chapter turns the above knobs one at a time and obtains a good out-of-sample error for the toy dataset. The process followed in this chapter can be used as a guide for hyperparameter tuning on any big, real-life problem for which you want to use XGBoost.
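A small sketch of the fold-sharing idea, assuming X and y form a classification dataset; the grid here is illustrative:

```python
# Build one StratifiedKFold and pass the same object everywhere so scores are comparable.
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV
from xgboost import XGBClassifier

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

baseline = cross_val_score(XGBClassifier(), X, y, cv=kfold).mean()

grid = GridSearchCV(XGBClassifier(),
                    {"max_depth": [2, 3, 5], "learning_rate": [0.05, 0.1, 0.3]},
                    cv=kfold)
grid.fit(X, y)
print(baseline, grid.best_score_, grid.best_params_)
```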
Exoplanets
This chapter contains a run-through of all the concepts mentioned in the previous chapter, on a specific dataset. This dataset is highly imbalanced, and hence the author takes great pains to highlight that accuracy is a misleading indicator of a model's performance; it is precision and recall scores that matter. In that sense, the author tries to undersample and oversample the data to build better predictors. However, I think there is a problem in the way the author undersamples and oversamples: one has to oversample the minority class exemplars or undersample the majority class exemplars. The code mentioned in the chapter does not take this into consideration and does undersampling and oversampling on the entire dataset, which I think is incorrect. In any case, the point is well taken: one cannot blindly apply ML algos when the data is imbalanced. There needs to be some kind of balancing strategy applied to the classes before any ML algo can be applied.
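For illustration, here is a minimal sketch of resampling done the way I describe above: split first, then oversample only the minority class of the training data (X and y are assumed to be NumPy arrays for an imbalanced binary problem, with class 1 as the rare class):

```python
# Oversample only the minority class of the training split; leave the test set untouched.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=2)

minority = y_train == 1                      # class 1 assumed to be the rare class
X_min, y_min = X_train[minority], y_train[minority]
X_maj, y_maj = X_train[~minority], y_train[~minority]

# Oversample the minority class (with replacement) up to the majority count.
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=2)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
```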
Alternative Base Learners
This chapter mentions various base learners that can be used as a part of the XGBoost framework; the ones covered in this chapter are:
- gblinear: when the hypothesis is that the data is better suited to a linear classifier or regressor, one can use this base model. Of course, it is very difficult to find data in the real world that perfectly fits a linear model and not a non-linear model. The author simulates a dataset that follows a linear relationship and shows that gblinear works better than gbtree
- dart: stands for Dropouts meet Multiple Additive Regression Trees. This came out in 2015 and one can try it to check whether one obtains a better fit
The above-mentioned base learners are a part of the XGBRegressor and XGBClassifier objects in the XGBoost library.
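Switching base learners is just a matter of the booster parameter on the scikit-learn wrappers; a minimal sketch, assuming X and y exist:

```python
# Compare the three boosters under the same cross-validation setup.
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

for booster in ["gbtree", "gblinear", "dart"]:
    model = XGBRegressor(booster=booster)
    score = cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()
    print(booster, score)
```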
XGBoost Random Forests
There are two strategies to implement random forests within XGBoost. The first is to use random forests as the base learner; the second is to use XGBoost's own random forests, XGBRFRegressor and XGBRFClassifier.
- Base learner: there is no option to set the booster hyperparameter to a random forest. Instead, the hyperparameter num_parallel_tree may be increased from its default value of 1 to transform gbtree into a boosted random forest. The idea here is that each boosting round no longer consists of one tree, but of a number of parallel trees, which in turn make up a forest.
- Use XGBRFRegressor and XGBRFClassifier from the XGBoost library. These are random forest machine learning algorithms that are not base learners but algorithms in their own right. They work in a similar manner to scikit-learn's random forests. The primary difference is that XGBoost includes default hyperparameters to counteract over-fitting and its own methods for building individual trees.
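A minimal sketch of the two routes, assuming X and y exist (parameter values are illustrative):

```python
# Two ways to get random forests out of the XGBoost library.
from xgboost import XGBRegressor, XGBRFRegressor

# Route 1: each boosting round grows several parallel trees instead of a single tree.
boosted_forest = XGBRegressor(num_parallel_tree=25, n_estimators=50)
boosted_forest.fit(X, y)

# Route 2: XGBoost's own random forest implementation (no boosting of rounds).
rf = XGBRFRegressor(n_estimators=100)
rf.fit(X, y)
```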
Kaggle Masters
This chapter distills the wisdom of many Kaggle grandmasters who have won many competitions. The following are some of the key points mentioned in the book:
- LightGBM, a lightning-fast Microsoft version of gradient boosting, is a serious competitor to XGBoost
- Feature engineering is key: for example, converting categorical columns to frequencies instead of one-hot encoding them, or using the mean encoding technique
- Winning models of Kaggle competitions are rarely individual models; they are almost always ensembles
- Select non-correlated models; VotingClassifier from sklearn is a convenient class for creating ensemble predictions
- Stacking combines ML models at two levels: the base level, whose models make predictions on all the data, and the meta level, which takes the predictions of the base models as input and uses them to generate final predictions
- Stacked models have found huge success in Kaggle competitions. Most Kaggle competitions have merger deadlines, where individuals and teams can join together. These mergers can lead to greater success because teams can build larger ensembles and stack their models together
- StackingClassifier from sklearn is a convenient class for stacking the output from various base models and combining all the predictions in a different model
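A small sketch of these two sklearn wrappers with hypothetical base models (X and y are assumed to exist):

```python
# Combine non-correlated base models via voting and stacking.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

base_models = [("xgb", XGBClassifier()),
               ("rf", RandomForestClassifier(random_state=2)),
               ("lr", LogisticRegression(max_iter=1000))]

# Voting: average the (soft) predictions of the base models.
voter = VotingClassifier(estimators=base_models, voting="soft")

# Stacking: base-level predictions become the input of a meta-level model.
stacker = StackingClassifier(estimators=base_models,
                             final_estimator=LogisticRegression(max_iter=1000))

voter.fit(X, y)     # X, y assumed to exist
stacker.fit(X, y)
```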
Model Deployment
The last chapter serves as a good case study to apply all the aspects mentioned throughout the first 9 chapters of the book. I think it is better not to rush through the last chapter while things are fresh in your memory. It is better to take a break and revisit this chapter a month or a few months after going through the first 9 chapters. Why? It gives your brain sufficient time to forget certain aspects of XGBoost, and when you revisit them later, you strengthen the retrieval connections in your brain. I have always found it very helpful to take a pause and revisit material after a break, and I think it applies to many others, as our learning is often better if it is spaced.
Takeaway
I find that Packt books are usually "tutorials" that are expanded into "books". Of course, there are exceptions. This book is definitely an exception, as the author has put in great effort to illustrate all the relevant concepts of XGBoost with a ton of examples. Obviously, to learn the math behind XGBoost, you will have to look at other sources. But the book is definitely worth going through if you want to get a working knowledge of XGBoost.