Hands on Gradient Boosting - Book Review
This blog post summarizes the book "Hands-On Gradient Boosting with XGBoost and scikit-learn" by Corey Wade.
Context
Every machine learning engineer/data scientist wants to quickly build a first version of a model before jumping to more sophisticated Deep Learning based models. Among the classic ML algos available out there, XGBoost is a star. It has been used by many Kagglers and is widely supported by cloud platforms; there are readily available containers from AWS that one can use in SageMaker to get the job done. If you are short on time and simply want to use the library through its interface, then this is the right book. It is perfect for someone who wants to use the XGBoost algorithm quickly and check the results. There is not too much math in the book, as it is aimed at a broader audience who might not be too keen on understanding all the gory details behind the algo.
In this post, I will try to briefly summarize the contents of the book:
ML Landscape
The first chapter of the book gives an overview of ML and walks the reader through one dataset each for a classification and a regression task. Basic preprocessing steps are illustrated via the pandas library. Subsequently, a LinearRegression model's output is compared with an XGBRegressor model's output on one of the datasets. Similarly, a LogisticRegression model's output is compared with an XGBClassifier model's output. Models are compared after cross validation: various sklearn modules are put to use to crank out performance metrics for the various models. The reader immediately sees that the XGBoost based models seem to give higher performance than the plain vanilla models. I think these preliminary examples give enough motivation for a reader to move on to the next chapters of the book, which promise to explain the nuts and bolts of the extreme gradient boosting algorithm.
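Below is a minimal sketch of the kind of comparison this chapter runs, using sklearn's built-in diabetes data as a stand-in for the book's datasets; the model settings are illustrative, not the book's.

```python
# Compare a plain linear model with XGBoost under the same cross-validation setup.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X, y = load_diabetes(return_X_y=True)

for name, model in [("LinearRegression", LinearRegression()),
                    ("XGBRegressor", XGBRegressor(n_estimators=100))]:
    # sklearn reports negative MSE; flip the sign and take the root.
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    rmse = np.sqrt(-scores)
    print(f"{name}: mean RMSE = {rmse.mean():.2f}")
```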
Decision Trees in Depth
XGBoost is an ensemble method, i.e. it combines a set of weak learners. A weak learner can be any ML model, but traditionally it has always been a decision tree. Hence it is apt to have a good understanding of decision trees so that one can use XGBoost effectively. The chapter makes use of the cleaned dataset from the earlier chapters and showcases the various steps of building a decision tree regressor. In the process of building a decision tree, there are many hyperparameter settings that one might have to tune in order to get the best performance on the validation set. The following are some of the hyperparameters that one can tweak:
- max_depth: maximum depth of the regression tree
- min_samples_leaf: specifies the restriction on the number of samples that a leaf may have
- max_leaf_nodes
- max_features: specifies the selection criterion on the number of features to be included for the decision tree modeling
- min_samples_split: limits the number of samples required before a split can take place
- splitter: specifies whether to use the 'random' or 'best' splitter strategy
- criterion: how should the splits be made?
- min_impurity_split
- min_impurity_decrease
- min_weight_fraction_leaf
- ccp_alpha
The author suggests that, out of all the parameters, tweaking the following should more or less do the job:
- max_depth
- max_features
- min_samples_leaf
- max_leaf_nodes
- min_samples_split
- min_impurity_decrease
The chapter walks through the hyperparameter selection task using a dataset available at the UCI ML Repository.
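As a rough illustration of that workflow (not the book's exact code), one could grid-search a few of the hyperparameters above on a built-in sklearn dataset; the grid values here are illustrative:

```python
# Grid-search a handful of decision tree hyperparameters with cross-validation.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 5, 10],
    "max_features": [None, "sqrt", 0.8],
}

search = GridSearchCV(DecisionTreeRegressor(random_state=2),
                      param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```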
Random Forests
This chapter introduces random forests, which are considered alternatives to gradient boosting. Random forests use bagging to cut down the variance that arises from fitting individual decision trees. XGBoost, as the name suggests, uses boosting to build successive learning models.
The chapter uses the bike rentals data and the census data to apply RandomForestClassifier and RandomForestRegressor models. These models are fit on the datasets to show that they perform better than plain decision tree models. Since random forests have a tendency to overfit too, one typically uses cross-validation to report the accuracy score.
The following are the hyperparameters of RandomForest:
- oob_score
- n_estimators
- warm_start
- max_depth
- max_features
- min_samples_split
- min_impurity_decrease
- min_samples_leaf
- min_weight_fraction_leaf
The chapter also uses RandomizedSearchCV to shortlist the best parameters for the datasets. A series of trials and errors brings the error down considerably for the bike rentals dataset. If the dataset is small and you want a quick model that cuts down the variance of individual decision trees, then random forests, which are based on bootstrap aggregation, are a good fit. One of the limitations of a random forest is that it is limited by its individual trees: if all the trees make the same mistake, the random forest makes that mistake. This is where gradient boosting shines, as each successive model improves upon the errors of the previous ones.
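A minimal sketch of this kind of randomized search, assuming X and y (e.g. the bike rentals features and target) have already been prepared; the parameter ranges are illustrative:

```python
# Narrow down random forest hyperparameters with a randomized search.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [50, 100, 300, 500],
    "max_depth": [None, 8, 12, 20],
    "min_samples_leaf": [1, 2, 4, 8],
    "max_features": [None, "sqrt", 0.5],
}

search = RandomizedSearchCV(RandomForestRegressor(random_state=2),
                            param_distributions, n_iter=20, cv=5,
                            scoring="neg_mean_squared_error", random_state=2)
search.fit(X, y)  # X, y assumed to exist
print(search.best_params_, search.best_score_)
```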
From Gradient Boosting to XGBoost
The chapter starts off by performing gradient boosting from scratch on the bike rentals dataset. It fits three decision trees sequentially, each on the residuals of the previous model. One can see that this process yields the same results as using an out-of-the-box implementation of gradient boosting. Of course, one can tweak the parameters to get better predictions on the dataset. The following are some of the hyperparameters relevant to the GradientBoosting algo:
- learning_rate: known as shrinkage; it shrinks the contribution of individual trees so that no tree has too much influence when building the model. If an entire ensemble is built from the errors of one base learner, without careful adjustment of hyperparameters the early trees in the model can have too much influence on subsequent development
- n_estimators: number of boosting rounds
- max_depth: depth of the weak learner
- subsample: percentage of samples to be selected for each boosting round
These hyperparameters should not be tuned individually, as the knobs are interdependent. One obvious linkage to spot is between learning_rate and n_estimators: the former should go down as the latter increases.
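The from-scratch procedure described at the start of this chapter can be sketched roughly as follows (my own paraphrase, not the book's code); X_train, y_train and X_test are assumed to exist:

```python
# Fit trees sequentially on the residuals of the running prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X_train, y_train, X_test, n_rounds=3, learning_rate=1.0, max_depth=2):
    prediction_train = np.zeros(len(y_train))
    prediction_test = np.zeros(len(X_test))
    for _ in range(n_rounds):
        residuals = y_train - prediction_train           # errors left to explain
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=2)
        tree.fit(X_train, residuals)                     # weak learner on residuals
        prediction_train += learning_rate * tree.predict(X_train)
        prediction_test += learning_rate * tree.predict(X_test)
    return prediction_test
```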
A few rounds of manual as well as randomized checks give rise to a set of hyperparameters that bring the training error down to a level that is better than random forests. The chapter uses GradientBoostingRegressor from sklearn and XGBRegressor for fitting the data; the two models are different ways to perform gradient boosting. Even though the results are similar for small datasets, it is on big datasets that XGBoost shines. To illustrate this point, the chapter fits a GradientBoostingRegressor and an XGBRegressor to a large dataset. On my laptop, GradientBoostingRegressor took ~5 minutes whereas XGBRegressor took ~30 seconds. Seeing is believing. This example will motivate anybody to look at XGBoost, as it is clearly far superior in accuracy and speed compared to GradientBoosting from sklearn.
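A rough sketch of such a timing comparison, using a synthetic dataset in place of the book's large dataset; absolute timings will of course vary by machine and library version:

```python
# Time sklearn's gradient boosting against XGBoost on a large synthetic dataset.
import time
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

X, y = make_regression(n_samples=200_000, n_features=20, random_state=2)

for name, model in [("GradientBoostingRegressor", GradientBoostingRegressor()),
                    ("XGBRegressor", XGBRegressor())]:
    start = time.time()
    model.fit(X, y)
    print(f"{name}: {time.time() - start:.1f} s")
```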
XGBoost Unveiled
A bit of history behind the algo: Tianqi Chen from the University of Washington worked on enhancing the capabilities of gradient boosting and authored a paper titled "XGBoost: A Scalable Tree Boosting System". What are the key features of XGBoost that make it computationally faster than previous algos?
Handling missing values
XGBoost is able to handle missing feature values and has inbuilt rules for splitting the decision trees when values are missing.
Gaining speed
- Approximate split-finding algorithm: uses a weighted quantile sketch based algorithm to determine candidate splits for the decision trees
- Sparsity-aware split finding: when the majority of the entries are 0 or null, the algo stores the data as sparse matrices, which saves valuable space
- Parallel computing: boosting is not ideal for parallel computing since each tree depends on the results of the previous tree. However, there are places where parallelization can be done, such as the split-finding algorithm
- Cache-aware access: XGBoost uses cache-aware prefetching. It allocates an internal buffer, fetches the gradient statistics, and performs accumulation with mini-batches
- Block compression and sharding: these techniques help with computationally expensive disk reading by compressing columns. Block sharding decreases read times by sharding the data across multiple disks
Accuracy gains
Regularization is built into the XGBoost objective, and hence it tends to be more accurate than regular gradient boosting procedures.
Math
A brief derivation of the XGBoost math is given in the book. I think it is better to spend time reading the original XGBoost paper and understanding the math behind it, rather than relying on a few formulas that suddenly appear out of context.
This chapter gives out two templates, one for classification and one for regression tasks. It also uses a reasonably large dataset (Higgs Boson) to perform a classification task using the native XGBoost Python API, rather than the scikit-learn wrapper that takes care of creating DMatrices behind the scenes.
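For reference, a minimal sketch of the native API pattern, assuming X_train, y_train, X_test, y_test already exist and a binary classification objective:

```python
# Native XGBoost API: wrap the data in DMatrix objects and call xgb.train.
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {"objective": "binary:logistic", "max_depth": 6, "eta": 0.1,
          "eval_metric": "auc"}

booster = xgb.train(params, dtrain, num_boost_round=200,
                    evals=[(dtest, "eval")], early_stopping_rounds=10)
preds = booster.predict(dtest)   # probabilities for the positive class
```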
XGBoost Hyperparameters
This chapter is all about tuning hyperparameters. A toy dataset is used to illustrate the various tuning steps. The first aspect to keep in mind is that cross_val_score and GridSearchCV do not use the same k folds for reporting the scores. Hence it is better to create a StratifiedKFold object to provide train/test indices that can be used across all functions that perform cross-validation or hyperparameter tuning. The list of knobs that one can turn is quite long. Some of the prominent ones are:
- n_estimators
- learning_rate
- max_depth
- gamma
- min_child_weight
- subsample
- colsample_bylevel
- colsample_bytree
- colsample_bynode
- max_delta_step
- lambda
- alpha
- missing
- scale_pos_weight
The chapter turns the above knobs one at a time and obtains a good out-of-sample error for the toy dataset. The process followed in this chapter can be used as a guide for hyperparameter tuning on any big, real-life problem for which you want to use XGBoost.
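A small sketch of the fold-sharing idea, assuming X and y form a classification dataset; the grid here is illustrative:

```python
# Build one StratifiedKFold and pass the same object everywhere so scores are comparable.
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV
from xgboost import XGBClassifier

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

baseline = cross_val_score(XGBClassifier(), X, y, cv=kfold).mean()

grid = GridSearchCV(XGBClassifier(),
                    {"max_depth": [2, 3, 5], "learning_rate": [0.05, 0.1, 0.3]},
                    cv=kfold)
grid.fit(X, y)
print(baseline, grid.best_score_, grid.best_params_)
```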
Exoplanets
This chapter contains a run-through of all the concepts mentioned in the previous chapter, on a specific dataset. This dataset is highly imbalanced, and hence the author takes great pains to highlight that accuracy is a misleading indicator of a model's performance; it is precision and recall scores that matter. In that sense, the author tries to undersample and oversample the data to build better predictors. However, I think there is a problem in the way the author undersamples and oversamples: one has to oversample the minority class exemplars or undersample the majority class exemplars. The code mentioned in the chapter does not take this into consideration and does undersampling and oversampling on the entire dataset, which I think is incorrect. In any case, the point is well taken: one cannot blindly apply ML algos when the data is imbalanced. There needs to be some kind of balancing strategy applied to the classes before any ML algo can be applied.
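For illustration, here is a minimal sketch of resampling done the way I describe above: split first, then oversample only the minority class of the training data (X and y are assumed to be NumPy arrays for an imbalanced binary problem, with class 1 as the rare class):

```python
# Oversample only the minority class of the training split; leave the test set untouched.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=2)

minority = y_train == 1                      # class 1 assumed to be the rare class
X_min, y_min = X_train[minority], y_train[minority]
X_maj, y_maj = X_train[~minority], y_train[~minority]

# Oversample the minority class (with replacement) up to the majority count.
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=2)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
```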
Alternative Base Learners
This chapter mentions various base learners that can be used as a part of the XGBoost framework; the ones covered in this chapter are:
- gblinear: when the hypothesis is that the data is better suited to a linear classifier or regressor, one can use this base model. Of course, it is very difficult to find data in the real world that perfectly fits a linear model and not a non-linear model. The author simulates a dataset that follows a linear relationship and shows that gblinear works better than gbtree
- dart: stands for Dropouts meet Multiple Additive Regression Trees. This came out in 2015 and one can try it to check whether one obtains a better fit
The above-mentioned base learners are a part of the XGBRegressor and XGBClassifier objects in the XGBoost library.
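Switching base learners is just a matter of the booster parameter on the scikit-learn wrappers; a minimal sketch, assuming X and y exist:

```python
# Compare the three boosters under the same cross-validation setup.
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

for booster in ["gbtree", "gblinear", "dart"]:
    model = XGBRegressor(booster=booster)
    score = cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()
    print(booster, score)
```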
XGBoost Random Forests
There are two strategies to implement random forests within XGBoost. The first is to use random forests as the base learner; the second is to use XGBoost's own random forests, XGBRFRegressor and XGBRFClassifier.
- Base learner: there is no option to set the booster hyperparameter to a random forest. Instead, the hyperparameter num_parallel_tree may be increased from its default value of 1 to transform gbtree into a boosted random forest. The idea here is that each boosting round no longer consists of one tree, but of a number of parallel trees, which in turn make up a forest.
- Use XGBRFRegressor and XGBRFClassifier from the XGBoost library. These are random forest machine learning algorithms that are not base learners but algorithms in their own right. They work in a similar manner to scikit-learn's random forests. The primary difference is that XGBoost includes default hyperparameters to counteract over-fitting and its own methods for building individual trees.
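A minimal sketch of the two routes, assuming X and y exist (parameter values are illustrative):

```python
# Two ways to get random forests out of the XGBoost library.
from xgboost import XGBRegressor, XGBRFRegressor

# Route 1: each boosting round grows several parallel trees instead of a single tree.
boosted_forest = XGBRegressor(num_parallel_tree=25, n_estimators=50)
boosted_forest.fit(X, y)

# Route 2: XGBoost's own random forest implementation (no boosting of rounds).
rf = XGBRFRegressor(n_estimators=100)
rf.fit(X, y)
```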
Kaggle Masters
This chapter distills the wisdom of many Kaggle grandmasters who have won many competitions. The following are some of the key points mentioned in the book:
- LightGBM, a lightning-fast Microsoft version of gradient boosting, is a serious competitor to XGBoost
- Feature engineering is key: for example, converting categorical columns to frequencies instead of one-hot encoding them, or using the mean encoding technique
- Winning models of Kaggle competitions are rarely individual models; they are almost always ensembles
- Select non-correlated models; VotingClassifier from sklearn is a convenient class for creating ensemble predictions
- Stacking combines ML models at two levels: the base level, whose models make predictions on all the data, and the meta level, which takes the predictions of the base models as input and uses them to generate final predictions
- Stacked models have found huge success in Kaggle competitions. Most Kaggle competitions have merger deadlines, where individuals and teams can join together. These mergers can lead to greater success because teams can build larger ensembles and stack their models together
- StackingClassifier from sklearn is a convenient class for stacking the output from various base models and combining all the predictions in a different model
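A small sketch of these two sklearn wrappers with hypothetical base models (X and y are assumed to exist):

```python
# Combine non-correlated base models via voting and stacking.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

base_models = [("xgb", XGBClassifier()),
               ("rf", RandomForestClassifier(random_state=2)),
               ("lr", LogisticRegression(max_iter=1000))]

# Voting: average the (soft) predictions of the base models.
voter = VotingClassifier(estimators=base_models, voting="soft")

# Stacking: base-level predictions become the input of a meta-level model.
stacker = StackingClassifier(estimators=base_models,
                             final_estimator=LogisticRegression(max_iter=1000))

voter.fit(X, y)     # X, y assumed to exist
stacker.fit(X, y)
```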
Model Deployment
The last chapter serves as a good case study to apply all the aspects mentioned throughout the first 9 chapters of the book. I think it is better not to rush through the last chapter while things are fresh in your memory. It is better to take a break and revisit this chapter a month or a few months after going through the first 9 chapters. Why? It gives your brain sufficient time to forget certain aspects of XGBoost, and when you revisit them later, you strengthen the retrieval connections in your brain. I have always found it very helpful to take a pause and revisit material after a break, and I think it applies to many others, as our learning is often better if it is spaced.
Takeaway
I find that Packt books are usually "tutorials" that are expanded into "books". Of course, there are exceptions. This book is definitely an exception, as the author has put in great effort to illustrate all the relevant concepts of XGBoost with a ton of examples. Obviously, to learn the math behind XGBoost, you will have to look at other sources. But the book is definitely worth going through if you want to get a working knowledge of XGBoost.