The paper "XGBoost: A Scalable Tree Boosting System" by Tianqi Chen and Carlos Guestrin came out in 2016, and since then XGBoost has been the go-to algorithm for classification and regression tasks, at least until deep learning implementations became available across various platforms. Of course one can build a very deep neural network, feed in the features, run backprop and get all the weights of the network. No feature engineering, no need to understand the data, no need to think through missing values; use a deep neural network and get your job done. In one sense, I think that is what draws many people towards neural networks. The fact that you get to meet your objective of minimizing out-of-sample error seems like nirvana. Why would one ever want to use classical statistical procedures? XGBoost, however, still seems to be one of the favorite choices of many ML practitioners. The technique is peculiar in the sense that it is not just an applied statistical technique; it incorporates a healthy dose of systems-design optimizations that seem to have given it a massive edge over similar algorithms.

I had been wanting to read this paper for quite some time. I have used XGBoost in my work but never managed to read the original paper that details the workings of the technique. Finally, thanks to a mini-break I took from work, I went through the paper, and here are some notes on it.

Introduction

The paper is divided into seven sections, the first being a primer on ML specific to boosting-related methods. The intro section motivates the reader by pointing to the success of XGBoost in various Kaggle competitions, with an emphasis on speed and accuracy as compared to other models. The paper lists the innovations that have made XGBoost better:

  • A novel tree learning algo for handling sparse data
  • Approximate tree learning based on a weighted quantile sketch procedure
  • Parallel and distributed computing
  • Out-of-core tree learning and an effective cache-aware block structure

Tree Boosting in a nutshell

All the relevant math is stated in this section. Boosting comprises building trees sequentially. If there are $K$ boosting rounds, then the function estimator takes the form $$ \hat y_i = \phi(\mathbf x_i) = \sum^K_{k=1} f_k(\mathbf x_i), \quad f_k \in \mathcal F $$ where $$ \mathcal F = \lbrace f(\mathbf x ) = w_{q(\mathbf x)} \rbrace \quad ( q : \mathbb R^m \to T,\ w \in \mathbb R^T) $$ is the space of regression trees; $q$ maps an observation to one of the $T$ leaves of a tree and $w$ holds the leaf weights.

Each $f_k$ corresponds to an independent tree structure $q$ and leaf weights $w$. The regularized objective to be minimized is $$ \mathcal L(\phi) = \sum_i l(\hat y_i, y_i) + \sum_k \Omega(f_k), \quad \Omega(f) = \gamma T + {1 \over 2} \lambda ||w||^2 $$ The objective combines $l$, a convex loss function measuring the discrepancy between prediction and target, with a regularization term $\Omega$ that penalizes the complexity of each tree.

Since the trees are built additively and the predictions of the previous boosting rounds are reused, the objective at round $t$ can be written as $$ \mathcal L^{(t)} = \sum_i l\big(y_i,\ \hat y_i^{(t-1)} + f_t(\mathbf x_i)\big) + \Omega(f_t) $$

Using a second-order Taylor expansion and dropping constant terms, one can rewrite the above expression as $$ \tilde{\mathcal L}^{(t)} = \sum^n_{i=1} \left[ g_i f_t(\mathbf x_i) + {1 \over 2} h_i f_t^2(\mathbf x_i) \right] + \Omega(f_t) $$ where $g_i$ and $h_i$ are the first- and second-order gradients of the loss with respect to the previous round's prediction $\hat y_i^{(t-1)}$.

The above expression can be further simplified to find the optimal weight $w_j^\ast$ of leaf $j$; the resulting optimal objective value is also used to construct a splitting criterion.
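
For reference, the closed-form results derived in the paper are, with $I_j$ the set of instances falling in leaf $j$, $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$: $$ w_j^\ast = -\frac{G_j}{H_j + \lambda} $$ and the gain of splitting a node into left and right children $$ \mathcal L_{split} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma $$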

Where is the Gradient step in Boosting?

In the splitting criterion, the terms that appear in the expression are gradient statistics computed from the previous boosting round's predictions; for squared error loss these are simply the residuals. Each boosted tree's prediction is nothing but a gradient descent step in the predictor space. The best way to understand the previous statement is to spend some time going through the math; the squared-error case below makes it concrete.
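
A quick sanity check with squared error loss $l(y, \hat y) = \frac{1}{2}(y - \hat y)^2$: here $g_i = \hat y_i^{(t-1)} - y_i$ (the negative residual) and $h_i = 1$, so the optimal leaf weight becomes $$ w_j^\ast = \frac{\sum_{i \in I_j} \big(y_i - \hat y_i^{(t-1)}\big)}{|I_j| + \lambda} $$ a shrunken average of the residuals in that leaf. Adding it to the previous prediction is exactly a step along the negative gradient of the loss, taken in function space.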

Behind all the math

The intuition behind all the math is this: one looks at the residuals (more precisely, the gradients and hessians of the loss) at each node, and uses a splitting criterion that depends on the number of observations at the node, the regularization parameters and the gradient statistics. Once the splitting criterion is computed for each candidate split, the best split is chosen and the weight of each leaf is computed from the same gradient and hessian quantities. Each update to the previous prediction is a gradient descent step in the predictor space. Once the math is clear in your head, verbalizing it becomes very easy. A stripped-down sketch of one boosting round follows.
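
The sketch below is a minimal, self-contained illustration of a single boosting round for squared error loss, using a depth-1 tree on one feature. It only mirrors the formulas above and is not the paper's implementation; all function names and hyperparameter values are mine, chosen for illustration.

```python
import numpy as np

# Hyperparameters (illustrative values only)
LAMBDA = 1.0   # L2 regularization on leaf weights
GAMMA = 0.0    # complexity penalty per leaf
ETA = 0.3      # shrinkage / learning rate

def leaf_weight(g, h):
    """Optimal leaf weight: w* = -G / (H + lambda)."""
    return -g.sum() / (h.sum() + LAMBDA)

def gain(g, h, left):
    """Gain of splitting a node into the {left} and {~left} children."""
    def score(gs, hs):
        return gs.sum() ** 2 / (hs.sum() + LAMBDA)
    return 0.5 * (score(g[left], h[left]) + score(g[~left], h[~left])
                  - score(g, h)) - GAMMA

def boost_one_round(x, y, y_prev):
    """One boosting round with a depth-1 tree on a single feature x."""
    # Squared error loss: g_i = y_prev_i - y_i, h_i = 1
    g = y_prev - y
    h = np.ones_like(y)

    # Exact greedy split finding: try every distinct value as a threshold
    best_gain, threshold = max(
        (gain(g, h, x < t), t) for t in np.unique(x)[1:])

    left = x < threshold
    update = np.where(left,
                      leaf_weight(g[left], h[left]),
                      leaf_weight(g[~left], h[~left]))
    return y_prev + ETA * update

# Toy usage: fit a noisy sine curve with 20 boosting rounds
rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=200)
pred = np.zeros_like(y)
for _ in range(20):
    pred = boost_one_round(x, y, pred)
print("MSE after 20 rounds:", np.mean((y - pred) ** 2))
```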

Shrinkage and Column subsampling

The method also entails two additional techniques, both of which show up directly as parameters in the library (see the snippet after the list):

  1. a learning-rate parameter $\eta$ (shrinkage) that scales each new tree's contribution so that no individual tree dominates the previously boosted trees
  2. subsampling of the features, similar to the random forest method, which reduces variance and also speeds up computation
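
In the xgboost Python package these two techniques surface as the eta and colsample_bytree parameters. A minimal sketch with illustrative values (not tuned recommendations):

```python
import numpy as np
import xgboost as xgb

# Toy regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=1000)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",
    "eta": 0.1,               # shrinkage: scale each tree's contribution
    "colsample_bytree": 0.8,  # column subsampling, a la random forests
    "max_depth": 4,
    "lambda": 1.0,            # L2 regularization on leaf weights
    "gamma": 0.0,             # minimum gain (complexity penalty) to split
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```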

Split Finding Algorithms

One of the main innovations of this technique is an efficient way to find splits on big data. The exact greedy way of enumerating every possible split is time consuming and will not work for huge datasets. The basic idea behind the approximate algorithm is that instead of looking at every possible split for every feature, candidate split points are constructed from quantiles of each feature's distribution and only these candidates are explored for splitting the data. The construction of the quantiles can be parallelized: spread the data across multiple machines, construct approximate quantiles on each, merge them, and then look for splits based on those quantiles. The default sketch accuracy corresponds to roughly 33 candidate quantiles per feature. These are not the usual quantiles containing the same number of observations; the cuts are weighted by the hessian values, and hence the procedure is called the weighted quantile sketch algorithm. The appendix of the paper gives all the mathy details behind the weighted quantile algorithm; a simplified illustration of the weighting idea is sketched below.
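
The paper's weighted quantile sketch is a mergeable, streaming data structure with provable approximation guarantees. The snippet below is only a simplified in-memory illustration of the core idea, namely that candidate cut points are chosen so that each bucket carries roughly the same total hessian weight; the function and variable names are mine, not the paper's.

```python
import numpy as np

def weighted_quantile_candidates(feature, hessian, eps=0.03):
    """Pick candidate split points so that each bucket holds
    roughly an eps-fraction of the total hessian weight."""
    order = np.argsort(feature)
    x, h = feature[order], hessian[order]

    # Weighted rank of each point, normalized to [0, 1]
    rank = np.cumsum(h) / h.sum()

    # Walk through the sorted points and emit a candidate each
    # time the accumulated weight crosses another eps of the total.
    candidates, next_rank = [], eps
    for xi, ri in zip(x, rank):
        if ri >= next_rank:
            candidates.append(xi)
            next_rank += eps
    return np.array(candidates)

# Toy usage: for squared error the hessian is constant, so the
# weighted candidates reduce to ordinary equal-count quantiles.
rng = np.random.default_rng(0)
feature = rng.exponential(size=10_000)
hessian = np.ones_like(feature)
print(len(weighted_quantile_candidates(feature, hessian)))  # ~ 1/eps, i.e. about 33
```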

The algos mentioned in this section are:

  • Basic Exact Greedy algo
  • Approximate algo
  • Weighted Quantile sketch
  • Sparsity-aware split finding (a simplified sketch of this idea follows the list)
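
Sparsity-aware split finding deserves a special mention: gradient statistics are accumulated only over the non-missing entries, and every candidate split is evaluated twice, once sending all missing values to the left child and once to the right, keeping whichever default direction gives the larger gain. The sketch below is my simplified rendering of that idea, not the paper's pseudocode.

```python
import numpy as np

LAMBDA, GAMMA = 1.0, 0.0

def _score(g_sum, h_sum):
    return g_sum ** 2 / (h_sum + LAMBDA)

def best_split_with_default_direction(x, g, h):
    """Evaluate splits using only non-missing entries of x; missing
    values are routed to whichever child yields the larger gain."""
    present = ~np.isnan(x)
    G_total, H_total = g.sum(), h.sum()
    G_miss = G_total - g[present].sum()
    H_miss = H_total - h[present].sum()

    best = (-np.inf, None, None)  # (gain, threshold, default direction)
    for t in np.unique(x[present])[1:]:
        left = present & (x < t)
        G_L, H_L = g[left].sum(), h[left].sum()
        G_R, H_R = G_total - G_miss - G_L, H_total - H_miss - H_L

        # Option 1: missing values go left; option 2: they go right
        for G_l, H_l, G_r, H_r, direction in (
            (G_L + G_miss, H_L + H_miss, G_R, H_R, "left"),
            (G_L, H_L, G_R + G_miss, H_R + H_miss, "right"),
        ):
            split_gain = 0.5 * (_score(G_l, H_l) + _score(G_r, H_r)
                                - _score(G_total, H_total)) - GAMMA
            if split_gain > best[0]:
                best = (split_gain, t, direction)
    return best

# Toy usage: 20% missing values, squared error gradients
rng = np.random.default_rng(0)
x = rng.normal(size=500)
x[rng.random(500) < 0.2] = np.nan
y = (np.nan_to_num(x) > 0).astype(float)
g = 0.5 - y            # residuals from a constant prediction of 0.5
h = np.ones_like(y)
print(best_split_with_default_direction(x, g, h))
```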

System Design

This part of the paper emphasizes the engineering aspects that give the algorithm its accuracy and speed:

  • cache aware access
  • blocks for out-of-core computation

End-to-end evaluations

The authors evaluate the performance of XGBoost on four datasets:

  • Insurance claim classification
  • Event classification (Higgs boson)
  • Learning to rank
  • Click-through rate prediction

On all of these massive datasets, XGBoost delivers far better speed than the competing algorithms, with comparable or superior accuracy.

The paper also has a table comparing the major tree boosting systems. The table serves as a good reference to get an idea of all the features that have been incorporated into XGBoost:

| System | Exact greedy | Approximate global | Approximate local | Out-of-core | Sparsity-aware | Parallel |
|---|---|---|---|---|---|---|
| XGBoost | yes | yes | yes | yes | yes | yes |
| pGBRT | no | no | yes | no | no | yes |
| Spark MLLib | no | yes | no | no | partially | yes |
| H2O | no | yes | no | no | partially | yes |
| scikit-learn | yes | no | no | no | no | no |
| R GBM | yes | no | no | no | partially | no |

Takeaways

Going through this paper was checking off a TO-DO list item that had been pending for quite some time. The weighted quantile sketch algorithm is my biggest learning from the paper.

XGBoost is a sequential boosting process, so how can one parallelize this learning process for large datasets? If this question bugs you, it might be worth your time to go over this paper and understand all the innovative algos that make XGBoost a fast and accurate learner.