Data Leakage
The following is an excellent summary of Data Leakage in time series testing.
Prepare for a very long clarification; sorry for the length, but I very much hope you find it helpful!
Well, in the realm of data science, data leakage can be simply defined as when "data outside your training dataset is used to create the model". The end result of data leakage is that the model ends up knowing something it could not (or should not) possibly know at the training stage, and therefore your test/performance results are invalidated. There are a couple of key things to remember when designing ML pipelines and models to avoid data leakage:
At any given point in time, ask whether you and/or the model would actually have access to the given data. This is especially crucial in time-series problems, so I'm going to give you an expansive example. Say you're dealing with economic data to forecast stock prices. You can certainly use older values of the stock to forecast newer values, and you can do this with a very small latency compared to other sources of data. In fact, stock data is available within milliseconds, practically instantaneously, for financial firms anyway. For regular consumers, there is usually a 30-minute delay when viewing stock prices unless they're at a Bloomberg terminal. So let's say you're a retail investor (average joe) designing this kind of model. If you want to predict the next day's return for a stock based on data from the past week, that's fine. If you want to predict the closing price based on the opening price (ridiculous, I know) and it's midday, that's all copacetic. If, however, you want to use the last hour's worth of data to predict the next 5 minutes' performance, STOP: that's data leakage. Sure, in backtests you have access to continuous time-series data, but in practice you won't have the last hour's worth of data at any given point; it'll be delayed by 30 minutes. So when you go to use your model, it'll be expecting data you don't have, and it'll be unusable. If you decide to use it anyway, you're feeding in the wrong hour's worth of data, so you'll be predicting the return from 30 minutes ago, which is not what you want to be doing.

Another thing that could happen is saying "I'm going to predict the stock's return based on economic data, like the unemployment rate." Fine enough, good idea. Most economic variables (e.g. unemployment) are tallied and released on a monthly basis, so you get all this data and you say "I'm going to take the unemployment rate (or whatever) from month X and predict the return of the S&P in month X." Once again, it's time to STOP: that's data leakage. In backtests you have unemployment rates for every month and returns data for every month. In real time, though, that's not true. Whatever economic data series you get (say from FRED, a great data resource) will have, say, January 2019 tagged with an unemployment rate (4.0%), but that date refers to the month the unemployment rate describes, not when that figure was known or made available. So maybe you say, "OK, whatever man, I'll just use last month's unemployment to predict this month's stock return, and I'll calculate it on the 1st, that way we have all the data." Well, that's still wrong: unemployment data for month X is released on the first Friday of month X+1, so doing the calculation on the 1st would be using data you don't have yet and skewing your results.

This is why domain knowledge is so important in data science and machine learning. Without knowing the nature of this data, you might let leakage into your model without realizing it, and the model would be useless once it's built. This applies to data engineers and data cleaning specialists too: when you see a dataset on Kaggle or elsewhere, it's hard to know whether they had the correct domain knowledge (you hope they did) and whether the data was properly lagged or otherwise handled, so as to avoid handing you a useless or purely hypothetical dataset/task.
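To make the unemployment example concrete, here's a minimal sketch of aligning the series by when it was actually available rather than by the month it describes. The file and column names are made up for illustration, and it assumes both series share a consecutive monthly index:

```python
import pandas as pd

# Hypothetical inputs: a monthly unemployment series (indexed by the month it
# describes, the way FRED tags it) and monthly S&P 500 returns.
unemp = pd.read_csv("unemployment.csv", parse_dates=["date"], index_col="date")
returns = pd.read_csv("spx_monthly_returns.csv", parse_dates=["date"], index_col="date")

# Unemployment for month X isn't published until early in month X+1, so shift
# the series down one row (one month, given consecutive monthly rows) before
# joining: the value tagged "January" then lines up with February's return,
# the first month in which you could realistically have used it (assuming you
# build features at the end of the month, after the release).
unemp_available = unemp.shift(1)

# Join on the month the data could actually have been used.
dataset = returns.join(unemp_available, how="inner").dropna()
print(dataset.head())
```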
So normally, in regular data science/ML, you'd use traditional cross-validation: you partition your data randomly and use those partitions to train and test multiple models. When dealing with time series (I keep mentioning it, but leakage is a relatively huge concern within the time-series subfield), that would be invalid.

In traditional time-series analysis, the most basic models would be, say, martingale/Markov (predict the next value based only on the current value), autoregressive (things move in trends), and moving average (incorporate previous forecast errors into future estimates). Let's use an autoregressive (AR) series for this example, starting with a basic equation for an order-2 AR time-series regression:

X(t) = β(0) + β(1) * X(t-1) + β(2) * X(t-2) + ε(t)

This says that at time t, the variable X equals β(0) ("beta-0"), plus β(1) times the last value of X (at time t-1), plus β(2) times the value of X two time periods ago (at time t-2), where all the β values are coefficients determined by the model. That is the AR forecast; to complete the equation we also add ε(t), the error term/white noise that captures how far your model's prediction was from the true value, either because the data is naturally noisy or because there's additional information that could be used to forecast it that a simple AR model isn't considering. (Generally you want to lower the error until it represents only the noise that naturally occurs in the data; at that point you probably have a robust model.)

So if you have a dataset, you start with a series of continuous values for X: X = [5, 6, 7, 5, 8, 4, 6, 5, …]. Obviously, for a traditional machine learning model you're going to want to clearly distinguish the features in a table, so you make one (sorry for the quality of the table, but when Kaggle says they support markdown they don't really mean it); a short pandas sketch of how to build it follows right after the table:
| t | X(t-2) | X(t-1) | X(t) |
|---|--------|--------|------|
| 0 | NaN    | NaN    | 5    |
| 1 | NaN    | 5      | 6    |
| 2 | 5      | 6      | 7    |
| 3 | 6      | 7      | 5    |
| 4 | 7      | 5      | 8    |
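Here's a minimal sketch of building that lagged table with pandas and then fitting the AR(2) coefficients with ordinary least squares; the series is the toy one above, and the column names just mirror the table:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The toy series from above.
x = pd.Series([5, 6, 7, 5, 8, 4, 6, 5], name="X(t)")

# Build the lagged feature table: each row only uses values known by time t.
table = pd.DataFrame({
    "X(t-2)": x.shift(2),  # value two periods back
    "X(t-1)": x.shift(1),  # previous value
    "X(t)": x,             # target
})
print(table)  # the first two rows contain NaN, just like the table above

# Fit the AR(2) coefficients with ordinary least squares:
# beta(0) is the intercept, beta(1) and beta(2) are the slope coefficients.
train = table.dropna()
model = LinearRegression()
model.fit(train[["X(t-1)", "X(t-2)"]], train["X(t)"])
print(model.intercept_, model.coef_)
```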
So you have data where, for each time period t, you have two independent variables (X(t-2) and X(t-1)) and one dependent variable (X(t)). If you were to apply normal cross-validation to this data, you'd randomly split up the rows and use them to train the model. What's wrong with that? Well, if you shuffle a dataset like this naively, data leakage almost always occurs: you could end up training the model on row 3 (which tells the model the value of X at t=2 and t=1) and then testing it on row 1. BOOM, data leakage has occurred. The model has unfair access to data in the training stage that will influence its decisions in the test stage. In the real world, a model like this would never be trained on data that comes after the data it's being used to predict, and that would be unfair for multiple reasons.

Firstly: robust models like neural networks can effectively develop a memory of their inputs and outputs (which they sometimes use to overfit the data without actually learning underlying patterns, useful in certain novel situations but often detrimental to traditional ML work). Feeding such a model data from the future, which implicitly contains information about the past, means that information is "leaking" into the model and invalidating any test results on that data.

Secondly: if you train and test a model on data without regard to the time periods the splits come from, you could end up training a stock-trading algorithm on data from, say, the 2008 crash and then testing it on 2001. With the knowledge from the relatively unanticipated '08 crash in its experience, it would be better equipped to deal with the dot-com bust of '01, and you'd get a good test performance that is not representative of reality, because in reality you'd train it and then use it in chronological order. Any exposure to data "from the future" is leakage that can be detrimental. Say you're looking at a very long time frame and you train with data from the 2000s and then test on data from before 2000. The decade from 2000 to 2009 has become known as the "lost decade" in finance because from beginning to end the market didn't improve (which it had almost always done before), with the crashes in that range destroying tons of market value. If you expose the model to the behavior of the market during the 2000s, that's knowledge the model could never have gained by looking at data from the 1900s up until the year 2000; it was a new regime, and leakage of that information would invalidate your testing results and give you an overly positive opinion of your model.

It is for those reasons that the field of time series has developed its own versions of K-Fold cross-validation to avoid these issues. The most notable of these alternative methods is "Walk-Forward Cross-Validation", which is designed to be robust to the data leakage issues commonly seen in time-series data. There are other ways to get around data leakage in time-series cross-validation as well. Notably, Professor Marcos Lopez de Prado favors more traditional cross-validation over walk-forward; he has proposed his own version, "Combinatorial Purged Cross-Validation" (discussed a bit in this article), and talks in his book about methods of purging and selectively dropping data based on uniqueness and other metrics in order to perform more traditional CV techniques without data leakage leading to overfitting.
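As a quick illustration, here's a minimal sketch of walk-forward splits using scikit-learn's TimeSeriesSplit on stand-in data; each fold trains only on observations that come strictly before the ones it is tested on:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in data, already in chronological order (index 0 is the oldest point).
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# Walk-forward (expanding-window) splits: training indices always come strictly
# before test indices, so nothing "from the future" leaks into training.
tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X, y)):
    print(f"fold {fold}: train {train_idx.min()}-{train_idx.max()}, "
          f"test {test_idx.min()}-{test_idx.max()}")
```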
I know I focused primarily on time-series data in this answer, which may not necessarily be the most applicable to you. But overall, out of all the fields of data science and ML, time series has some of the highest (if not the highest) vulnerability to data leakage, so I think it provides a good opportunity to see how leakage happens and how it can be avoided.
I hope you found this answer helpful!