Outlier treatment

This paper describes a mechanism for cleaning outliers from high-frequency data. The setting is NYSE TAQ (Trades and Quotes) data, and many of the initial data-cleaning filters are specific to NYSE. However, the mechanism for removing outliers is market agnostic. The key idea is to choose a neighborhood size k and a fudge factor gamma, and to compute a trimmed mean and standard deviation of the k neighboring prices. If a price deviates from the trimmed mean of its k neighbors by more than 3 standard deviations plus gamma, the observation is classified as an outlier; otherwise it is kept in the calculation.
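The rule can be sketched in a few lines of Python. This is a minimal illustration of the idea as summarized here, not the authors' implementation. The function name flag_outliers, the default values, the trim fraction (the share of neighbors dropped from each tail before computing the mean and standard deviation), the exclusion of the observation itself from its own neighborhood, and the way the window is shifted at the ends of the series are all assumptions made for this example.

```python
import numpy as np

def flag_outliers(prices, k=20, gamma=0.02, trim=0.10):
    """Return a boolean array, True where a price looks like an outlier.

    Assumed for illustration: the neighbourhood is the k observations
    around price i (excluding i itself), and `trim` is the fraction of
    neighbours discarded from each tail.
    """
    prices = np.asarray(prices, dtype=float)
    n = len(prices)
    is_outlier = np.zeros(n, dtype=bool)
    half = k // 2

    for i in range(n):
        # Window of up to k + 1 points containing i, shifted inward at the edges.
        lo = max(0, min(i - half, n - k - 1))
        hi = min(n, lo + k + 1)
        # Drop the observation itself so it cannot mask its own deviation.
        neighbours = np.delete(prices[lo:hi], i - lo)

        # Trim the most extreme neighbours from each tail.
        cut = int(trim * len(neighbours))
        kept = np.sort(neighbours)[cut:len(neighbours) - cut]
        if len(kept) < 2:
            continue  # too few points to estimate a standard deviation

        mean = kept.mean()
        std = kept.std(ddof=1)
        # Outlier if the deviation exceeds 3 standard deviations plus gamma.
        is_outlier[i] = abs(prices[i] - mean) > 3 * std + gamma

    return is_outlier
```

The fudge factor gamma matters when the neighboring prices are nearly constant: the standard deviation is then close to zero, and without gamma even a one-tick move would be flagged.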

To make the case for removing outliers, the authors estimate an autoregressive conditional duration (ACD) model on GE stock. Various values of k and gamma are evaluated, and the best pair is selected for outlier treatment. The model is then estimated on both the dirty data and the cleaned data. As expected, durations are underestimated in the dirty data, and the deseasonalized dirty data show less persistence than the deseasonalized clean data. The coefficient estimates of the conditional autoregressive equations also differ depending on which data are used. The technique is simple to understand and communicate; perhaps its strength lies in that simplicity.