Monday 25 May 2015

Accounting for Data Mining Bias

I've recently subscribed to this forexfactory thread, which is about using machine learning to develop trading systems, and the subject of data mining/data dredging has come up. This post is a short description of how mining/dredging can be accounted for, but readers should be aware that the following is not a precise description of any particular test with accompanying code, but rather a hand wavy description of a general but important principle.

Suppose one has conducted a whole series of tests on a particular set of data with a view to developing a trading system. The precise nature of this is not really important - it could be some machine learning approach, a grid search of moving average crossover parameter values, a series of elimination contests to find the "best" indicator, or whatever. While doing this we keep a record of all our results and when the search is complete we plot a histogram thus:-
which is the result of 160,000 distinct tests plotted in 200 bins. Naturally, having done this, we select the best system found, represented by the vertical cursor line at x-axis value 5.2. This 5.2 is our test metric of choice, be it Sharpe ratio, win to loss ratio, whatever. But then we ask ourselves whether we have truly found a world beating system or is this discovery the result of data mining?

To test this, we create a random set of data which has the same attributes as the real data used above. The random data can be obtained by Bootstrapping, random permutation, application of a Markov chain with state spaces derived from the original data etc. The actual choice of which to use will depend on the null hypothesis one wants to test. Having obtained our random data set, we then perform the exact same search as we did above and record the test metric of best performing system found on this random data set. We repeat this 160,000 times and then plot a histogram ( in red ) of the best test results over all these random data sets:-
We find that this random set has a mean value of 0.5 and a standard deviation of 0.2. What this red test set represents is the ability/power of our machine learning algo, grid search criteria etc. to uncover "good" systems in even meaningless data, where all relationships are, in effect, spurious and contain no predictive ability.

We must now suppose that this power to uncover spurious relationships also exists in our original set of tests on the real data, and it must be accounted for. For purposes of illustration I'm going to take a naive approach and take 4 times the standard deviation plus the mean of the red distribution and shift our original green distribution to the right by an amount equal to this sum, a value of 1.3 thus:-
We now see that our original test metric value of 5.2, which was well out in the tail of the non-shifted green distribution, is comfortably within the tail of the shifted distribution, and depending on our choice of p-value etc. we may not be able to reject our null hypothesis, whatever it may have been.

As I warned readers above, this is not supposed to be a mathematically rigorous exposition of how to account for data mining bias, but rather an illustrative explanation of the principle(s) behind accounting for it. The main take away is that the red distribution, whatever it is for the test(s) you are running, needs to be generated and then the tests on real data need to be appropriately discounted by the relevant measures taken from the red distribution before any inferences are drawn about the efficacy of the results on the real data.

For more information about data mining tests readers might care to visit a Github repository I have created, which contains code and some academic papers on the subject.