Removing Outliers


When training an ANN, it is worth visually inspecting everything for unexpected anomalies: the distribution of the inputs, the distribution of the outputs, and the weights. When visualizing the strength of the weights, you can spot whether a weight is too weak (its input is irrelevant, so you can drop that input and save computation) or too strong (the ANN is overtrained; it has memorized the samples and developed an affinity for a specific input, perhaps because the input distribution was not good enough). It is also worth visualizing deterministic (non-ANN) strategies based on statistical inference, such as the naive algorithm: ‘Go long on Wednesday if the average of the previous 40 Wednesdays is positive’.
We created 5 portfolios, one for each day of the week. For example, the Monday strategy goes long only on Mondays and is in cash on every other day. The cumulative return from 1987 for the SPX and for the RUT is:

We annotated the chart. The critical period is the 2008 crash, when volatility went sky high and the RUT saw -9% daily changes. We reckon that the -9% daily drop on a Wednesday did not happen because it was a Wednesday; it was an unfortunate chance event that happened to fall on that day. This behaviour doesn’t fit into our model, into our world simulation. It was just randomness. However, it affected our ANN training for the next 200 days (we use a 200-day lookback window for collecting training samples).
Even if we increased the training lookback window to 3 years, the average Wednesday performance would still be negative, and the ANN would infer that Wednesdays are bearish days.
But that is far from the truth (as can be seen on the chart).
That is the disaster outliers can cause.
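
To make the effect concrete, here is a minimal sketch of the naive Wednesday rule. The function name and the data are hypothetical; the returns are synthetic, not taken from the SPX or the RUT. It only shows how a single -9% day can flip the sign of a 40-sample average.

```python
from statistics import mean

def naive_weekday_signal(weekday_returns, lookback=40):
    """Go long (True) if the mean of the last `lookback` returns is positive."""
    window = weekday_returns[-lookback:]
    return mean(window) > 0

# Synthetic example: 39 quiet Wednesdays of +0.1% each ...
calm = [0.001] * 39

# ... followed by either one more quiet Wednesday or one -9% crash day.
signal_without_crash = naive_weekday_signal(calm + [0.001])
signal_with_crash = naive_weekday_signal(calm + [-0.09])
average_with_crash = mean(calm + [-0.09])
```

In this toy series, 39 mildly positive Wednesdays cannot outweigh one crash day, so the rule turns bearish until the crash day leaves the lookback window.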

There are two solutions in the ANN literature for handling outliers. One is removing them from the samples; the other is clipping (capping, trimming) them to a maximum threshold (like 3 times the standard deviation, 3xDmax).
Outlier detection and outlier treatment are two separate things.

1. Outlier detection
Several methods exist. A sample can be flagged as an outlier if:
– it is X std away from the arithmetic mean (X=2,3,4,5)
– it is X MADs (median absolute deviations) away from the median (X=2,3,4,5)
– it is a constant X away from the arithmetic mean (chosen from the problem domain: X=3%, 4%, 5% in our case)
– K-nearest neighbour filtering marks it, for high dimensional cases where the simple std is not a sufficient metric of the distribution
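
The first three detection rules can be sketched in a few lines. This is an illustration under assumptions, not the code used for the backtests: the function names are made up here, the population std is used, and the K-nearest neighbour variant is left out.

```python
from statistics import mean, median, pstdev

def std_outliers(xs, k=3.0):
    """Flag values more than k standard deviations from the arithmetic mean."""
    m, s = mean(xs), pstdev(xs)
    return [abs(x - m) > k * s for x in xs]

def mad_outliers(xs, k=3.0):
    """Flag values more than k median absolute deviations from the median."""
    med = median(xs)
    mad = median([abs(x - med) for x in xs])
    return [abs(x - med) > k * mad for x in xs]

def fixed_outliers(xs, threshold=0.05):
    """Flag values more than a fixed constant away from the arithmetic mean."""
    m = mean(xs)
    return [abs(x - m) > threshold for x in xs]

# Synthetic daily returns with one -9% day at index 6.
returns = [0.01, -0.005, 0.002, 0.004, -0.003, 0.006, -0.09, 0.001, -0.002, 0.003]

by_std = std_outliers(returns, k=2.0)
by_mad = mad_outliers(returns, k=3.0)
by_fixed = fixed_outliers(returns, threshold=0.05)
```

Note that the outlier itself inflates the std: in this 10-sample toy series the -9% day is caught at k=2 but not at k=3, while the median-based rule is barely affected by it.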

The problem with mean+std outlier detection is that we use a 200-day rolling window. As the window moves into the highly volatile 2008 crash period, its std increases from 1.5% to 3.2%, so the 3-std threshold becomes 9.6%. After a while, a -7% daily loss was accepted as a non-outlier. A solution would be a fixed % threshold calculated from the total std of the 25 years, not only from the rolling 200-day window.
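
The drift of the rolling threshold can be reproduced with a small sketch. The helper `rolling_threshold` is hypothetical and the series is synthetic, built so that the window std is 1.5% in the calm part and 3.2% in the volatile part; the mean is assumed to be close to zero, so the loss is compared directly with the threshold.

```python
from statistics import pstdev

def rolling_threshold(returns, window=200, k=3.0):
    """Return k * std of each full trailing window, oldest window first."""
    return [
        k * pstdev(returns[end - window:end])
        for end in range(window, len(returns) + 1)
    ]

# Synthetic returns (not market data): 200 calm days with a 1.5% std,
# followed by 200 volatile days with a 3.2% std.
calm = [0.015, -0.015] * 100
volatile = [0.032, -0.032] * 100
thresholds = rolling_threshold(calm + volatile)

daily_loss = -0.07
flagged_before_crash = abs(daily_loss) > thresholds[0]    # window is all calm
flagged_late_in_crash = abs(daily_loss) > thresholds[-1]  # window is all volatile
```

With the calm window the threshold is 3 x 1.5% = 4.5% and the -7% day is flagged; once the window is filled with crash data the threshold is 3 x 3.2% = 9.6% and the same -7% day passes as normal.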

2. Outlier treatment

Basically, we can either clip these %gain values to 2 std away from the mean or exclude them from the samples.
However, note this warning from one site:

Deletion of outlier data is a controversial practice frowned on by many scientists and science instructors; while mathematical criteria provide an objective and quantitative method for data rejection, they do not make the practice more scientifically or methodologically sound, especially in small sets or where a normal distribution cannot be assumed. Rejection of outliers is more acceptable in areas of practice where the underlying model of the process being measured and the usual distribution of measurement error are confidently known. An outlier resulting from an instrument reading error may be excluded but it is desirable that the reading is at least verified.

OK, we saw that warning.
“Rejection of outliers is more acceptable in areas of practice where the underlying model of the process”… very important.

The Wikipedia article is also worth a look:

In most larger samplings of data, some data points will be further away from the sample mean than what is deemed reasonable. This can be due to incidental systematic error or flaws in the theory that generated an assumed family of probability distributions, or it may be that some observations are far from the center of the data. Outlier points can therefore indicate faulty data, erroneous procedures, or areas where a certain theory might not be valid.

We will exclude the outliers on the reasoning that in the highly volatile 2008 crash environment, the system we want to model stopped working (our theory was not valid then). The day-of-the-week system suffered a systematic error during that period. To be candid, nobody really thinks that -7% and +11% daily moves can be explained by the day-of-the-week process. Because our system suffered a systematic error in that period, we feel justified in excluding those outliers: they do not represent useful information for the ANN, which is meant to learn the day-of-the-week process.

We decided on a fixed 5% outlier threshold. If the absolute value of the next day %gain is bigger than 5%, we deem the sample to be an outlier.
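
A minimal sketch of the fixed 5% rule, showing both treatments discussed earlier (exclusion and clipping). The sample layout is an assumption made for illustration: each sample is a pair of a weekday number and the next day %gain as a fraction, and the values are synthetic.

```python
def exclude_outliers(samples, threshold=0.05):
    """Drop samples whose next-day return exceeds the threshold in absolute value."""
    return [(x, y) for x, y in samples if abs(y) <= threshold]

def clip_outliers(samples, threshold=0.05):
    """Keep every sample, but cap the next-day return at +/- threshold."""
    return [(x, max(-threshold, min(threshold, y))) for x, y in samples]

# Hypothetical training samples: (weekday 1..5, next-day return).
samples = [(1, 0.004), (2, -0.012), (3, -0.09), (4, 0.11), (5, 0.003)]

kept = exclude_outliers(samples)
clipped = clip_outliers(samples)
```

Exclusion shrinks the training set (here from 5 samples to 3), while clipping keeps the sample count but still lets the extreme days pull the average in their direction, only less strongly.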
We ran different experiments with different normalization settings. To emphasize that an ANN can produce different results every time it is run, we present 2 different results for each backtest where we can. Remember, this is because of the initial randomization of the weights. And be very cautious about any ANN backtest presented in journals in which the author shows only a single (usually ‘successful’) run of his algorithm.

Test1: no normalization of the targets

These are quite remarkable results. The Total Return improved from 40% to 200%, the CAGR from 11% to 26%, and, most promising of all, the directional accuracy improved from 51.8% to 53%. Let’s see what happens if we test with normalized (detrended) training.

Test2: with normalization of the target

Not much difference compared to the non-normalized case. It is a little better in terms of CAGR and TR, but the difference is not significant. That is not a surprise, considering that ‘normalization 3’ is equivalent to no normalization (it actually normalizes with itself).

The meaning of the different normalizations we tried (skip this, not too interesting).

Normalization 1:
subtract the mean of the non-decimated samples (fixed 200 samples) before training
add the mean of the non-decimated last fixed 140 samples after forecasting (outliers included)

Normalization 2:
subtract the mean of the decimated samples (maybe fewer than 200 samples) before training
add the mean of the non-decimated last 140 samples after forecasting (outliers included)

Normalization 3:
subtract the mean of the decimated samples (maybe fewer than 200 samples) before training
add the mean of the decimated samples (maybe fewer than 200 samples) after forecasting
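
The three variants differ only in which mean is subtracted before training and which mean is added back after forecasting. The sketch below is one reading of the descriptions above, not the original code: it assumes that ‘decimated’ means ‘after removing outliers with the fixed 5% rule’, and the function name and sample data are hypothetical.

```python
from statistics import mean

def normalization_offsets(targets, variant, threshold=0.05, last_n=140):
    """Return (mean subtracted before training, mean added after forecasting)."""
    decimated = [t for t in targets if abs(t) <= threshold]  # outliers removed
    if variant == 1:
        return mean(targets), mean(targets[-last_n:])
    if variant == 2:
        return mean(decimated), mean(targets[-last_n:])
    if variant == 3:
        m = mean(decimated)
        return m, m
    raise ValueError("variant must be 1, 2 or 3")

# Synthetic window of 200 targets: +1% every day except one -9% outlier.
targets = [0.01] * 100 + [-0.09] + [0.01] * 99

sub1, add1 = normalization_offsets(targets, 1)
sub2, add2 = normalization_offsets(targets, 2)
sub3, add3 = normalization_offsets(targets, 3)
```

In variant 3 the same value is subtracted and added back, so the offset cancels out.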

Compare performance charts.
Before decimation:

After decimation:

Because we used a large 5% outlier threshold, the 2 charts are the same at the beginning: no samples were excluded in the period before the 2008 crash. During the crash, however, there were occasions when the 5% threshold excluded 30 out of 200 samples as outliers. That was a very volatile environment.
Our notes are in the charts. The chart improves after the 2008 crash, because those outliers no longer distort the statistical distribution of the process.

In the future we may test outlier elimination with different thresholds, and we may extend our backtest back to 1990. We are happy with the 26% annual gain and the 53% directional accuracy. These are our best results so far. With 26% CAGR, there is a temptation to play it live. 🙂
And remember, we still use a very simple 2-neuron network with the simplest 1-dimensional input. With a little bit of imagination, we can fantasize about a great ANN prediction system once we start to add more inputs to the ANN, such as previous day %return, volume, VIX, RSI2, etc. As we add more and more inputs, we expect the ANN to perform gradually better. However, we don’t rush. We take the time and enjoy the 1-dimensional case for studying and testing the ANN’s possibilities and limitations. And for that, there is no better way than the 1-dimensional input.

The bottom line: it was useful, so we will use outlier elimination in the future.

