Data normalization (minMax, STD) reloaded


About half a year ago, we experimented with normalizing the inputs and outputs in the Matlab version. We discovered that Matlab applies automatic normalization, but that it is not adequate for our purposes, and that Matlab's target normalization was even buggy. We learned our lesson then: instead of relying on the normalization mechanism of the NN framework (Matlab, Encog), we have to do the normalization ourselves. This is especially true because general frameworks know nothing about our special data.

Two of the most useful ways to standardize inputs are:
o Mean 0 and standard deviation 1 (we call it StD normalization)
o Midrange 0 and range 2 (i.e., minimum -1 and maximum 1) (we call it MinMax normalization)
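As a sketch (our own helper methods, not Encog or Matlab code), the two schemes look like this:

```java
// Illustrative implementations of the two normalization schemes above.
// Class and method names are ours, not from any framework.
public class Normalization {
    // StD normalization: mean 0, (population) standard deviation 1
    public static double[] stdNormalize(double[] x) {
        double mean = 0, var = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        for (double v : x) var += (v - mean) * (v - mean);
        double std = Math.sqrt(var / x.length);
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - mean) / std;
        return out;
    }

    // MinMax normalization: midrange 0, range 2 (minimum -1, maximum +1)
    public static double[] minMaxNormalize(double[] x) {
        double min = x[0], max = x[0];
        for (double v : x) { min = Math.min(min, v); max = Math.max(max, v); }
        double mid = (min + max) / 2, halfRange = (max - min) / 2;
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - mid) / halfRange;
        return out;
    }

    public static void main(String[] args) {
        double[] scaled = minMaxNormalize(new double[]{-0.04, 0.0, 0.04});
        System.out.println(scaled[0] + " " + scaled[1] + " " + scaled[2]);
    }
}
```

Note that both classic definitions re-center the data; as described later in this post, our study applied only the scaling part, not the centering.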

We highly recommend that the reader study this link from the Neural Network FAQ, in which a statistician answers the question:
“Should I normalize/standardize/rescale the data?”
Here are some quotes:
“There is a common misconception that the inputs to a multilayer perceptron
must be in the interval [0,1]. There is in fact no such requirement,
although there often are benefits to standardizing the inputs as discussed
below. But it is better to have the input values centered around zero, so
scaling the inputs to the interval [0,1] is usually a bad choice.”

– We have to mention that we don’t want to subtract the mean, because we would lose important information. (We tested this half a year ago in Matlab; we posted an article about normalization then.)
From the NN FAQ:

Standardization of cases should be approached with caution because it discards information. If that information is irrelevant, then standardizing cases can be quite helpful. If that information is important, then standardizing cases can be disastrous. ”
We reckon that subtracting the mean from the inputs changes a very important aspect: even a tiny shift can turn an Up day into a Down day. We think people react emotionally differently to yesterday’s Up days, even if they were up only by a tiny amount. Other bloggers (e.g., Michael Stokes) share this opinion. Therefore, we never want to convert an Up day into the Down day category. At least, our backtest in Matlab (half a year ago) showed that we shouldn’t do that.

– Inputs in the range -0.1..0.1 are bad for you.
Note this quote from the NN FAQ, which applies to our case (e.g., our currDayChange input of 1% was represented as the real number 0.01):
“It is also bad to have the data confined to a very narrow range such as [-0.1,0.1], as shown at lines-0.1to0.1.gif, since most of the initial hyperplanes will miss such a small region.”
To illustrate why, see a couple of images from the NN FAQ:
Good distribution of initial hyperplanes.
Bad distribution of initial hyperplanes (training will be slow, with a higher chance of getting stuck in local minima).
“Thus it is easy to see that you will get better initializations if the data are centered near zero and if most of the data are distributed over an interval of roughly [-1,1] or [-2,2].”
“The connection weights from the inputs to a hidden unit determine the orientation of the hyperplane. The bias determines the distance of the hyperplane from the origin. If the bias terms are all small random numbers, then all the hyperplanes will pass close to the origin. Hence, if the data are not centered at the origin, the hyperplane may fail to pass through the data cloud. If all the inputs have a small coefficient of variation, it is quite possible that all the initial hyperplanes will miss the data entirely. With such a poor initialization, local minima are very likely to occur. It is therefore important to center the inputs to get good random initializations. In particular, scaling the inputs to [-1,1] will work better than [0,1], although any scaling that sets to zero the mean or median or other measure of central tendency is likely to be as good, and robust estimators of location and scale (Iglewicz, 1983) will be even better for input variables with extreme outliers. ”
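The FAQ’s geometric argument can be checked with a quick Monte Carlo sketch (our own illustration, not from the FAQ): draw random hyperplanes w·x + b = 0 with weights and biases in [-1, 1], and count how often they actually cut through data confined within radius 0.1 of the origin versus radius 1.

```java
import java.util.Random;

public class HyperplaneInit {
    // Fraction of random hyperplanes w.x + b = 0 that intersect a ball of
    // radius r centered at the origin. The hyperplane's distance from the
    // origin is |b| / ||w||, so it cuts the ball iff that distance <= r.
    public static double hitFraction(double r, int trials, long seed) {
        Random rng = new Random(seed);
        int hits = 0;
        for (int t = 0; t < trials; t++) {
            double w1 = rng.nextDouble() * 2 - 1;   // weights in [-1, 1]
            double w2 = rng.nextDouble() * 2 - 1;
            double b  = rng.nextDouble() * 2 - 1;   // bias in [-1, 1]
            double dist = Math.abs(b) / Math.sqrt(w1 * w1 + w2 * w2);
            if (dist <= r) hits++;
        }
        return (double) hits / trials;
    }

    public static void main(String[] args) {
        System.out.printf("data in [-0.1,0.1]: %.2f of hyperplanes hit%n",
                hitFraction(0.1, 100_000, 42));
        System.out.printf("data in [-1,1]:     %.2f of hyperplanes hit%n",
                hitFraction(1.0, 100_000, 42));
    }
}
```

With the narrow data range, only a small fraction of random initial hyperplanes pass through the data cloud; with data spread over roughly [-1,1], most of them do.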

– In this post we compare MinMax and StD normalization. Note that in this study, we didn’t re-center the data; only scaling (multiplication) was applied.
We have two parameters: inputBooster and outputBooster.
In the MinMax normalization case, inputBooster = 10 means that every input was multiplied by 10. We wanted an almost equivalent case for StD normalization. By trial, we found the almost equivalent StD multiplier by dividing 10 by 7 (≈ 1.43). This gave the basis for comparing MinMax and StD normalization.

– Some parameters of the backtest:

int nRebalanceDays = 1;          // rebalance daily
int lookbackWindowSize = 200;    // train on the last 200 days
double outlierThreshold = 0.04;  // outlier filter: daily changes above 4%
int nNeurons = 2;                // hidden neurons
int maxEpoch = 20;               // training epochs

int[] nEnsembleMembers = new int[] { 11 };  // NNs per ensemble
int nTest = 7; // backtests per cell

In the figures, the maximum values are bolded. The averages of the rows and columns are also presented.
See the TR (Total Return) charts:
for the MinMax case:

for the StD case:

We may also compare the D_stat (directional accuracy) tables:
for the MinMax case:

for the StD case:

– If you look at the MinMax versions in the D_stat tables, you never find values above 57.0%. However, in the StD normalization case, there are 3 places where it is above 57%. That is (admittedly very weak) evidence that StD normalization is better than MinMax. 🙂

– Another reason why we like StD normalization:
Our inputs and outputs are not range-bounded variables, but (roughly) normally distributed random variables. Because our input is currDayChangePercent, a random variable, we cannot determine its max and min values in advance. Therefore, it is better to use standardization (normalization toward a standard normal distribution). In another prediction task, if we used dayOfWeek as an input (that input is not unbounded; its max and min values are known before running the process: 1 for Monday, 5 for Friday), we would use MinMax normalization, not StD normalization.
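This rule of thumb can be sketched as follows (hypothetical helper names; the dayOfWeek bounds are from the example above):

```java
public class NormalizationChoice {
    // MinMax for variables with a-priori known bounds, e.g. dayOfWeek in 1..5:
    // maps min -> -1 and max -> +1.
    public static double minMaxToPlusMinusOne(double x, double min, double max) {
        return 2.0 * (x - min) / (max - min) - 1.0;
    }

    // StD for unbounded random variables, e.g. currDayChangePercent:
    // divide by an estimated standard deviation (scale only, no re-centering,
    // as in the study above).
    public static double stdScale(double x, double std) {
        return x / std;
    }

    public static void main(String[] args) {
        System.out.println(minMaxToPlusMinusOne(1, 1, 5)); // Monday -> -1
        System.out.println(minMaxToPlusMinusOne(5, 1, 5)); // Friday -> +1
    }
}
```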

– The output is always a random variable, so we had better use StD normalization.

– TR table, inputBooster:
For MinMax normalization, the best inputBooster is 10 (minmax2).
For StD normalization, the best is inputBooster = 1.4 (std3).

– TR table, outputBooster:
For MinMax normalization, the best outputBooster is 50 (minmax2).
For StD normalization, the best is outputBooster = 28.57 (std3).

– One problem we didn’t understand at first: the output has to be scaled up by 50 to be optimal. Weird, isn’t it?
The question arose: can the NN generate outputs above 200?
For example, with outputBooster = 100, almost all training outputs will be bigger than 1, yet our activation function can generate values only in -1..+1.

– At first, we thought about converting the outputs to +1/-1; maybe learning the sign as the output is equivalent. We backtested it: No! Learning only the Sign() of the output results in TR = 25,000 and 11,000 (in 2 backtests).

– Our problem is solved: the NN can generate output = 200, because our output-layer activation function is linear. (We have 2 hidden neurons.) So output = ActivationFunction(bias_output * weight_output + neuron1_output * W_neuron1 + neuron2_output * W_neuron2). Since the activation function is linear and the weights can be anything, the output layer can generate values above 200. Note that the neuron1 and neuron2 outputs are confined to -1..+1 (because of the tanh() activation function). With proper training, the output scaling shouldn’t matter; still, the optimal outputBooster is about 30 times the STD. Why that is the case remains a mystery.
From the NN FAQ:
“If the target variable does not have known upper and lower bounds, it is not advisable to use an output activation function with a bounded range. You can use an identity output activation function or other unbounded output activation function instead; see Why use activation functions? ”
This is exactly what we did: having no (i.e., an identity) output activation function, our output is not bounded.
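A tiny sketch of the arithmetic above (the weights and inputs are made-up numbers, not from our trained network): two tanh hidden neurons feeding a linear output can easily emit values above 200.

```java
public class LinearOutputLayer {
    // Output layer as described above: an identity (linear) activation over
    // two tanh hidden neurons, so the network output is not bounded to [-1, 1].
    public static double output(double h1In, double h2In,
                                double w1, double w2, double bias) {
        double n1 = Math.tanh(h1In);       // hidden outputs confined to -1..+1
        double n2 = Math.tanh(h2In);
        return bias + w1 * n1 + w2 * n2;   // identity activation: unbounded
    }

    public static void main(String[] args) {
        // with large output weights, the net can emit values well above 200
        System.out.println(output(2.0, -1.5, 150.0, -120.0, 10.0));
    }
}
```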

– We should mention that a TR = 87,000 performance means multiplying the initial $1 by 870 over 23 years (35.46% CAGR). However, these kinds of backtest results are very much the product of parameter fine-tuning, and parameters fine-tuned on the past can never guarantee the same ‘best’ result in the future. So this very successful result is only theoretical.

– As a curiosity, we also tried to cap (clip, clamp) input values farther than 1 STD away at 1 STD. Our backtests show it didn’t perform as well: it showed -5% to -20% TR performance compared to the case in which we only did StD normalization without clipping the inputs. This suggests that by capping inputs, we lose precious data that can be useful for prediction. So, even though this is recommended in a couple of places, we don’t advise using it.
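For reference, the capping variant we tried (and rejected) can be sketched as a simple clamp (our own helper, not framework code):

```java
public class InputCapping {
    // Clamp any input farther than one standard deviation from zero
    // back to +/- 1 STD. This is the variant that underperformed above.
    public static double capToOneStd(double x, double std) {
        return Math.max(-std, Math.min(std, x));
    }

    public static void main(String[] args) {
        // a 3% daily change with std = 1% gets capped to 1%
        System.out.println(capToOneStd(0.03, 0.01));
    }
}
```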

– Conclusion:
We would like to use StD normalization instead of MinMax normalization, because:
– we find its performance a little bit better,
– it is better suited to our input (our input is a roughly normally distributed random variable, so its max and min values cannot be determined in advance),
– it is generally suggested by the literature.
– We will use inputBooster = 1 (for StD normalization).
– We will use StD normalization for the output too, with outputBooster = 30.
We also note that this looks like parameter back-fitting (hindsight): this value performed best in the backtest. The reason for its surprisingly high value could be numerical precision issues in the particular training process.


One Response to “Data normalization (minMax, STD) reloaded”

  1. Jack Sadowski

    Good study and well presented.
    Is there a chance you may post the code used, so readers may make their own comparisons? I think that would spur more discussion.
