Plot the Encog NN surface, 3 insights

25Apr11

In this post, we look more closely at the trained Encog NN to gain more insight into its inner workings. This is something many amateur NN developers usually neglect; however, we think it is essential to understand how the NN works. This is why we still use small, simple NNs: to have a chance of understanding them.

1. Initiate weights to constant zero.

We had the idea that, to eliminate the stochastic nature of the NN, we would initialize the weight and bias matrix to deterministic values. The obvious choice would be to initialize all of them to zero. In Encog, the network.Reset() function sets up the initial weights randomly.
However, if the initial weights are left at zero (i.e. network.Reset() is not called), the NN still has zero weights after the Resilient or the Backprop training is finished. Oops!! Only the last biasWeight is non-zero after the training. This means the output is the same regardless of the input: bad, meaningless training. This is logical: with all-zero weights the NN surface is constant, and the gradient of every weight except the output bias is zero, so the training cannot move in any direction (see the sketch at the end of this section).
We could initialize the weights to some other constant (non-zero) non-random value, like 1 (or -1?), but it is easy to see that this would be arbitrary and would induce bias into the whole process (the result would depend on the chosen initial value). We prefer unbiased solutions.
It would be another parameter of the training, and we don’t want to add another parameter. Keep it simple: Occam’s razor. Another parameter would only complicate the process.
– Not to mention that starting the search in the weight space from a deterministic point hugely increases the chance that we find only a local minimum, not the global minimum, during the optimization.
This suggests that, honestly, we cannot eliminate this source of randomness.
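
To illustrate why the all-zero start cannot move, here is a minimal, library-free sketch (not our production code) of one backpropagation step on a 1-2-1 tansig network with squared error: with every weight at zero, the hidden activations are zero, so every gradient except the output-bias gradient vanishes.

```csharp
using System;

class ZeroInitDemo
{
    static void Main()
    {
        // 1 input -> 2 hidden (tanh) -> 1 output; all weights and biases start at zero.
        double[] wIn  = { 0.0, 0.0 };  // input -> hidden weights
        double[] bHid = { 0.0, 0.0 };  // hidden biases
        double[] wOut = { 0.0, 0.0 };  // hidden -> output weights
        double bOut = 0.0;             // output bias

        double x = 0.5, target = 1.0;  // one hypothetical training sample

        // Forward pass.
        double[] h = new double[2];
        for (int i = 0; i < 2; i++)
            h[i] = Math.Tanh(wIn[i] * x + bHid[i]);          // tanh(0) = 0
        double y = wOut[0] * h[0] + wOut[1] * h[1] + bOut;   // y = 0

        // Backward pass (squared error, linear output).
        double dOut = y - target;                            // non-zero
        for (int i = 0; i < 2; i++)
        {
            double gWOut = dOut * h[i];                            // = 0, because h[i] = 0
            double dHid  = dOut * wOut[i] * (1 - h[i] * h[i]);     // = 0, because wOut[i] = 0
            double gWIn  = dHid * x;                               // = 0
            double gBHid = dHid;                                   // = 0
            Console.WriteLine($"grad wOut[{i}]={gWOut}, wIn[{i}]={gWIn}, bHid[{i}]={gBHid}");
        }
        Console.WriteLine($"grad bOut={dOut}");              // the only non-zero gradient
        // So gradient descent (or RPROP) only ever updates bOut: the trained
        // network is a constant function, exactly as observed above.
    }
}
```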

2. maxEpoch dependence.
So far this year we have used maxEpoch=20 as a parameter. This value was determined by trial & error: we ran a couple of backtests and thought it was a good value (see previous posts). Here we study how the NN surface changes as we increase maxEpoch.
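
For reference, maxEpoch is simply the number of training iterations we allow. A minimal sketch of the training loop, assuming Encog 3's .NET API; the exact layer setup and the tiny data set here are our stand-ins, not the real backtest code:

```csharp
using System;
using Encog.Engine.Network.Activation;
using Encog.ML.Data;
using Encog.ML.Data.Basic;
using Encog.Neural.Networks;
using Encog.Neural.Networks.Layers;
using Encog.Neural.Networks.Training.Propagation.Resilient;

class MaxEpochSketch
{
    static void Main()
    {
        // Toy 1-2-1 tansig network, roughly the shape used in this post
        // (the exact layer/activation setup here is our assumption).
        var network = new BasicNetwork();
        network.AddLayer(new BasicLayer(null, true, 1));                   // input
        network.AddLayer(new BasicLayer(new ActivationTANH(), true, 2));   // hidden
        network.AddLayer(new BasicLayer(new ActivationTANH(), false, 1));  // output
        network.Structure.FinalizeStructure();
        network.Reset();                       // random initial weights (see section 1)

        // Hypothetical stand-ins for the real 200-day lookback training window.
        double[][] input = { new[] { -0.02 }, new[] { 0.01 }, new[] { 0.03 } };
        double[][] ideal = { new[] { -0.5 }, new[] { 0.3 }, new[] { 0.6 } };
        IMLDataSet trainingSet = new BasicMLDataSet(input, ideal);

        int maxEpoch = 20;                     // the parameter studied in this post
        var train = new ResilientPropagation(network, trainingSet);
        for (int epoch = 0; epoch < maxEpoch; epoch++)
            train.Iteration();                 // cost grows linearly with maxEpoch
        Console.WriteLine("Training error after " + maxEpoch + " epochs: " + train.Error);
    }
}
```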

We use the forecast of the first backtest day (that is day 201 of the 23-year data, because we use a 200-day lookback) for plotting the charts.
Averaging the outputs of the test set on day 201:
– when the input is under 0: -9.81
– when the input is over 0: 8.54
This forms the target function the NN wishes to approximate.
Note that these output values are the boosted outputs, so it is futile to try to interpret them as daily percent changes.
Based on these targets, we expect the target function T(x) to be:
T(x) < 0 if x < 0, and T(x) > 0 if x > 0.
That is, we expect T to be a monotonically increasing function.
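
For clarity, these conditional averages are just the mean target value over the negative-input and the positive-input samples. A minimal sketch of the computation, with hypothetical stand-in arrays rather than the real day-201 data:

```csharp
using System;
using System.Linq;

class TargetAverages
{
    static void Main()
    {
        // Hypothetical stand-ins: x[i] = today's %change (NN input),
        // y[i] = boosted next-day target (NN ideal output).
        double[] x = { -0.021, 0.013, -0.004, 0.027, -0.015 };
        double[] y = { -12.3, 9.1, -7.4, 10.2, -9.8 };

        var pairs = x.Zip(y, (xi, yi) => new { xi, yi });
        double avgUnder0 = pairs.Where(p => p.xi < 0).Average(p => p.yi);
        double avgOver0  = pairs.Where(p => p.xi > 0).Average(p => p.yi);

        // For the real day-201 window these came out as -9.81 and +8.54.
        Console.WriteLine($"Average target when input < 0: {avgUnder0:F2}");
        Console.WriteLine($"Average target when input > 0: {avgOver0:F2}");
    }
}
```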

Let’s see the NN function, the NN surface that tries to approximate this. In the next charts, the X axis spans from -10% to +10% as the current daily %change; at least, it is equivalent to this range, but the normalization distorts the X values (just ignore that).
MaxEpoch = 10:

MaxEpoch = 500:

Note that the MaxEpoch=10 plot is not as smooth as the other one; it seems the training process hasn’t converged yet. Also note the distribution of the output range: the smoother, more converged NN fills the Y range (from -2 to +2) more evenly. We like the maxEpoch=500 case better. It suggests that the NN has finally converged.

Let’s see some performance measurements as a function of the maxEpoch.
Portfolio Value ($) after 23 years:

The PV chart shows that maxEpoch=21 is optimal; however, we think it wins only because of some randomness (overfitting).

Directional Accuracy%:

Take a look at the D_stat chart. In the 9-10-11 case and in the 14-15-15 case it is very unstable, which we don’t like: it means the training is far from converged. The 19-20-21 case seems acceptable, it is not too volatile, but to be sure, we prefer even greater values for maxEpoch.

Avg Training Error (in % as defined by Encog):

For us, the answer is given by the Training Error chart. If we had ample computational resources, we would train for at least 100 epochs; that is a good trade-off. Training even further, to 500 epochs, gives a lower error, but the training time is multiplied by 5. Taking into account that increasing maxEpoch from 50 to 100 requires double the computational effort, we suggest using the maxEpoch=50 or the maxEpoch=40 case in the future.

3. Observe 40 different trainings.
Another argument against using only maxEpoch=20 (and for preferring a greater value).
Take the first forecasting day (day 201).
To repeat ourselves:
‘We use the forecast of the first backtest day (that is day 201, because we use a 200-day lookback) for plotting charts.
Averaging the outputs of the test set:
– when the input is under 0: -9.81
– when the input is over 0: 8.54’

On day 201, the current daily %change is negative (that is the test input), so T(x) should be negative, since x is negative. Therefore, we expect the NN to give a negative output.

We run 40 different trainings and measure the forecast value of the NN, with both maxEpoch=20 and maxEpoch=2000 as parameters.

29 out of 40 are negative (in the maxEpoch=20 case).
40 out of 40 are negative (in the converged maxEpoch=2000 case).

This also suggests that the maxEpoch=20 trained network has not yet converged.

So, as a conclusion, despite the fact that the maxEpoch=20 case produces the greatest profit, our reasons for using a higher, more converged value are:
– forecast instability (in these 40 different random trainings)
– the training error is high and unstable (see chart)
– D_stat is unstable (see chart)

Note that increasing maxEpoch is an O(n) operation in training time, so we cannot increase it too much. We would love to increase maxEpoch to 20,000 instead of 20, but that would mean 1000 times the computational time. So, we suggest using maxEpoch=40 or 50, and we also suggest using several different NNs in the ensemble (see the sketch below). This also ensures that even if one training is stuck in a local minimum, other NN trainings may still find the global minimum.
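
A minimal sketch of the ensemble idea, assuming Encog 3's .NET API; the network is built as in the earlier sketch, and averaging the member forecasts is just one simple way to combine them (our choice here, not necessarily the exact scheme we will use):

```csharp
using Encog.ML.Data;
using Encog.Neural.Networks;
using Encog.Neural.Networks.Training.Propagation.Resilient;

static class EnsembleSketch
{
    // Average the forecasts of several independently trained copies of the network.
    // 'network' is the small tansig BasicNetwork built as in the earlier sketch,
    // 'trainingSet' the current lookback window, 'testInput' today's %change.
    public static double EnsembleForecast(BasicNetwork network, IMLDataSet trainingSet,
                                          IMLData testInput, int members, int maxEpoch)
    {
        double sum = 0.0;
        for (int m = 0; m < members; m++)          // e.g. members = 40, as in this test
        {
            network.Reset();                       // fresh random starting weights
            var train = new ResilientPropagation(network, trainingSet);
            for (int epoch = 0; epoch < maxEpoch; epoch++)
                train.Iteration();
            sum += network.Compute(testInput)[0];  // this member's forecast
        }
        // One training stuck in a local minimum is diluted by the other members.
        return sum / members;
    }
}
```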

4. Tricky NN surfaces.
Just as an example, note a somewhat extreme NN function. Sometimes, even after 500 maxEpoch, we got this function.
The training set was:
Average under 0: -0.048
Average above 0: -0.040

We guess that it is a local minimum in the NN weight space; we doubt it is the global minimum. However, for a set of inputs, it gives a reasonable, correct response.
It is interesting that this kind of complex function can emerge even with a relatively low number of parameters: 2 neurons and 2 biases in the hidden layer, plus 1 bias in the output layer. This is only 5 weight values.

Or another one:
Average under 0: 0.056
Average above 0: 0.023

We reckon that even if the average value under 0 is positive, the far negative values produce very negative outputs.
It is easy to imagine: if today we have a huge -4% loss, that suggests that tomorrow we again have a negative %change (because this is probably a bearish environment). Maybe not as big a loss as today, but a small loss.

However, this also occurs if we train the NN on the next-day direction instead (binary training: +1 or -1 targets in the training set), because even in that context, a negative value today suggests another negative number (-1) tomorrow, because of the bearish environment.
So, because of this real-life behaviour, we cannot expect the NN surface to be a decreasing function.

An example of the NN surface for this binary-output training (+1/-1 outputs; the input is not binary) can be seen here:

Or another one here:
Average under 0: 0.010
Average above 0: 0.001
Note that both are positive, but honestly, don’t expect the NN function to be positive everywhere:

This is an amazing image. This behaviour (a small negative is bullish, but a big negative is bearish; a small positive is bullish, but a big positive is bearish) is what we expect most of the time and have explained in many of our previous posts. It is amazing because this continuous, smooth behaviour cannot be approximated by a simple 2-bin Naive Learner method; we need non-linear functions (like the NN’s tansig()) to represent this kind of behaviour. If you want to remember only one chart of what kind of function the NN represents, remember this one.
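
To see that such a bump-shaped surface really fits into a tiny tansig network, here is a library-free sketch with hand-picked, purely illustrative weights (a 1-2-1 network of the kind used in this post): the output is positive for small |x| and negative for large |x| of either sign.

```csharp
using System;

class BumpSurface
{
    static void Main()
    {
        // Hand-picked, purely illustrative weights for
        // f(x) = a1*tanh(w1*x + b1) + a2*tanh(w2*x + b2) + bOut.
        // With these values f is positive near x = 0 and negative for large |x|:
        // small moves are "bullish", big moves in either direction are "bearish".
        double w1 = 1.0, b1 = 2.0, a1 = 1.0;
        double w2 = 1.0, b2 = -2.0, a2 = -1.0;
        double bOut = -1.0;

        for (double x = -5.0; x <= 5.0; x += 1.0)
        {
            double f = a1 * Math.Tanh(w1 * x + b1) + a2 * Math.Tanh(w2 * x + b2) + bOut;
            Console.WriteLine($"x = {x,5:F1}   f(x) = {f,7:F3}");
        }
        // f(-5) ~ -1.0, f(0) ~ +0.93, f(+5) ~ -1.0
    }
}
```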

5. The outputBoost dependence.
So far we have used outputBoost = 30. It effectively means that the output was scaled so that the standard deviation was mapped to -30 and +30 (mean = zero). The usual NN literature suggests scaling the output SD to -1 to +1. However, by trial and error, we found that outputBoost=30 gives the best profit.
This was probably overfitting again.
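
For clarity, the boost is just a rescaling of the training targets so that one standard deviation maps to ±outputBoost around a zero mean. A minimal, library-free sketch of that idea (our reconstruction, with hypothetical numbers, not the exact backtest code):

```csharp
using System;
using System.Linq;

class OutputBoostSketch
{
    // Rescale targets so the mean maps to 0 and one standard deviation maps
    // to +/- outputBoost (30 so far; 1, i.e. no boost, from now on).
    static double[] Boost(double[] target, double outputBoost)
    {
        double mean = target.Average();
        double std = Math.Sqrt(target.Select(t => (t - mean) * (t - mean)).Average());
        return target.Select(t => (t - mean) / std * outputBoost).ToArray();
    }

    static void Main()
    {
        // Hypothetical next-day %change targets.
        double[] target = { -1.2, 0.4, 0.9, -0.3, 0.7 };
        Console.WriteLine(string.Join(", ", Boost(target, 30.0).Select(t => t.ToString("F2"))));
        Console.WriteLine(string.Join(", ", Boost(target, 1.0).Select(t => t.ToString("F2"))));
    }
}
```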

Although it produces better performance, we will not use it in the future: it distorts the training. We will use outputBoost = 1, that is, no boost at all.
See the LayerOutput values in the debugging process:

The outputs of the layers very quickly reach such high values that the TanSig() function produces either +1 or -1 (its boundary values), and no in-between values are produced. This is a meaningless result again, because the output is the same for all input values (the NN cannot discriminate). Note that this occurred because of the very high values in the weight matrix: -1154 and -994. With such high bias or neuron weights, the summation becomes a very high value, and tansig() maps virtually all inputs less than -4 or greater than +4 to its boundary value (+1 or -1). This happens very quickly as we increase the outputBoost. We want to avoid the weight matrix (neuron or bias) having such high values and the NN being unable to discriminate.
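
The saturation itself is easy to reproduce without Encog. A small library-free sketch: with a weight of ordinary size, tansig varies smoothly over typical daily %changes, while with a weight on the order of -1000 (like the -1154 and -994 seen in the weight dump) it collapses to +1 or -1 for practically every input.

```csharp
using System;

class TanhSaturation
{
    static void Main()
    {
        // Compare a modest weight with a huge one (illustrative stand-in for -1154).
        double smallWeight = 2.0;
        double hugeWeight = -1000.0;

        for (double x = -0.03; x <= 0.0301; x += 0.01)    // daily %changes around zero
        {
            double smooth = Math.Tanh(smallWeight * x);   // varies gradually with x
            double saturated = Math.Tanh(hugeWeight * x); // jumps straight to +1 or -1
            Console.WriteLine($"x = {x,6:F2}   tanh({smallWeight}*x) = {smooth,7:F4}   " +
                              $"tanh({hugeWeight}*x) = {saturated,7:F4}");
        }
        // With the huge weight, every non-tiny input lands on the +/-1 boundary,
        // so the NN surface degenerates into the staircase shown below.
    }
}
```
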
Because of that, look at the NN surface chart:

It is not smooth; it zigzags like a staircase.
We learned another important lesson here.
Although outputBoost=30 gave the highest profit, we object to using it in the future. We will not boost the output any more (outputBoost = 1).

These are insights that can only be gained by rigorously debugging, plotting, inspecting and understanding all the little details of this very simple NN. That is what is usually missing from other NN users’ work: they are too lazy to spend time on understanding it; they only want to use it. However, a NN is such a complex and delicate automaton that we don’t think great, reliable and stable results can be reached without understanding it.

In brief, we gained three insights here:
– we cannot eliminate randomness by initializing the weight matrix to zero.
– use maxEpoch=40 at least (or 50 or 100) instead of 20 (even if 20 produced the best profit).
– use outputBoost = 1 (no boost) instead of 30 (even if 30 produced the best profit).
