Ensemble aggregation methods 2


Because of the random initial weights, different runs of the NN backtest give different results. This is annoying, but random weight initialization is a tool that helps to avoid getting stuck in local minima. Let’s suppose we train 21 different NNs with exactly the same parameters. Some of them will get stuck in local minima, some of them will not. In this post, we study how to aggregate the forecasted values of the ensemble members.
We also show in the chart the Buy&Hold and the Naive Learners (both the 2-bin and the 4-bin version) as we introduced them earlier.
For this experiment we set the parameters based on the result of the previous post.
OutputBoost = 1;
We are also curious how much changing these parameters weakens our previous good results.

We used the following aggregation strategies:

1. Return the most frequent sign:

forecasts.Select(r => Math.Sign(r)).Sum();

Note that in our previous Matlab study on aggregation, we noted:
“//SumSign() has lowest STD than Avg() (we measured it)”
That is, SumSign() had a lower STD than Avg(). Therefore, we used Sum(Sign) in all our previous experiments.

We prefer an odd number of ensemble members over an even one, to avoid the case where the number of positive forecasts equals the number of negative forecasts and they cancel each other out.
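The sign vote and the tie problem with an even member count can be sketched as follows. This is a minimal Python illustration rather than the code used in the backtests; the forecast values and the helper names `sign` and `sum_of_signs` are made up for the example.

```python
def sign(x: float) -> int:
    """Return -1, 0 or +1 for a forecast value."""
    return (x > 0) - (x < 0)


def sum_of_signs(forecasts) -> int:
    """Sum the member signs; the sign of the result is the majority vote."""
    return sum(sign(f) for f in forecasts)


# Hypothetical next-day %change forecasts from 5 ensemble members.
odd_ensemble = [0.4, -0.2, 0.1, 0.3, -0.5]
vote = sum_of_signs(odd_ensemble)    # 3 positive votes, 2 negative votes

# With an even member count the votes can cancel each other out.
even_ensemble = [0.4, -0.2, 0.1, -0.5]
tie = sum_of_signs(even_ensemble)    # 2 positive votes, 2 negative votes

print(vote, tie)
```

Note that an odd member count only rules out ties if no member forecasts exactly zero, since a zero forecast contributes no vote.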

2. The average:


The obvious aggregation is the average. Its disadvantage is that very badly trained members can forecast outliers, that is, values of very large magnitude. A single such outlier is enough to completely distort the aggregated decision of the ensemble. For example, suppose we have 21 members: 20 forecasts say +1% and 1 forecast says -21%. The average is (20 × 1% − 21%) / 21 ≈ −0.05%, which is negative.
Obviously, we should select the positive direction, but averaging the forecasts gives a negative prediction for the ensemble.
So we don’t expect this aggregation method to be the winner. But let’s test it.
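The 21-member example above can be checked with a few lines of Python. This is only an illustration of the arithmetic; the variable names are ours.

```python
# 20 members forecast +1 %, one badly trained member forecasts -21 %.
forecasts = [1.0] * 20 + [-21.0]

# Aggregation by average: the single outlier flips the direction.
average = sum(forecasts) / len(forecasts)

# Aggregation by sum of signs: the outlier counts as just one vote.
sign_vote = sum((f > 0) - (f < 0) for f in forecasts)

print(f"average = {average:.4f} %, sum of signs = {sign_vote}")
```

The average comes out slightly negative, while the sum of signs is clearly positive (20 votes against 1).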

3. The best trained ANN:

The idea is that after training 21 NNs, we keep the one with the smallest training error as a forecaster:


The smallest training error means that this NN gave the smallest MSE on the training set (200 samples minus the excluded outliers). Note that MSE says nothing about profitability or drawdown, so another potential method would be to keep the ANN with the highest profit or the lowest DD on the training set.
However, we haven’t implemented this strategy yet.
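A minimal sketch of the ‘best trained’ selection, assuming each member exposes its predictions on the training set and its forecast for the next day. The member names, targets and numbers below are hypothetical and only illustrate the selection rule.

```python
def mse(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical training targets and per-member training-set predictions.
targets = [0.5, -0.2, 0.1]
train_predictions = {
    "nn0": [0.4, -0.1, 0.0],
    "nn1": [0.5, -0.2, 0.3],
    "nn2": [1.5, -0.2, 0.1],
}
# Hypothetical next-day forecasts of the same members.
test_forecasts = {"nn0": 0.3, "nn1": -0.1, "nn2": 0.2}

# Keep only the member with the smallest training MSE as the forecaster.
errors = {name: mse(preds, targets) for name, preds in train_predictions.items()}
best = min(errors, key=errors.get)
ensemble_forecast = test_forecasts[best]

print(best, ensemble_forecast)
```

Swapping `mse` for a profit or drawdown measure on the training set would give the alternative selection rule mentioned above.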

For each cell in the following tables, we ran 5 tests and averaged the performance metric. For each ensemble strategy we ran this 3 times, so we have 3 cells with the same strategy, to check the consistency.

A. The target function is the next day %change
The performance if nEnsembleMembers = 5:
Portfolio Value and directional accuracy after 23 years:

The performance if nEnsembleMembers = 21:
Portfolio Value and directional accuracy after 23 years:

B. The target function is the next day %change direction (-1, +1, but normalized by STD normalization)
We felt obliged to test this, because our ensemble aggregations treat the forecasted output as a sign only. So we wanted to see what happens when the target output is the sign only, too.
The performance if nEnsembleMembers = 5:
Portfolio Value and directional accuracy after 23 years:

The performance if nEnsembleMembers = 21:
Portfolio Value and directional accuracy after 23 years:

– the Naive Learners are deterministic.

– the 4-bin Naive Learner is superior to the 2-bin version, as expected.

– some stochastic NN versions can beat the deterministic 4-bin Naive Learner. It is good to see that this happens. Without this result, one could argue that we should not complicate the forecast by training a complex NN, and should simply use the deterministic NL. However, note that even if the NL gave better performance than the NN in this case, we would still choose the NN. The reason is that in this simple case, with only a 1-dimensional input and a non-sparse input space, it is easy to create a deterministic NL forecaster. But increase the input dimension to 10 and try to create a 4-bin NL for those 10 dimensions. Each dimension is divided into 4 ranges, so for 10 dimensions the number of hypercubes is 4^10 = 1,048,576. And consider that we have only 200 samples in the training set: at most 200 hypercubes can contain a sample, so more than 99.9% of the NL statistics matrix would have no sampled value. What do you think about the predictive power of that discretization? 🙂
How would you predict for a testInput that belongs to a hypercube where there is no training sample at all? Use the neighbours? But which neighbours? It is easy to see that we get stuck with this approach. This approach would be purely local.
The good thing about the NN is that it can operate even in this complex (sparse and 10-dimensional) case.

– following the previous post, we made the NN learner less effective by changing the outputBoost from 30 to 1 and the maxEpoch from 20 to 49. Even with these less effective (not fine-tuned, not parameter-overfitted) settings, the NN is better than the deterministic NL (both 2 and 4 bins).

– we had high hopes for the ‘best trained ANN’ ensemble strategy, but it didn’t work well. In fact, it shows the worst performance.

– ‘the most frequent sign’ aggregation is still the winner. This is what we have used so far, and we keep it.

– it is preferable to learn the %change as the target and not the sign(%change). The %change gives the NN more information that it can use, and in this case more information means better prediction.
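The sparsity argument made above for a 10-dimensional, 4-bin Naive Learner can be illustrated with a small simulation. It assumes, purely for illustration, that each of the 200 training samples falls into a uniformly random bin in every dimension; real inputs would not be uniform, but the cell count is the same.

```python
import random

N_DIMS = 10       # input dimensions
N_BINS = 4        # ranges per dimension
N_SAMPLES = 200   # training set size

total_cells = N_BINS ** N_DIMS   # number of hypercubes: 4^10

# Assumption: samples are spread uniformly at random over the bins.
rng = random.Random(0)
occupied = {
    tuple(rng.randrange(N_BINS) for _ in range(N_DIMS))
    for _ in range(N_SAMPLES)
}

# Fraction of hypercubes that contain no training sample at all.
empty_fraction = 1 - len(occupied) / total_cells

print(f"{total_cells} cells, {len(occupied)} occupied, {empty_fraction:.4%} empty")
```

Whatever the input distribution, at most 200 of the 1,048,576 cells can be occupied, so the empty fraction is always above 99.9%.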

