### Input encoding: 1 dimensional variations

Consider the abstract of the article “Neural network encoding approach comparison: an empirical study”:

“The authors report the results of an empirical study about the effect of input encoding on the performance of a neural network in the classification of numerical data. Two types of encoding schemes were studied, namely numerical encoding and bit pattern encoding. Fisher Iris data were used to evaluate the performance of various encoding approaches. It was found that encoding approaches affect a neural network’s ability to extract features from the raw data. Input encoding also affects the training errors, such as maximum error, root square error, the training times and cycles needed to attain these error thresholds. It was also noted that an encoding approach that uses more input nodes (more dimensions?) to represent a single parameter generally can result in relatively lower training errors for the same training cycles (but more epochs necessary to train?).”

**In our previous studies, we mapped the days of the week to numbers.** Our straightforward mapping was:

Monday: 1

Tuesday: 2

Wednesday: 3

Thursday: 4

Friday: 5

The problem is that the input is cyclical: after Friday, the next day is Monday. However, we have to map this cyclical input onto a linear one. In real life Friday’s neighbour is Monday, but in our encoding Monday (1) and Friday (5) sit at opposite ends of the range. We have cut the cycle somewhere. Cutting between Friday and Monday was the obvious choice, but maybe not the best one. Whenever we flatten the cycle into a 1-dimensional line, we have to put the discontinuity somewhere. For example, we can cut between Monday and Tuesday instead, so that the Monday-Tuesday neighbour relationship disappears.

Note that with this encoding, 0 never occurs as an input. We don’t know whether that is good or bad, but we assume it is insignificant, because the default newff() in Matlab uses the ‘mapminmax’ function to preprocess the input to the range -1..+1. We believe that the input preprocessing works in Matlab, because we stepped through the code in the debugger. We know from previous studies that the output preprocessing doesn’t work in Matlab as it is supposed to, so we preprocess the output ourselves, but we leave the input preprocessing to Matlab.
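To make the -1..+1 preprocessing concrete, here is a minimal sketch of min-max scaling applied to the weekday codes. It computes the scaling by hand instead of calling mapminmax, so it only illustrates the default behaviour we assume newff() applies; the variable names are ours.

```matlab
% Manual min-max scaling to [-1, +1]; a stand-in for the default
% mapminmax preprocessing, not a call into the toolbox itself.
x = 1:5;                 % weekday codes, Mon..Fri
xmin = min(x);
xmax = max(x);
ymin = -1;               % target range lower bound
ymax = 1;                % target range upper bound

y = (ymax - ymin) * (x - xmin) / (xmax - xmin) + ymin;
disp(y);                 % values: -1, -0.5, 0, 0.5, 1
```

After scaling, the middle code (Wednesday = 3) becomes 0, so the absence of a 0 in the raw codes does not survive the preprocessing.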

We tested various versions of encoding.

Let ‘i’ be a number from 0 to 4. We tested 5 versions with the following code:

`dateWeekDaysOrig = weekday(dates) - 1; % this maps the date to 1..5 (Mon..Fri)`

`dateWeekDays = mod(dateWeekDaysOrig - 1 + i, 5) + 1;`

**For the i = 0 case, the mod() line does nothing. That is the case we used in our previous studies.**

For the i = 1 case, the encoding is as follows:

Monday: 2

Tuesday: 3

Wednesday: 4

Thursday: 5

Friday: 1
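The five rotations can be listed with a short script. This is only a sketch built around the mod() line above; `dayNames` and the one-sample-per-weekday input are ours, added for illustration.

```matlab
% Print the weekday codes produced by each rotation i = 0..4.
dayNames = {'Mon', 'Tue', 'Wed', 'Thu', 'Fri'};
dateWeekDaysOrig = 1:5;                 % one sample per weekday, Mon..Fri

for i = 0:4
    dateWeekDays = mod(dateWeekDaysOrig - 1 + i, 5) + 1;
    fprintf('i = %d:', i);
    for d = 1:5
        fprintf('  %s=%d', dayNames{d}, dateWeekDays(d));
    end
    fprintf('\n');
end
```

In every row the discontinuity falls between the day coded 5 and the day coded 1: between Friday and Monday for i = 0, between Thursday and Friday for i = 1, and between Monday and Tuesday for i = 4.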

As a sanity check of the backtest, we also mirrored the input. This should give the same performance as the original version:

Monday: 5

Tuesday: 4

Wednesday: 3

Thursday: 2

Friday: 1
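The post does not show the mirroring code; one way to produce the mapping above, assuming the same 1..5 codes, is `6 - d`. The sketch below also checks that mirroring keeps the distance between consecutive weekdays unchanged, which is why we expect the same performance.

```matlab
% Mirror the weekday codes (assumed formula: 6 - code).
dateWeekDaysOrig = 1:5;                        % Mon..Fri
dateWeekDaysMirrored = 6 - dateWeekDaysOrig;   % Mon=5, Tue=4, ..., Fri=1
disp(dateWeekDaysMirrored);

% Distances between consecutive weekdays are the same in both encodings,
% so the Friday/Monday discontinuity stays where it was.
sameSpacing = isequal(abs(diff(dateWeekDaysOrig)), ...
                      abs(diff(dateWeekDaysMirrored)));
disp(sameSpacing);                             % 1 (true)
```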

Our backtest produced the following measurements.

**These measurements show that by far the best choice is the Modulus = 0 case (i = 0), namely the encoding that we used in our previous studies.**

Why does it happen?

We approximate the daily %gain distribution with a smooth nonlinear function. This function tends to smooth out big differences between adjacent days. For example, if Friday is a huge down day but both of its neighbours (Thursday and Monday) are huge up days, and **if we pick an encoding that puts Friday somewhere in the middle of the range (Friday = 2, 3 or 4), the huge down feature of Friday will be smoothed out by its huge up neighbours**.

The situation is completely different if we let Friday sit at the edge of the range (Friday = 1 or 5). In that case, Friday’s down-ness is affected by only one up neighbour, not two. That can effectively let Friday be represented as a negative day.
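The edge effect can be illustrated with a toy smoother. The sketch below is not the MLP: it replaces the network with a simple neighbour average, and the gain values are made up purely for illustration. It compares Friday at the edge (i = 0) with Friday in the middle (i = 2).

```matlab
% Hypothetical average %gain per weekday, Mon..Fri (illustrative values only).
gainByDay = [1 1 1 1 -2];
fridayIdx = 5;                           % position of Friday in Mon..Fri order

for i = [0 2]                            % i = 0: Friday = 5 (edge); i = 2: Friday = 2 (middle)
    codes = mod((1:5) - 1 + i, 5) + 1;   % encoded value of each weekday
    gainByCode = zeros(1, 5);
    gainByCode(codes) = gainByDay;       % gains laid out along the encoded axis

    % Toy smoother: mean over each point and the neighbours that exist.
    smoothed = zeros(1, 5);
    for x = 1:5
        lo = max(1, x - 1);
        hi = min(5, x + 1);
        smoothed(x) = mean(gainByCode(lo:hi));
    end

    fprintf('i = %d: Friday encoded as %d, smoothed gain = %.2f\n', ...
            i, codes(fridayIdx), smoothed(codes(fridayIdx)));
end
```

With these made-up numbers, Friday keeps a negative smoothed gain (-0.50) at the edge, while in the middle its two up neighbours pull it up to 0.00.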

Let’s analyze our backtested results. Why do these differences arise? For example, in the Modulus = 0 case (the original), only 44% of Fridays are forecast as up days. This is the case where Friday is at the edge of the range (Friday = 5).

In the Modulus = 1 case Friday = 1, which is still at the edge, so let us look at Modulus = 2 instead. There Friday = 2, so Friday is in the middle of the range, and 54% of Fridays are forecast as up days, even though we know from experience that Fridays are usually bad. This is because, once Friday moves into the middle of the range, its neighbouring days lift its approximated %gain.

As an example, see these two different encodings of exactly the same data, each with an estimated function approximation (for illustration only).

We plot the 10-year aggregate daily %gain for the case when **Friday is encoded as 5.**

And when **Friday is encoded as 2.**

We conclude that **the encoding mechanism matters very much, and that previously we were very lucky** with our choice. It was also the most sensible encoding: we assumed that the largest discontinuities in the market happen over the weekend, so the **most sensible way is to put the discontinuity between Friday and Monday**, keeping Friday and Monday separated. In the future, we will stick to our previous encoding mechanism. We learned our lesson today. **To emphasize that the encoding is also a parameter of the algorithm**, we introduce two additional variables in the code:

`encodingType = '1dimension'; % or '5dimension'`

`encodingModulus = 0; % as default`
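A minimal sketch of how these two variables could drive the input construction follows. The post does not define the 5-dimensional encoding, so we assume here that it means one input node per weekday (a one-hot column per sample); `nnInput` and the sample data are hypothetical names and values.

```matlab
encodingType = '1dimension';            % or '5dimension'
encodingModulus = 0;                    % as default
dateWeekDaysOrig = [1 2 3 4 5 1];       % hypothetical samples: Mon..Fri, then Mon

% Rotate the codes first, exactly as in the mod() line used earlier.
rotated = mod(dateWeekDaysOrig - 1 + encodingModulus, 5) + 1;

switch encodingType
    case '1dimension'
        nnInput = rotated;                            % 1 x N, one input node
    case '5dimension'
        % Assumed meaning: one row per weekday code, one column per sample.
        n = numel(rotated);
        nnInput = zeros(5, n);
        nnInput(sub2ind(size(nnInput), rotated, 1:n)) = 1;
    otherwise
        error('Unknown encodingType: %s', encodingType);
end

disp(nnInput);
```

Under this assumption the rotation only permutes the rows in the 5-dimensional case, so no weekday sits at an edge of the range.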

Two notes:

– a GA (Genetic Algorithm) method could later search for the optimal parameters, so it is important that the GA can tweak the input representation as well.

– the **MLP is very sensitive to the input encoding**, but other algorithms, such as k-NN or the previously introduced **GRNN with 0.1 spread, should be insensitive** to it. That is a strength of GRNN that we shouldn’t forget.
