Our conceptual framework is different from the one in the link above.

Here is the outline of a system that can help in stock market decisions. One of the best ways to illustrate a concept is a flow chart, which usually shows the viewer the flow of information or the sequential steps.

Here is our future conceptual framework:

The system is general enough that it can work for predicting the SPX, the RUT, the VIX, the EUR/USD exchange rate, or anything else.

Let’s detail the bubbles in the flow chart a little bit more:

**1. Learning probability distributions from historical data.** Here we use the term Machine Learning in a general sense. We want to extract some useful prediction from historical data using the machine, the computer.

Let’s imagine a simple system that we usually don’t regard as a Machine Learning system.

Imagine that today the SPX is above its 200-day Simple Moving Average, SMA(200). A technical trader wants to take a bet on tomorrow's market direction. How does he decide whether to go short or long?

He looks back at the last 20 or 100 years of history and, based on that, calculates that every time the spot SPX was above the SMA(200), its next-day return was 0.1% on average, and when it was below the SMA(200), its next-day return was -0.2%. (These numbers are for illustration only.)

So our technical trader 'learns from the past samples' that the expected profit for tomorrow is positive, so he goes long the next day and buys SPX futures.

This is a very simple 'machine learning' system that anybody can 'calculate' in Excel in about an hour.
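The trader's whole 'learning' step fits in a few lines. Here is a hedged Python sketch (a synthetic random walk stands in for SPX history, so the printed numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic daily closes standing in for SPX history (illustration only).
n_days = 5000
closes = 100 * np.cumprod(1 + rng.normal(0.0, 0.01, n_days))

# 200-day simple moving average; sma200[i] is the SMA ending on day i + 199.
sma200 = np.convolve(closes, np.ones(200) / 200, mode="valid")

# Bucket each NEXT-day return by whether the close was above or below SMA(200).
above, below = [], []
for t in range(199, n_days - 1):
    next_ret = closes[t + 1] / closes[t] - 1
    (above if closes[t] > sma200[t - 199] else below).append(next_ret)

# The conditional mean is the 'Excel' prediction; the post argues the
# conditional StDev matters just as much.
print("above SMA(200): mean %.4f%%, stdev %.4f%%"
      % (100 * np.mean(above), 100 * np.std(above)))
print("below SMA(200): mean %.4f%%, stdev %.4f%%"
      % (100 * np.mean(below), 100 * np.std(below)))
```

With real SPX closes in `closes`, the same loop reproduces the trader's rule of thumb.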

**Our conceptual framework is general enough that we didn't fix the machine learning method. It can be a simple Excel calculation as mentioned before, or it can be a Linear Regression, a Neural Network, an SVM (Support Vector Machine), or a genetic algorithm: anything.**

The different learning algorithms create different mathematical models.

We try to use a machine learning method that is deterministic, so the same result can be reproduced across different backtests, but the deterministic nature is not a must.

**We prefer machine learning algorithms that learn a probability distribution function (PDF), and Excel is not like that.** The reason is that in one of the next steps we need not only the mean return (we actually prefer the median return); we also wish to obtain the expected volatility, the standard deviation.

Therefore we prefer to work with probabilities.

**We think the biggest mistake researchers generally make is that they research only the mean return, the expected profit or loss, but they fail to determine the expected volatility. Take the previously mentioned SMA(200) crossover method: we obtained that the profit is positive above the SMA(200), but we know nothing about the volatility. We contend that forcing volatility down can even increase the profit in the long term, which contradicts the general efficient market theory, which says that the more profit we expect, the more volatility we should suffer.**

**2.**

**The 'new' classical probability approach (associated with Ronald Fisher and the frequentists) uses only objective information coming from historical observations. That is the model we built in step 1. However, before the Fisher school of probability, others, like Bayes more than 100 years earlier, held that probabilities should be defined subjectively, based on prior beliefs.**

**Imagine Nassim Taleb's turkey example.** A farmer feeds the turkey nicely every day for 1000 days. The turkey is very happy with his friend, the farmer. However, on day 1001 it is Christmas time, and the farmer comes with a knife instead of food. A mathematician following Fisher would build up the turkey's probability model using only the historical data, the last 1000 days of observations.

Fisher wouldn’t use other ‘fundamental information’, like the ‘general knowledge’ (belief) that

– a farmer's turkeys are eaten in the end anyway (almost without exception), as that is the purpose of raising turkeys;

– as Christmas day approaches, there is a higher and higher probability that the farmer brings the knife instead of the food.

**This fundamental knowledge doesn't fit into the Fisher approach, but Bayes and Pascal would happily use this information too to build up the PDF (Probability Distribution Function).**

**Plato had no idea about PDFs, but even to him this would have seemed the better approach. Probability distributions are eternal objects: they exist irrespective of the observations. No matter how many observations we make, we cannot fully get to know the probability distribution this way.**

**This is especially true for extreme values, outliers, power law distributions, in which extreme events occur very rarely, so observing them is quite difficult or impossible.**

**We can call this 'new' (rather old) approach belief-based, subjective, or non-objective probability. We tend to prefer the term 'non-objective', but that is only personal taste; they mean the same thing.**

**What do we mean by subjectivity** (non-objectivity) in the context of stock market prediction?

Things that affect the PDF but cannot be observed in the historical samples of the last 3 years. **Examples:**

– events like a USA presidential election next week (because the last 3 years of data don't contain the previous one);

– our general belief (from reading news and media) that iPhone sales are 'probably' very good, because there were long queues in front of Apple shops;

– Mario Draghi makes a speech saying he is willing to do everything to save the EU currency, and we believe he will;

– our belief that Cloud computing will be a big success in the future, so all cloud companies will perform better than other technology companies.

Because these fundamental things cannot be expressed through historical observations in step 1, we include these effects into our mathematical model here in step 2.

But how? It is not easy.

In step 1 we synthesized a probability distribution based on historical samples. We can use Gaussian distributions, log-normal distributions, Levy stable distributions, etc. A Gaussian distribution can be described by 2 parameters, Mean and StDev, so our belief in step 2 can modify these parameter values. For example, a bullish (bearish) belief can increase (decrease) the Mean. If we expect higher volatility (say, a coming USA election), our belief increases the StDev. If we expect lower volatility (because the ECB starts to buy Southern-country bonds), we decrease the StDev. We, however, prefer to work with log-normal and Levy stable distributions. Those have more obscure parameters, so it is not as easy to express our belief in them as it is here.

There is another question too: how much should we change these parameters?

There is no general, formalized answer to this.

We suggest modifying them a little, running 100K simulations based on those new parameters, and calculating the CAGR, maxDD, and StDev of the PV to see the effect of the modifications.
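Under a Gaussian assumption, steps 2 to 4 can be sketched in a few lines of Python. The belief adjustments below are invented, purely illustrative values; only the historical mean/SD come from the figures quoted in this post series:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 output: parameters learned from history (Gaussian assumption;
# SPY-like daily figures quoted elsewhere in this post series).
hist_mean, hist_sd = 0.000384, 0.0124

# Step 2: non-objective (belief) adjustments -- invented, illustrative values.
belief_mean_shift = 0.0002   # a bullish belief bumps the Mean up
belief_sd_mult = 1.3         # e.g. an election week raises the StDev

mean = hist_mean + belief_mean_shift
sd = hist_sd * belief_sd_mult

# Step 3: generate 1 million samples of the next-day return from the modified PDF.
samples = rng.normal(mean, sd, 1_000_000)

# Step 4: statistics of the simulated next-day return.
print("mean %.5f, median %.5f, stdev %.5f"
      % (samples.mean(), np.median(samples), samples.std()))

# Treating the samples as one long time series (as the post suggests),
# estimate the CAGR and maximum drawdown of the implied equity curve.
log_ret = np.log1p(samples)
log_pv = np.cumsum(log_ret)
cagr = np.expm1(252 * log_ret.mean())
max_dd = np.expm1((log_pv - np.maximum.accumulate(log_pv)).min())
print("CAGR %.2f%%, maxDD %.2f%%" % (100 * cagr, 100 * max_dd))
```

Re-running this with different belief shifts shows exactly how sensitive the CAGR and maxDD are to the subjective adjustment.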

**3.**

Using our non-objectively modified probability distribution, we **generate 1 million samples for the next-day return.**

**4.**

We determine the Mean, Median, StDev, and other statistics of the next-day return from the simulated samples. **You don't have to do this for the Gaussian distribution, but you do for general probability distributions.**

**5.**

Because we work with time series and place bets every day, a positive expected next-day return doesn't by itself mean we should place a bet.

See the volatility drag and the previous post for more explanation.

In short, if the volatility is high, it is better to stay in cash, even if the expected profit is positive.

In step 5, **based on the StDev we determine a minimum threshold for the Mean. If the simulated Mean/Median is smaller than this threshold, we stay in cash**.

For example: with a 4.5% daily StDev, the volatility drag is about 26% annually, so the threshold is 0.1% daily.

It means if the simulated Mean/Median is positive, but less than 0.1%, we stay in cash, and don’t go Long. Similarly, if the simulated Mean/Median is negative, but more than -0.1%, we stay in cash, and don’t go Short.

To determine these thresholds we can use the already generated 1M simulation samples, treating them as one long time series.
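The threshold arithmetic can be checked directly. Under a lognormal approximation the drag is about StDev²/2 per day (our reading of the numbers above; 250 trading days assumed):

```python
def drag_threshold(daily_sd, trading_days=250):
    """Annual volatility drag (~ sd^2/2 per day, summed over the year)
    and the implied minimum daily Mean threshold."""
    annual_drag = 0.5 * daily_sd ** 2 * trading_days
    daily_threshold = annual_drag / trading_days   # = 0.5 * daily_sd ** 2
    return annual_drag, daily_threshold

annual_drag, threshold = drag_threshold(0.045)
print("annual drag: %.0f%%" % (100 * annual_drag))    # ~25%; the post rounds to 26%
print("daily threshold: %.2f%%" % (100 * threshold))  # ~0.10%, as in the post
```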

**This system can be called Conceptual Framework, but we prefer to call it “Machine learning based, non-objective probability-model for regular and extreme values of stock returns”.**

**We build up and use a probability model that can use non-Gaussian heavy-tail distributions. By generating 1 million simulations or more (this system is simulation based), we can model not only Gaussian but also extreme long-tail stock market moves.**


**Can the volatility drag be quantified by the simulation that was performed in the previous post?**

Let's construct the toy SPY model in a way that the **Expected %change is a positive constant every day, but very, very close to zero (for example, 0.00001).**

A naive observer would say that in this case, the Buy & Hold strategy would be profitable, since every day has a positive expected outcome, so it is worth taking a Long position in equities.

That is not the case.

And we show here that the outcome largely depends on the SD.

**Assuming a Gaussian distribution, the real-world SPY has a mean of 0.000384 (0.0384%) and a Standard Deviation (SD) of 0.0124 (1.24%).**

Let’s run our toy SPY generation process:

– **assuming an SD of 4.5% (which mimics the SPY triple ETFs)**, let's construct the time series 100 times and average the results.

The** CAGR of the toy SPY is -23%.**

It means that if we bet randomly on the outcome and our daily expected profit is zero, we should expect a -23% capital loss every year.

– **assuming an SD of 3% (which mimics the SPY Ultra (double) ETFs)**, let's construct the time series 100 times and average the results.

The **CAGR is -11%.**

It means that if we bet randomly on the outcome and our daily expected profit is zero, we should expect an -11% capital decrease every year.

– **assuming an SD of 1.5% (which mimics the non-leveraged SPY ETFs)**, let's construct the time series 100 times and average the results.

**The CAGR is -2.7%.**
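The three bullet results can be reproduced with a short Monte Carlo. This is our own sketch; the seed, day count, and run count are arbitrary choices:

```python
import numpy as np

def toy_spy_cagr(daily_sd, mean=0.00001, n_days=100_000, n_runs=20, seed=1):
    """Average CAGR of a random-walk 'toy SPY' with a near-zero daily mean."""
    rng = np.random.default_rng(seed)
    cagrs = []
    for _ in range(n_runs):
        rets = rng.normal(mean, daily_sd, n_days)
        log_growth = np.log1p(rets).mean() * 252   # annualized log growth
        cagrs.append(np.expm1(log_growth))
    return float(np.mean(cagrs))

for sd in (0.045, 0.03, 0.015):
    print("SD %.1f%% -> CAGR %.1f%%" % (100 * sd, 100 * toy_spy_cagr(sd)))
```

The output lands near the post's -23% / -11% / -2.7%, matching the analytic drag of roughly SD²/2 per day.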

The loss strongly depends on the SD. However, there is good news here.

**If we have an instrument that is not very volatile, with an SD of less than 1.5% (as for the SPY), we don't have to worry too much about the cash position.** This simulation shows that if the probabilities are with us (and not against us), i.e. if the Expected profit is above zero, even just slightly above zero, we can enter the position (long or short) at full size. **Not going to cash is forgivable, because the most we can lose is the -2.7% annual loss from the volatility drag.**

**However, with the triple ETFs the situation is different. The expected profit should compensate for the -23% annual loss of the volatility drag.**

**As the currently popular VIX volatility products (ETFs, futures) can have a Beta of 2 or 3 relative to the SPY**, the cash position should be a frequent position in any strategy that plays the VIX.

**If the expected profit for the next day is not greater than 23%/250 = 0.1%, the strategy should favour the cash position for the sake of CAGR.**

**If our utility (goodness) function is not the CAGR but the Sharpe ratio**, in which the volatility counts too, we would say this threshold should be even **greater than 0.1% (maybe 0.15% or 0.2%) before a strategy should leave the cash position.**


These strategies are currently retired, but audited performance results can be obtained from here and here.

They were 'by and large' daily MR strategies. To be honest, that is a crude simplification, because YK was a learning algorithm; but because daily MR was very successful in 2008, the YK strategy played mostly that.

**A debate arose about how to play a Mean Reversion signal correctly.**

**If SPY drops -5% on a given day, it is clear that a Mean Reversion strategy would go long at the end of the day to prepare for the bounce tomorrow.**

**However, if SPY drops only 0.10%, our human intuition says that is not a strong signal to go long tomorrow.**

**Should we go to Cash if the MR signal is weak?**

**Michael admitted that with a weak signal, a close-to-zero daily change like 0.1%, the expected profit for the next day is quite small, but he insisted that we should go long even in this case, because the Expected Value of the profit is still positive.**

In this post we construct an artificial SPY example with known probabilities. Our mathematical model is a stochastic process, so the outcome of every simulation is random, not deterministic; however, all probabilities are known.

**We will play a daily Mean Reversion (MR) on this artificial SPY stock that we construct.**

**Our aim is to refute that claim: we show that even if the expected profit is positive, we are better off staying in Cash on a weak MR signal.**

**First, let's construct our SPY stock.**

**Assuming a Gaussian distribution, the real-world SPY has a mean of 0.000384 (0.0384%) and a Standard Deviation (SD) of 0.0124 (1.24%).**

In our mockup, we create an SPY with SD = 3% (assume a Beta of 2 relative to the real-world SPY; the YK strategy played the Ultra ETFs anyway), and with a mean that is not fixed but varies.

We want to play daily MR on this stock, so we construct an SPY that has the following mean as %change for the next day:

Note that the mean is known every day, but the actual next-day outcome is not, so we still have a random process. After the Mean is determined by this function, the actual next-day %change is drawn from a Gaussian process with this Mean.

Note the strong Mean Reversion feature of the generated time series: when previous day %change is negative, we generate a new day with a positive expected value.

**Let f() be a stochastic function that transforms the %change of the previous day into the %change of the next day for SPY.** Then we can say that:

**E(f(x) | x < 0) > 0**, which means negative days imply that the next-day change has a positive Expected Value,

and

**E(f(x) | x > 0) < 0**, which means positive days imply that the next-day change has a negative Expected Value.

**A perfect instrument for playing daily MR.**

Our next day SPY price is calculated by

**SPY = SPY * (1 + f(x)),**

**where f(x) = N(mean for next day, SD);**

where N(mean for next day, SD) is a Normal distribution process with the specified mean and SD.

We have a **threshold of -1% for strong mean reversion.** If today's %change is less than that, we have a strong MR signal for the next day, because our Expected %change for the next day is +0.3%.

The whole f(x) function is similar to a -x function, except that **between -0.5% and +0.5% it is very, very close to zero.**

We say we have a **weak MR signal** when the %change of the previous day is in this range.

Michael can argue that even in this region the expected profit for the next day is positive with the MR strategy, but we will show that this does not make the trade worth taking.

**The expected profit is positive; yes; but it doesn’t imply we should take a position other than Cash.**

**(The Expected profit for the next day is positive, yes, but the Expected profit of the MR strategy will not be.)**

We generate two strategies: one (**Strategy1) is a pure MR** that goes long if the previous day was negative and goes short otherwise.

The other strategy (**Strategy2) goes to cash on a weak MR signal, between -0.5% and 0.5%**.

We generated 100,000 days of data, which equals about 400 years of stock market days.

One run of the simulation is charted here: (click for better image)

The outcome of the stochastic simulation is random; therefore we repeated the simulation 100 times to get reliable (not too random) results. The presented statistics are the averages of those 100 simulations.

The **MR+Cash position strategy is in cash 13% of the time.** This is with the default 3% SD of the SPY generation.

(In case we use 2% SD for SPY generation, we are in cash 20% of the time.)
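The post does not spell out the exact f(), so the sketch below assumes a piecewise mean function with the stated properties (+0.3% expectation below -1%, near zero inside the ±0.5% weak zone, a linear ramp in between). A single-run Python version of the two strategies:

```python
import numpy as np

rng = np.random.default_rng(7)

SD = 0.03        # daily StDev of the artificial SPY
WEAK = 0.005     # |previous %change| below this counts as a weak MR signal
N = 100_000      # roughly 400 years of trading days

def next_day_mean(prev):
    """Assumed stand-in for the post's f(): strong MR beyond +-1%,
    near-zero expectation in the weak zone, a linear ramp in between."""
    if prev <= -0.01:
        return 0.003               # strong MR: expect +0.3% after a big drop
    if prev >= 0.01:
        return -0.003
    if abs(prev) < WEAK:
        return -0.01 * prev        # very, very close to zero
    return -0.3 * prev

changes = np.empty(N)
prev = 0.0
for i in range(N):
    prev = rng.normal(next_day_mean(prev), SD)
    changes[i] = prev

# Strategy1: pure MR -- long after a down day, short after an up day.
pos1 = np.where(changes[:-1] < 0, 1.0, -1.0)
# Strategy2: the same, but Cash when the signal is weak.
pos2 = np.where(np.abs(changes[:-1]) < WEAK, 0.0, pos1)

ret1, ret2 = pos1 * changes[1:], pos2 * changes[1:]

print("time in cash: %.1f%%" % (100 * np.mean(pos2 == 0)))
print("SD: pure MR %.2f%%, MR+Cash %.2f%%" % (100 * ret1.std(), 100 * ret2.std()))
print("CAGR: pure MR %.1f%%, MR+Cash %.1f%%" % (
    100 * np.expm1(252 * np.log1p(ret1).mean()),
    100 * np.expm1(252 * np.log1p(ret2).mean())))
```

The time-in-cash figure comes out around 13%, matching the post; averaging many such runs (as the post does) stabilizes the CAGR comparison.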

**Conclusion:**

– Because of the cash position, we are not too surprised that the **2.8% SD of the MR+Cash strategy is less than the 3.0% of the pure MR strategy.**

**What is really surprising is that the profit, the annual CAGR, is higher too (37.01% instead of 35.04%)**.

It is higher **despite the fact that the Expected Profit was positive on the days we replaced with a cash position.**

**So we missed some positive profit in the MR+Cash strategy compared to the pure MR, yet still got a better profit.**

**How is it possible** that even with that missed profit, the CAGR is higher in the MR+Cash strategy?

**The answer lies in the volatility, and in the fact that we are talking about a time series, i.e. a sequence of discrete days.**

If our job were to bet on the MR outcome on a weak MR signal just once, on a single day in our lifetime, we would play the MR strategy (and not go to Cash), because the expected outcome of playing the MR is always positive.

However, we are dealing with a time series, the daily aggregation (multiplication) of every day's outcomes. In this case, not only the expected value but also the SD of the time series matters.

Therefore we omit the MR strategy on weak MR signals, and we stay in cash.

As the simulation shows this decreases the volatility and increases the profit too.

– With both a better CAGR and a better SD, no wonder the Sharpe ratio increased from 11.68 to 13.23.

– The answer to the mystery is that **the time series profit is decreased by the volatility drag of the time series.**

**– When we bet on the outcome of the next day, the expected profit should be higher than a threshold to compensate the volatility drag. And this threshold for the expected profit should be significantly higher than zero.**

– Our world is not simply black and white. **If it isn't worth shorting SPY, that doesn't imply we should go long SPY. There is a fine band between the two where neither short nor long positions are worth taking; we are better off in cash. The higher the volatility, the larger the region around the decision boundary of the expected profit in which we should stay in cash.**

**– The cash position is generally preferred when we are uncertain about the outcome. The cash position is a good risk-mitigation tool too.**


There is a distribution called the Levy distribution

http://en.wikipedia.org/wiki/L%C3%A9vy_distribution

which has 2 parameters, and it is not really what we are looking for.

A generalization of it is the Levy alpha stable distribution:

http://en.wikipedia.org/wiki/Stable_distributions

(a quote from the wiki page that is relevant to our case

“

It was the seeming departure from normality along with the demand for a self-similar model for financial data that led Benoît Mandelbrot to propose that cotton prices follow an alpha-stable distribution with α equal to 1.7. Lévy distributions are frequently found in analysis of critical behavior and financial data (Voit 2003, § 5.4.3).

“)

The Levy alpha stable distribution has the following 4 parameters:

alpha = 1.5; % characteristic exponent; describes the tail of the distribution
beta = 0; % skewness, asymmetry
gamma = 1; % scale, c (almost like a variance)
delta = 0; % location (almost like a mean)

A good summary is here:

http://math.bu.edu/people/mveillet/html/alphastablepub.html

which shows that the Gaussian, Cauchy, and simple Levy distributions are all special cases of the Levy alpha stable distribution.

Also, that link points to a package that in theory can be used to estimate the parameters from samples, or to calculate the PDF and CDF from the parameters. Unfortunately, that Matlab code is buggy, so we couldn't use it to estimate the parameters.

However, we later used it to generate the PDF and CDF from the parameters.

Luckily, we found another software package that seems to work for estimating the 4 parameters from the samples:

1.

Let’s see how to code it in Matlab:

1.1. Generating the parameters:

aMean = mean(pChanges);
stDev = std(pChanges); % uses n-1 as the denominator

params = alpha_loglik(pChanges);
disp(sprintf('The optimizing value of alpha is: %f', params.alph));
disp(sprintf('The optimizing value of beta is: %f', params.bet));
disp(sprintf('The optimizing value of gamma is: %f', params.gamm));
disp(sprintf('The optimizing value of delta is: %f', params.delt));

1.2 Plotting the PDF:

x = -0.59:0.01:0.39;

yGauss = gaussmf(x, [stDev aMean]);
plotGauss = plot(x, min(yGauss, 5.2));
set(plotGauss, 'Color', 'green', 'LineWidth', 2);

yAlphaLevy = stblpdf(x, params.alph, params.bet, params.gamm, params.delt, 1e-12);
plotLevy = plot(x, min(5.2, yAlphaLevy ./ max(yAlphaLevy))); % normalize the maximum to 1
set(plotLevy, 'Color', 'red', 'LineWidth', 2);

xlabel('Gaussian vs. Alpha Stable Levy');

1.3 Calculating the CDF:

cdfGauss = normcdf(x,aMean,stDev);

cdfLevy = stblcdf(x,params.alph,params.bet,params.gamm,params.delt,1e-12);

2.

A natural question: what are the fitted parameters of the Levy alpha stable distribution for the AAPL daily %change? Here they are:

params =

alph: 1.6228 % characteristic exponent; describes the tail of the distribution
bet: 0.20171 % skewness, asymmetry
gamm: 0.016158 % scale, c (almost like a variance)
delt: 0.0015028 % location (almost like a mean)

Alpha is 1.62, so it has a long tail. It is comparable to the cotton-price alpha of 1.7 that Mandelbrot calculated.

Beta is 0.2: there is some positive skew, or asymmetry. No wonder, since the Apple stock price mostly trended up, and in general, as the stock market trends up, there are more up days than down days.

Delta is 0.0015. That is a location parameter, not exactly an arithmetic mean, but you can interpret it as a typical daily %change of about +0.15% (a positive number). Be careful with the word 'mean' here, though: for alpha-stable distributions with alpha ≤ 1 (such as the Cauchy distribution) the mean is undefined, and for any alpha < 2 the variance is infinite, so sample averages converge poorly. (We can always talk about the median, though.)

3.

Let’s see visually how the Levy alpha stable distribution fits to the real life samples. So, plot the PDF of the samples (blue bars), the Gaussian (green line) and the Levy alpha stable (red).

It is amazing how nicely the Levy version fits the samples. In contrast the Gaussian estimation looks clumsy.

In the centre of the plot the Levy curve appears to be under the Gaussian; however, at the tails the Levy should be above the Gaussian, since the Levy correctly estimates the 'fat tails' of the distribution. So let's zoom into the 0-0.2 range to see where the two distributions cross.

4.

As an illustration what is the difference of probabilities at the tail, when using Levy vs. Gaussian.

For example, let’s go back to the day, when AAPL dropped -52% on a single day.

The PDF at -0.52 is:

Gaussian: 1E-60

Levy: 0.0016 = 1.6E-3

That is much of a difference.

Note that this is the PDF (not the CDF), so don't use it to calculate chances. It only illustrates the difference between the two, and shows that the Gaussian PDF is so small that no integration of those tiny values can produce a significant probability (CDF) at that level.
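SciPy ships an alpha-stable implementation, so the -52% comparison can be re-checked. A sketch: the stable parameters are the fit above, the Gaussian mean/StDev come from the earlier AAPL post, and SciPy's default parameterization may shift the values slightly:

```python
from scipy.stats import levy_stable, norm

# Parameters from the AAPL alpha-stable fit in this post.
alpha, beta, gamma, delta = 1.6228, 0.20171, 0.016158, 0.0015028
# Gaussian fit from the earlier AAPL post (mean 0.12%, StDev 3.02%).
a_mean, st_dev = 0.0012, 0.0302

x = -0.52  # the single-day -52% drop
pdf_gauss = norm.pdf(x, loc=a_mean, scale=st_dev)
pdf_levy = levy_stable.pdf(x, alpha, beta, loc=delta, scale=gamma)
print("Gaussian PDF at -52%%: %.1e" % pdf_gauss)   # astronomically small
print("Levy PDF at -52%%:     %.1e" % pdf_levy)    # on the order of 1e-3
```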

5.

We have to confess that in the previous post we used the PDF for probability calculations. That was wrong, but after recalculating the numbers, the main message is still the same. We partially amend that in this post: now we correctly use the CDF for probability calculations.

– p(-10% drop) =

Gauss: 0.04% (every 2500 trading days; every 10 years)

Levy: 0.7% (every 140 trading days; about twice per year). Yes, fundamentally this is possible, since there are 4 earnings dates per year.

– p(-20% drop) =

Gauss: 1.3e-11, about once every 1e+11 days. As Earth is about 10^12 days old, it could happen 10 times in the lifetime of the Earth.

Levy: 0.2% (every 500 trading days; every 2 years)

– p(-52% drop) =

Gauss: 5.6e-67, rarer than once in the lifetime of the known Universe.

Levy: 0.045% (every 2200 trading days; every 9 years)

– p(+33% gain) =

Gauss: 1.07e-26, about once every 1e+26 days.

Levy: 0.15% (every 666 trading days; every 3 years). Maybe that is an exaggeration.
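These CDF-based tail probabilities can be checked the same way (again with the fitted parameters from this post; the exact values depend on SciPy's parameterization, so treat them as approximate):

```python
from scipy.stats import levy_stable, norm

# AAPL fits: alpha-stable from this post, Gaussian from the earlier post.
alpha, beta, gamma, delta = 1.6228, 0.20171, 0.016158, 0.0015028
a_mean, st_dev = 0.0012, 0.0302

for drop in (-0.10, -0.20):
    p_gauss = norm.cdf(drop, loc=a_mean, scale=st_dev)
    p_levy = levy_stable.cdf(drop, alpha, beta, loc=delta, scale=gamma)
    print("P(<= %.0f%%): Gauss %.2e, Levy %.2e (every ~%.0f trading days)"
          % (100 * drop, p_gauss, p_levy, 1 / p_levy))
```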

We let the reader decide which mathematical model (Gaussian or alpha stable Levy) fits the real life data better.

Just out of curiosity, according to the Levy model, on every single day

– there is a 0.74% chance of a -10% drop

– there is a 1.1% chance of a +10% gain (a strange asymmetry; one would expect the chance of a same-size gain to be smaller, because drops are more violent).

That is true, but in general AAPL trended up, so the whole distribution is skewed to the right: more samples show gains than losses; that is the reason.

Obviously, if a stock goes up a little on 99 out of 100 days, the distribution is skewed to the right.

The same thing, in other words:

On every single day:

– there is a 1% chance of a -8.5% drop

– there is a 1% chance of an 11% gain // there are more gains than drops

(A 1% chance realistically happens every 100 trading days, about every 5 months.)

It means that any good risk management strategy should consider that

– a -10% drop can occur twice per year (the Gaussian thinks it happens every 10 years),

– and a -20% drop can occur every 2 years (the Gaussian thinks it is impossible).

Conclusion

This post tries to be an eye-opening piece about AAPL price changes, similar to what Mandelbrot's book 'The (Mis)Behaviour of Markets' did for many real-life events.

We showed how useless Gaussian-based risk estimates, probabilities, and likelihood calculations are for real-life stock prices (Apple). A much better estimate is based on the Levy alpha stable distribution.


In this case study, we looked at a specific stock, Apple (ticker: AAPL).

See the historical Apple price chart here:

We took the daily historical adjusted close prices, and then we calculated the daily %changes from it.

Let’s see how the distribution of the daily price %changes fit the Gaussian curve.

Since Apple IPO, we have 7000 days of data, which is 26 years.

The MATLAB code is not too difficult:

”

pChanges = closePrices(2:end) ./ closePrices(1:end-1) - 1;

aMean = mean(pChanges);

stDev = std(pChanges); % it uses the n-1 as a denominator

figure;

hold on; % plot 2 time series on each other

[nInBins, xout] = hist(pChanges, 600);

nInBins = nInBins ./ max(nInBins) .* 2.0 ; % convert the max to 2

bar(xout, nInBins);

x=-0.59:0.01:0.39;

y=gaussmf(x,[stDev aMean]); % generate Gaussian

plot(x,y)

”

The produced chart is here (you have to click it to see it properly in its full size).

What can we observe?

**The mean %change is 0.12%. The stDev is 3.02%.**

That looks like quite a lot of standard deviation. It would mean that

– the price of AAPL changes by more than 3% about 31% of the time (every 3rd day, ZScore = 1), or equivalently,

– the price of AAPL changes by more than 6% about 5% of the time (every 20th day, ZScore = 2).
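Both bullet claims follow directly from the Gaussian tail; a quick Python check (our own sketch, ignoring the small nonzero mean):

```python
from scipy.stats import norm

st_dev = 0.0302
p_3pct = 2 * (1 - norm.cdf(0.03 / st_dev))   # P(|daily change| > 3%), ~1 StDev
p_6pct = 2 * (1 - norm.cdf(0.06 / st_dev))   # P(|daily change| > 6%), ~2 StDev
print("P(|change| > 3%%) = %.0f%% (every ~%.0f days)" % (100 * p_3pct, 1 / p_3pct))
print("P(|change| > 6%%) = %.1f%% (every ~%.0f days)" % (100 * p_6pct, 1 / p_6pct))
```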

So one could argue that the 3% stDev shows that the stock is very volatile.

However, even this seemingly high-volatility model cannot explain AAPL's real-life price behaviour over the years.

Specifically, it cannot explain a -52% drop in a single day, for example.

Let’s see some historical events in the Apple stock price:

1.

**Worst day: 2000-09-29: -52% single day loss.**

Apple had a grim earnings report on that day and it triggered many downgrades.

It was a brutal day for Apple:

”

Shares of the Cupertino, Calif.-based company** fell $27.75, or nearly 52 percent, to $25.75. Volume topped 132 million shares, more than 26 times the stock’s average daily volume of about 5 million shares.** Analysts at nearly a dozen financial institutions downgraded Apple and penned scathing reports on the company.

”

The generated Gaussian function (that fits to that mean and stDev) says that the probability of this is

P(-52% daily loss) = 2.4 * 10^(-65). (In a scientific notation it is: 2.4e-65).

It is a very, very small value.

**On average, this loss should occur every** 1/(2.4·10^(-65)) days; let's say it realistically occurs **every 10^65 days.**

Just to illustrate how big value is this:

How many days old is the earth?

**Earth is 4.5B years old**, which is 4,500,000,000 × 365 days = 4.5·10^9 · 365 ≈ 1.6·10^12 days.

So, **Earth is about 10^12 days old, and the event that Apple stock price drops -52% should occur every 10^65 days.**

**It shouldn’t have occurred in the lifetime of the Earth!**

Do you think there is a problem with the Gaussian mathematical model to describe financial data, or do you think the Gaussian function properly models real life events?
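The claim can be reproduced with the Gaussian CDF, which is strictly the right quantity for 'a loss at least this big' (the mean and StDev are the fitted values above):

```python
from scipy.stats import norm

a_mean, st_dev = 0.0012, 0.0302   # the fitted Gaussian from above
# Exceedance probability of a one-day return <= -52% (CDF, not PDF).
p = norm.cdf(-0.52, loc=a_mean, scale=st_dev)
print("P(one-day return <= -52%%) = %.1e" % p)   # on the order of 1e-66
print("Earth's age: ~1.6e12 days")
```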

2.

**Second worst day: 1987-10-19, -25% single day loss.**

This was the famous Black Monday (1987) day when the Dow Jones dropped -22% on that single day. In itself, it was a Black Swan event.

The generated Gaussian function (that fits to that mean and stDev) says that the probability of this is

P(-25% daily loss) = 9.7 * 10^(-16). (in a scientific notation it is: 9.7e-16).

On average, this loss should occur every 1/(9.7·10^(-16)) days; let's say it realistically occurs every 10^15 days.

(Again: Earth is 10^12 days old).

3.

**Best day: 1997-08-06, 33% single day gain.**

The event for the day was the following:

”

1997: Microsoft rescues one-time and future nemesis Apple with a $150 million investment that breathes new life into a struggling Silicon Valley icon

”

The generated Gaussian function (that fits to that mean and stDev) says that the probability of this is

P(33% daily gain) = 1.9 * 10^(-26). (in a scientific notation it is: 1.9e-26).

On average, this gain should occur every 1/(1.9·10^(-26)) days; let's say it realistically occurs every 10^26 days.

(Again: Earth is 10^12 days old).

4.

**Second Best day: 1997-12-31, 24% single day gain.**

P(24% daily gain) = 2.7 * 10^(-14). (in a scientific notation it is: 2.7e-14).

**On average, this gain should occur every** 1/(2.7·10^(-14)) days; let's say it realistically occurs **every 10^14 days.**

**(Again: Earth is 10^12 days old).**

In other words, **if Earth's lifetime were 100 times longer than it is, this event should have occurred only once, on a single day.**

Conclusion:

Based on this data, we contend that stock price **time series do not fit the Gaussian model.**

Financial time series do not belong to the world of Mediocristan. Unfortunately, the general mathematical risk models used by banks, hedge funds, and regulators are based on the Gaussian distribution. We conclude that real-life price series do not behave according to that mathematical model.

We would urge the investigation of other risk models: the Levy distribution, power laws, or Mandelbrot's fractals, which we reckon would fit real-life data better.


There is one idea that helps similarly to a stop loss, but it tells the trader exactly when to enter the position again. It is a trailing indicator called 'playing the equity curve' (albeit it has various other names in other terminologies). The 'equity curve' is the Portfolio Value curve of the strategy. It can be any strategy, simple or complex; it doesn't matter.

The basic idea is to **play the strategy when its equity curve is above its 200-day moving average, and go to cash (or play the inverse strategy) otherwise.**

The method follows not one, but three portfolio value (PV) charts.

– Original PV

– EMA PV (Exponential Moving Average of the original PV)

– Played PV (is played in real life)

It is essentially a trend-following method applied to the strategy itself: when our strategy shows strength, we play it; when it shows weakness, we stop it out. All with a lagging indicator.

It has only 1 additional parameter: the lookback days of the EMA (an SMA can be used too).
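The mechanics above can be sketched in a few lines; this is our own Python sketch, not the blog's implementation, and `lookback` is the single extra parameter just mentioned:

```python
import numpy as np

def play_equity_curve(strategy_returns, lookback=200, inverse=False):
    """Follow the strategy's PV; trade it only while PV is above its EMA.
    Below the EMA: go to cash (or play the inverse strategy if inverse=True)."""
    alpha = 2.0 / (lookback + 1)
    pv, ema = 1.0, 1.0
    played = []
    for r in strategy_returns:
        # Decide today's exposure from YESTERDAY's PV vs EMA (lagging signal).
        if pv >= ema:
            played.append(r)
        else:
            played.append(-r if inverse else 0.0)
        pv *= 1.0 + r
        ema += alpha * (pv - ema)
    return np.array(played)

# Example: a strategy that works, flatlines, then works again (3 'regimes').
rng = np.random.default_rng(3)
rets = np.concatenate([rng.normal(0.002, 0.01, 600),
                       rng.normal(0.0, 0.01, 750),
                       rng.normal(0.002, 0.01, 600)])
played = play_equity_curve(rets, lookback=75)
print("original PV: %.2f, played PV: %.2f" % (np.prod(1 + rets), np.prod(1 + played)))
```

The three-regime toy series mirrors the worked / stopped-working / worked-again shape discussed below.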

Here is the original equity curve of a strategy over 8 years. It is a special volatility strategy, but in the context of this blog post, it doesn’t matter.

The human eye can ‘clearly’ recognize some patterns: the strategy worked in the first 2.5 years, then it stopped working for 3 years, then it worked again for 2.5 years.

**We see 3 market regimes accordingly. In regime 1 and regime 3, we should play the strategy, and we should sidestep regime 2.**

**The ‘playing the equity curve’ technique will be a great help. Won’t it?**

**Version 1:**

Be in Cash on the downside.

Let’s see it applied with EMA lookbacks of 50, 75 and 250, going to cash whenever the curve is under the EMA.


Did it help with the drawdown (DD)? Yes.

Did it help with the profit? Not really.

Imagine that you started this strategy: in half a year it doubled your investment, but then it didn’t give any profit for 6 years. Would you keep playing this strategy? No.

**The problem is that regime 2 becomes too ‘neutral’ a territory for our strategy. In regime 2, our strategy was not a winner, but not a loser either. The original curve flatlined, and the EMA curve fitted onto it.** With a 75-day EMA parameter, the EMA curve reached our original equity curve in 75 days. For the next 4 years, we treaded water.

**Version 2:**

Let’s see the same EMA parameters, but instead of being Cash under the EMA, it plays the inverse-strategy.

We have huge drawdowns.

In theory, it seems that a long-term **EMA like 250 helps a little more, because there are fewer whipsaws.** That is because the smooth EMA250 line only slowly reaches the equity curve of the original strategy in the problematic regime 2.

There are 2 problems with the feeling that the 250 parameter is the best and that we should use it in the future. One is that selecting the optimal historical parameter is a kind of parameter overtuning: by some random chance, this parameter turned out to be better.

The other is that there is no guarantee that future bad (neutral) regimes will last 3-4 years (as regime 2 did); therefore, it is unlikely that the EMA250 will be the best parameter in the future.

**Conclusion:**

I don’t have the magic solution right now.

The ‘playing the equity curve’ technique **helped a little on the profit, a little on the volatility, but it was far from the success I expected.**

One of the main problems is that **investors would very likely stop the strategy after 4 years of treading water.** I expected more from this technique. Maybe I expected too much.

]]>

In further detail, we are going to compare

**1.**

**Linear regression**

– based on the normal equation, deterministic evaluation

**– it is not iterative, so it instantly finds the exact optimum; hence iteration count is not a parameter**

– requires no normalization of inputs, outputs (so there is no Normalization as a parameter)

– has only 1 parameter: lookback days

**2.**

**Logistic Regression (the name shouldn’t mislead you, it is a classification method), binary**

– 2 categories: Buy or Sell (these categories are defined by a %change threshold of 0%)

– gradient descent iteration parameter: 400 (probably enough, because in linear tasks the Cost function is convex, so there is only a global minimum)

– in theory, normalization is a parameter, because we do gradient descent iteration, and normalization would help the gradient descent converge faster.

However, in this case, with this very simple convex Cost function, and because our range of -0.2..0.2 (i.e. -20% to +20%) doesn’t differ too much from the ideal -1..1 range, we could normalize with a 5x multiplier, but that wouldn’t speed up the gradient descent very much.

So we accept losing a very small amount of gradient descent speed. It is not significant.

**3.**

**Logistic Regression (classification), 3 categories,**

– 3 categories: Buy, Cash, Sell signals;

– these 3 categories are defined by the %thresholds of: -1.5%.. +1.5% (so, if Y output is in that range, we regard Y output as Cash)

– gradient descent iteration parameter: 400 (as before)

– we chose **not to normalize the input,** as in the previous case. **That is also a parameter: “Normalization: OFF” from a possible basket of normalizations {mean normalization, mean and min-max range normalization, mean and std normalization, only std normalization, etc.}**
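A sketch of how the 3-category labels could be derived from the next-day %change (thresholds as above; the helper name is ours):

```python
def label_3cat(pct_change, threshold=0.015):
    """Map a next-day %change to a class: Buy(+1), Cash(0), Sell(-1),
    using the -1.5%..+1.5% Cash band described in the post."""
    if pct_change > threshold:
        return 1
    if pct_change < -threshold:
        return -1
    return 0
```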

We do **sensitivity analysis for the lookback days:**


What we can observe in general is that the **classification solutions achieve a smaller final PV** (Portfolio Value).

Why is that?

The reason lies in how the methods handle the training samples.

**For Classification, all training samples are equal, irrespective of their Y magnitude,** since it takes only the Sign() of the %change (+1, 0, -1). If we have a +40% VIX spike on a day, it is treated the same as another day with a +2% VIX change. This has some advantages and disadvantages. In particular, classification is less sensitive to outliers. However, it turns out that exactly these outliers are very important in our problem. On 1st August 2011, the VIX increased by 50% in one day, which was a huge move. It instantly shifted the non-classification (regression) based solution to be positively biased. Even this single outlier had a great effect for the following weeks and months. Afterward, all the predictions were upside biased: the model was more likely to forecast up %changes than down %changes.

However, **for classification, this huge 50% %change was only another sample with a +1 output value.** It took a long time until the classification methods realized that we were in a new regime: one where up days are more likely than down days.

In that sense, **Regression is more agile: it adapts more quickly than Classification to a regime change signaled by an outlier. And it turns out that in this problem, that is better.**

In another problem (say, forecasting house prices), this outlier sensitivity would be counterproductive.

This is very well illustrated in the Ultimate predictor PV chart.

The **Ultimate predictor aggregates the different lookback predictors from 75 to 110 lookback days** and does a majority vote. Here is its PV (Portfolio Value) chart:

**The 2 occasions when Linear Regression outperformed Classification were when the low-VIX regime changed to a high-VIX regime: in summer 2010 and in August 2011. In both cases, regression was quicker to adapt.**

The 3-category case has the lowest drawdown in the PV chart, but the lowest profit too. This is a trade-off. We can go to Cash sometimes; that obviously decreases the drawdown, but as we don’t participate in the market in these less certain times, we leave profit on the table. However, that can be good for a conservative, non-aggressive version of the strategy.

Observe also in the Sensitivity Analysis chart that the **3-category classifier achieves the least PV.** That is somewhat expected, **because it is in Cash about 30% of the time. It probably has less drawdown too.**

It is unexplained, however, why at the far end of the sensitivity chart (more than 150 lookback days) the 2-category classifier performs so poorly (it goes back to the PV = 1 line, having no profit in 2 years), while the 3-category classifier (which is in cash 30% of the time) has a PV = 2 in this region.

**Conclusion:**

We compared regression and classification. **In our prediction problem, regression was better, because it doesn’t suppress the effect of outliers.**

**The binary and the 3-category classifiers perform similarly to each other:** their PVs are about equal (Ultimate version), albeit **the 3-category version has lower drawdown, making it suitable for a conservative implementation.**

]]>

**1. Visualize your data**

Our quest in supervised learning is to find a function f(x) that is likely to have generated the training set. The training set consists of inputs X with output labels Y attached to them. One thing you learn quickly is the importance of analysing your data. There are some problems with that.

On the one hand, the dimensionality: it is very common to have multidimensional data (10, 20+ dimensions); however, we, Earth people, are very poor at visualizing anything more than 3-dimensional.

On the other hand, if the data contains a lot of noise, it is difficult to see any meaningful structure in it.

Luckily, in our experiments, we try to minimize dimensional complexity, mostly to mitigate the problem of overfitting.

We showed (2 posts ago) that the 2-dimensional time series prediction was better for VXX than the 1-dimensional one.

Therefore, we continue with the 2 dimensional case.

Our x1 dimension (horizontal axis) is the %change today; x2 (vertical axis) is the %change yesterday.
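The samples behind such a plot could be assembled like this (the helper name is ours; the sign convention here treats a 0% day as 'down'):

```python
import numpy as np

def make_features(pct_change):
    """Build the 2D samples: x1 = today's %change, x2 = yesterday's %change,
    y = direction of tomorrow's %change (+1 up, -1 down or flat)."""
    x1 = pct_change[1:-1]                        # today
    x2 = pct_change[:-2]                         # yesterday
    y = np.where(pct_change[2:] > 0, 1, -1)      # tomorrow's sign
    return np.column_stack([x1, x2]), y
```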

Having 3 years of historical data, let’s look at it:

This plot shows whether tomorrow’s %change is positive (green + sign) or negative (red o sign).

Do you see any meaningful structure?

Not easy. Because of the large amount of noise (and unfortunately, it is not white noise).

Some things can be concluded, though:

– there seem to be more red dots overall (expected: more VIX down days)

– green (VXX up) days are probable when either today or yesterday was up strongly (expected: volatility brings more volatility)

– down VXX days are probable when the market is peaceful (small up/down moves in the last 2 days)

But overall, the plot **looks so random that it is difficult to imagine how we can separate the two groups: the positive days from the negative ones.**

Obviously there is no linear separator.

This plot is useful if we do **classification into 2 groups (Up, Down),** but what if we would like to do **classification into 3 groups:**

**Bullish days, Bearish days, Cash days.** Cash would mean that the %change was mild: -1%..+1%.

Let’s make a plot. The black diamonds represent those Neutral days.

More or less the same can be said. Some extra conclusions can be made, like:

– there are no black dots (Neutral days) if today’s or yesterday’s %change is extreme (so the Neutral days usually happen in less volatile regimes)

– if the VXX %gain was higher than +20% today (2 cases), it was followed by another VXX increase

– when the VXX %gain was negative today and negative yesterday, it is likely to be negative tomorrow (the VXX has a daily follow-through, momentum)

**2. Visualize your final fitted prediction model (f(x))**

Let’s suppose we do a Linear Regression learning described in the previous posts.

What does the decision surface look like?

It looks something like this:

We draw the decision boundary as a black dotted line. It represents the points where f(x) is zero; it separates the up forecasts from the down forecasts. The plot is dated 2011-10-28.

The prediction can be read off the plot manually, if we know the %change of yesterday (vertical axis) and the %change of today (horizontal axis). For example, if both are 0%, the point falls in the yellowish (upper) area, so the prediction is a positive tomorrow %change.

Note that **observing f(0,0) is a good way to evaluate whether the current model is upside or downside biased.** Because it was trained on the last 93 trading days’ samples, and since August 2011 we have been in a very volatile period, it is no shock that f(0,0) is positive, so the model mostly predicts positive values. (Positively biased.)
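In the linear model of the previous posts, f(x1, x2) = beta0 + beta1*x1 + beta2*x2, so evaluating it at the origin gives f(0,0) = beta0, and the bias check reduces to the sign of the intercept (the helper name is ours):

```python
def model_bias(beta0):
    """In f(x1, x2) = beta0 + beta1*x1 + beta2*x2, f(0,0) = beta0,
    so the intercept's sign reveals the model's directional bias."""
    return "Upside biased" if beta0 > 0 else "Downside biased"
```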

As we anticipate negative VXX changes in the coming December/Xmas season, it is not advisable to start trading the strategy right now.

]]>

To be frank however, there are more parameters:

– whether selecting 1D or 2D case; or

– the machine learning algorithm used: (Normal Equation or Gradient Descent, etc.), or

– selecting the kind of instrument: VXX

However, let’s say that those parameters are not really parameters per se.

They were determined much earlier, by some other fundamental ideas we believe in, and therefore we don’t optimize those parameters. For example, we just accept that we want to estimate the daily VXX (not the RUT or AAPL). There is nothing really to fine-tune in that.

Therefore those parameters are not the focus of any sensitivity analysis.

After the prologue, let’s do some sensitivity analysis on the lookback days.

Note that we had to fix the startDay of these algorithms in these backtests. Because we use a maximum 200-day lookback, the first estimate can be calculated for day 201 (in the 1D case) or day 202 (in the 2D case).

In this test, to make a fair competition between the different lookbackDays, we started all of them from day 201 or 202.

In theory, the 50-day lookback version could be started from day 51. However, that would give an extra advantage to the shorter lookbackDays (they would have a longer period to play).

We want a fair comparison, so we cannot allow that.

Note that this is the reason why, for example, the previous post (1D case) showed that lookbackDays = 50 was the best, achieving a 10x multiplier.

That result cannot be reproduced here, for the aforementioned reason.

Sensitivity Analysis (487 days = less than 2 years, assuming 250 trading days):

**1. 1D case:**

Let’s plot the final portfolioValue as a function of lookbackdays.

In the chart, the X axis is lookbackDays – 1, so the chart is shifted by one, but that is OK.


Based on that: the optimal lookback is somewhere between 30 and 60. Is it sensitive to the parameter? Yes, as usual. For example, the best parameter value gives a 7x multiplier, the worst a 1x multiplier; so we can say it is quite sensitive to the parameter.

Note the range of 2-20 training samples: that is hardly enough samples; I wouldn’t consider that area useful at all, even if it shows good performance.

So, the optimal value of the parameter is somewhere between 30 and 60. One strategy (if we want to avoid parameter fine-tuning) is to just play the middle: 45.

**Do you see the danger here? Someone who optimized the parameter and hasn’t done any sensitivity analysis thinks that it returns 7x per 2 years, and starts to play the strategy. But in real life, he can be unlucky and get only a 1-2x return in the future (or he can be lucky and get a 14x return). The point here is that the expected future return is lower than what a fine-tuned parameter backtest shows.**

One idea to make it more stable:

Do different parameter runs (from 30 to 60) and average their predictions; this may partially eliminate the parameter fine-tuning bias.

So, let’s define our UltimateEstimator by aggregating the decisions of the 30..60 lookbackDays versions.
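A sketch of such an aggregation as a sign-majority vote over the per-lookback forecasts (the function name is ours; averaging the raw predictions is an equally valid variant):

```python
import numpy as np

def ultimate_signal(forecasts):
    """Majority vote over per-lookback next-day %change forecasts.

    forecasts: iterable of forecasted %changes, one per lookback (e.g. 30..60).
    Returns +1 (long), -1 (short) or 0 (tie)."""
    s = int(np.sign(list(forecasts)).sum())
    return 1 if s > 0 else (-1 if s < 0 else 0)
```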

The portfolio value curve of the UltimateEstimator gave PortfolioValue of 3.90:

The UltimateEstimator is between the extremes: it is better than the worst (2x multiplier), but it is worse than the best (7x multiplier).

However, in real life, it is better to use this kind of estimator; it decreases the lucky/unlucky factor of depending on parameter fine-tuning, on a concrete parameter selection.

Also, in general, the aggregated Ultimate profit curve is smoother (less likely to contain DD); albeit the -50% DD is still present here, even with that DD it is smoother than the individual strategies.

**2. 2D case:**

Let’s plot the final portfolioValue as a function of lookbackdays:

Based on that: the optimal lookback is somewhere from 70 to 115.

Is it sensitive to the parameter? Yes, as usual.

The best parameter value gives a 7.5x multiplier, the worst a 2x multiplier.

Even with the unluckiest pick of the worst parameter, the profit was 2x (so, it is not a loss). That is good.

The only loss is in the range of 2-10, and 35-40. There are not enough training samples there.

Someone, who wants to avoid parameter fine tuning bias, **may choose the middle of the range: 93.**

Another idea to make it more stable is the same UltimateEstimator: do different parameter runs (from 75 to 110) and average their predictions; this may partially eliminate the parameter fine-tuning bias.

**Aggregating the decisions of the 75..110 lookbackDays versions, the resulting UltimateEstimator** gave a PortfolioValue of 4.24.

**That is between the extremes: it is better than the worst (2x multiplier), but it is worse than the best (7x multiplier).**

**3. Conclusion:**

Note that with the Ultimate(75-110) version, we eliminated the fixLookbackDays parameter, but we introduced 2 new parameters (instead of 1): 75 and 110. :) So we again have some parameter bias; albeit note that we wanted to optimize the fixLookbackDays parameter, and we haven’t ‘really’ optimized the range parameters 75 and 110.

The important note is that **we introduced 2 new parameters,** but the final result is not really sensitive to changing these 2 new parameters. Changing 75 to 76 hardly changes anything, while in the fixLookbackDays case, changing the parameter from 93 to 94 had a more significant effect on the final outcome.

**This is the key message of this post: we cannot eliminate parameters, but what we can do is ensure that if we have parameters, the final outcome is not significantly sensitive to the parameters used.**

**Use 1D or 2D?**

Comparing the 200 days long 1D vs. 2D **sensitivity chart** (not the Ultimate Portfolio Value chart), we prefer the 2D inputs case.

**The maximum achieved is similar** to the 1D case (a max 7x multiplier was achieved: the 2D case achieved it on about 3 occasions, the 1D case only once).

The **minimum is better in the 2D case. In the 1D case, if we pick the wrong parameter, we can end up with a profit of only 1x.**

**In the 2D case, however, even if we pick the wrong parameter, we still have a profit of 2x.**

Comparing the range based **Ultimate Portfolio Value charts,**

the 2D case is better too, for example because of the smaller DD (see the big DD that we had in the last 3 weeks in the 1D case). **The 2D case equity curve looks smoother** too.

]]>

Let’s assume we want to forecast the next-day %change of VXX as the output variable, based on the today %change of the VXX and the yesterday %change.

The linear equation would look like this.

**Y = beta0 + beta1*X1 + beta2*X2**

where

X1 = yesterday %change,

X2 = today %change,

Y = next day %change.

The unknowns are beta0, beta1, beta2. We want to determine (learn) them.

Let’s suppose we learn them by looking back D days in history, where D can be 20, 50, 100, 200 days.
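A sketch of such a lookback-window OLS fit in numpy (the function name and the synthetic test data are ours; `np.linalg.lstsq` solves the same least-squares problem as the normal equation):

```python
import numpy as np

def fit_2d(pct_change, lookback):
    """OLS fit of Y = b0 + b1*X1 + b2*X2 over the last `lookback` samples,
    where X1 = yesterday's %change, X2 = today's %change, Y = next day's."""
    y = pct_change[-lookback:]                   # next-day %changes
    x2 = pct_change[-lookback - 1:-1]            # today
    x1 = pct_change[-lookback - 2:-2]            # yesterday
    X = np.column_stack([np.ones(lookback), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta                                  # [b0, b1, b2]
```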

The corresponding equity curve charts:

Conclusion:

– not much to add. **They are all similar. That is good, because they are consistent.**

– the max DD is from 3.5 to 1.5: -57% (scary)

– in the 1D case, lookback50 was the best; in the 2D case, lookback20 is the best (probably just randomness)

– **they are similar to the 1D charts. So it seems that introducing another variable (yesterday’s %change) doesn’t give more useful information for the prediction. It gave more information, but that information was not useful for extra profit. This can be typical in machine learning: introducing a completely random extra variable (a new, non-dependent dimension) can even destroy the prediction power of the simpler case.**

– **Based on these charts, we would stick to the simpler 1D Linear Regression rather than the 2D version. That may have a little better profit potential.**

]]>

The **Stanford University Machine Learning course** mentioned in the previous blog post is not only theoretical, but **very practical indeed.** I would say it is even more practical than theoretical; that is bad news for theoretical mathematicians, but good news for applied scientists and programmers. The course forces students to write homework programs every week. The suggested language is Octave, which is a free, open-source version of Matlab. One of the topics in the last week was Multivariate Linear Regression and two approaches to its solution: the Normal Equation and Gradient Descent.

In the context of this blog, we have pursued a Neural Network based solution to the problem, but for this post, let’s just solve the matrix equations.

In this post, let’s assume we want to forecast the next day %change of VXX as an output variable, based on the today %change of the VXX.

The linear equation would look like this.

Y = beta0 + beta1*X ,

where

X = today %change,

Y = next day %change.

Linear regression finds the coefficients of the line that best fits the data, like here:

I usually say that from the sample points we regress back the line (we determine it, we guess it) that most likely generated those sample points.

The unknowns are beta0 and beta1. We want to determine (learn) them.

Let’s suppose we learn them by looking back D days in history, where D can be 20, 50, 100, 200 days.

beta0, beta1 = ?

How to solve it?

The solution is the OLS estimator, where OLS stands for Ordinary Least Squares.

In a nutshell, you have to evaluate one equation, which, using Octave/Matlab matrix operations, is pretty straightforward.

For the geeks, see the details here: http://en.wikipedia.org/wiki/Linear_regression

I would like to stop here a little bit. Just **look at the equation: Beta = (X’X)^-1 * X’y.**

**Why is this the equation? The proof is pretty straightforward.**

Consider the original equation:

X*Beta = y.

Try to determine Beta.

We cannot multiply both sides by X^-1. Why? because X is not a square matrix. **If X is not a square matrix, there is no inverse matrix.**

So, first multiply both sides by X’ (X transpose) to get

(X’X)*Beta = X’y

Now (X’X) is a square matrix, so it can have an inverse (assuming it is non-singular). Let’s multiply both sides by this inverse:

(X’X)^-1 * (X’X) * Beta = (X’X)^-1 * X’y

which is equivalent to:

Beta = (X’X)^-1 * X’y
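The same equation in numpy, with hypothetical data (in practice `np.linalg.lstsq`, or Octave's `X \ y`, is numerically safer than forming the explicit inverse):

```python
import numpy as np

# Hypothetical 1D data: x = today's %change, y = next day's %change.
pct = np.array([0.012, -0.008, 0.021, -0.015, 0.004, 0.009, -0.011])
x, y = pct[:-1], pct[1:]

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
beta = np.linalg.inv(X.T @ X) @ X.T @ y     # Beta = (X'X)^-1 * X'y
```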

**The advantages of the OLS method compared to the Neural Network or Gradient Descent are that it is**

**– deterministic.** All the Neural Network solutions are randomized, therefore requiring many random runs for backtesting. In contrast, **OLS requires only 1 backtest.**

– easy to compute (takes half a second)

– **OLS doesn’t require normalization of the samples.**

– the whole method has only 1 parameter: lookback days. That is in contrast to the NeuralNetwork based solution, which has additional parameters: lookbackDays, outlier threshold, numberOfRandomRuns, weighting of the decision of the neural network, normalization parameters (SD or min-max normalization; range normalization or mean normalization too?).

**– having only 1 parameter significantly reduces the parameter fine-tuning bias that distorts the results of many backtests.**

– The disadvantage of OLS is that it can capture only the linear relations of the inputs vs. the output, in contrast to a Neural Network, which can describe any continuous function.

In our concrete example, we took the VXX close prices from its inception, which is about the beginning of 2009.

We ran the algorithm with lookback days = 20, 50, 100, 200.

We also plot the SMA70 of the strategy (as a means to use the playing-the-equity-curve technique).

The return curves of the strategy look like this:

What can we realistically say? The charts are similar.

– For **the 200-day lookback,** we can see it went from 1 to 3 in about 2 years. **That is ~70% CAGR. Not bad.**

– However, the **maxDD was -50% (summer 2010), which is pretty high.**

– The best performer was the **50-day lookback** (probably that is what should be played in real life). It **multiplied the initial deposit by 10 in 2.5 years. That is about 150% CAGR,** **but we consider this performance an outlier.** Also note how volatile it was in August 2011 (albeit volatile in the favoured direction).

– someone could start the strategy when the profit curve is above the SMA70, as it is now (as a means of money management)

– someone could start the strategy when the profit curve is higher than the previous highest high (maybe it is safer: fewer whipsaws)

On the other hand, it is worth mentioning that these are only theoretical results. Real life can be harsher, sometimes because of the parameter fine-tuning bias, sometimes because real-life order execution is not perfect (ask-bid spread, commissions, short-sale orders not executed because there were not enough shares available to borrow, etc.).

In future posts, we will examine the 2D input case, and we will also do some sensitivity analysis on the ‘lookbackDays’ variable.

]]>

I would like to draw your attention to a unique Stanford University initiative. This season, for the first time ever, you can participate in a unique research project that intends **to change the future of education.**

Stanford University has announced that it will make 2 courses available online worldwide!

**-Introduction to Artificial Intelligence**

**-Machine Learning**

An exceptional thing about these courses (compared to other online courses like the MIT OpenCourseWare) is that **it is not simply viewing offline videos later,** anytime you have free time: you do homework, assignments, tests and exams as you would if you were really a Stanford University student. You even get a certificate of completion and a **certificate** of your own results, **comparing your results to the rest of the ‘world’.**

The writer of this blog is very pleased with this announcement because of:

– the firm belief that the **‘teacher’ as a job will be mostly outdated** in the next century. I reckon in 30 years we will need only 10% of the teachers we have now.

– I welcome the **integration of universities/courses.** This is the most efficient way to distribute the best tutors to the widest audience. I would rather see only the best 500 universities in the world survive than have 5000 (poor) universities scattered all around. Having 5000 universities is a very inefficient/costly way of distributing knowledge.

– I welcome the idea that **knowledge is public,** available to anyone from the skyscrapers of New York to the slums of India. No means testing, no university fees. Everyone is equal, and **it is possible for everyone (with enough diligence) to achieve a university degree.**

Topics include:

”

supervised learning (generative/discriminative learning, parametric/non-parametric learning, neural networks, support vector machines); unsupervised learning (clustering, dimensionality reduction, kernel methods); learning theory (bias/variance tradeoffs; VC theory; large margins); reinforcement learning and adaptive control.

”

Yummy.

Probability and linear algebra is a requirement, of course.

About 140K students have applied for the Artificial Intelligence course and about 60K for the more advanced Machine Learning course.

Note that there are 2 kinds of virtual students: those who only follow the videos (**spectators**) but do no homework or tests **receive no certification.**

Currently, it is not published what percentage of the students are in the spectator club. And you can change your status later during the term: if you find that you don’t have enough time, you can switch to being a spectator anytime.

The homepages:

https://www.ai-class.com/home/

http://www.ml-class.org/course/class/index

Note the time requirement, though:

Stanford advises spending **10 hours per week on one course. That means 20 hours per week for the two courses.** Those who don’t have enough spare time can consider taking only the simpler Artificial Intelligence course, albeit take into account that if the Machine Learning course doesn’t start next year, you have missed your chance.

We **encourage everybody** who has some time to take part in these excellent initiatives, **become a student of Stanford University,** and be (a little) proud that you **participate now in something that is the future of university-level education.**

]]>

This event may be regarded as a very important milestone in the life of this blog. The sole purpose of the research (recorded in this blog) was to develop an algorithm based on Machine Learning (preferably of the Artificial Neural Network kind) that can be played live on the stock exchange.

We are happy to announce that we reached that milestone.

Actually, **2 versions** are played now.

**1. The Aggressive version.**

This one is the

It means its inputs are today’s and yesterday’s price changes of the RUT index.

We don’t use the day-of-the-week input here. It only drags down the performance.

It is a risk taker. **It never goes to cash.**

You can check the previous blog articles for reports on its performance numbers. The unleveraged version **did about 35% gCAGR, with a 40% drawdown,** in the past. We know nothing about the future.

**2. The Conservative version.
This one actually has 4 ensemble groups:
– ANN(T-0)
– ANN(T-0) // the same as the first group; mostly for stability
– ANN(T-1)
– ANN(day of the week for T-0)**

All 4 groups have to agree. They have to be in consensus.

There are 3 possible scenarios:

– All 4 groups vote +1 for next day: consensus is Up

– All 4 groups vote -1 for next day: consensus is Down

– otherwise: consensus is cash
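The three scenarios above can be sketched as a simple unanimity rule (the function name is ours):

```python
def consensus(votes):
    """Conservative ensemble rule: act only on unanimous agreement.

    votes: one +1/-1 vote per ensemble group (4 groups in this post)."""
    if all(v == 1 for v in votes):
        return "Up"
    if all(v == -1 for v in votes):
        return "Down"
    return "Cash"
```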

So, if this strategy is **not convinced,** not confident enough about the next day’s direction, **it goes to cash.** That is the conservative approach.

This conservative version had **about 18% CAGR and 30% drawdown** in the backtests.

Past **backtests are based on playing the unleveraged RUT index.** However, it is not possible to play the RUT index itself in real life; we either play the futures or ETFs. We **picked the double ETFs (ultra and ultra short)** and we play those.

This is how these strategies performed in the last month:

Interestingly, the **Aggressive version is the laggard,** albeit we expected it to have better performance than the Conservative one.

It has to be noted that we were lucky with the timing: had we started the portfolio 2 weeks earlier, the profit wouldn’t be as good as it is now.

Obviously, **the period is too short** to be happy about it or to draw serious conclusions. So, let’s wait and follow them.

Real-life trading will be extended with another trick. We plan to use some kind of money management, in case the strategy turns sour; for example, the **‘playing the equity curve’** technique.

It hasn’t been developed yet. It **should improve future drawdowns.**

]]>

**1. The Book**

I can wholeheartedly recommend a book called

The Cartoon Guide to Statistics.

There is another similar piece of work,

The Manga Guide to Statistics

I took a quick look, but its story is built around the romantic relationship of a young Japanese girl.

That is definitely not how I would like to see one of the most difficult and serious parts of mathematical science presented.

Never mind, manga fans may be interested in it.

The Cartoon Guide to Statistics is a very well organized book. It touches almost all parts of statistics, though of course it cannot go very deeply into the topics. It is funny, amusing, and enjoyable to read.

I contend that an average secondary-school student should have no problem with the difficulty, albeit some reviewers (interestingly, mostly UK reviewers) complained that the book is too complex and they couldn’t follow it. Trust me; it is a very easy book.

Strongly recommended for all math students. (Hopefully in the first year at the university)

However, reading this book that so nicely summarizes what statistical science is left me with an uneasy feeling.

I hope I don’t offend anybody, but to me (a very personal opinion), the whole framework of statistical tools looks like only a mathematical toy. In real life, it doesn’t work, it cannot be used, it cannot be trusted. We can play with it, as we play with toys, but what for?

It is a nice math framework, but real life doesn’t play by the rules defined in statistics.

Instead of talking vaguely about why I despise it, I try to give some concrete examples.

1.1

One of the tools that I think is a joke is called re-sampling.

A technique that treats the sample as if it were the population. It has other names, like randomization, jackknife, bootstrapping. To me, it looks like a funny but surely non-working tool in real life. Yes, it is true that you can prove mathematically that it works, but in real life, would you use it?

Let’s suppose you watched how a stock traded for a week. One week is clearly not enough to draw statistical (or any other) conclusions that you would trust enough to risk your own money.

Now, clever mathematicians invent a tool called resampling (or bootstrapping, whatever).

Based on that 1-week observation, which has only 5 samples, they generate another 500 samples.

Now you have 500 samples. So, you can make reliable statistical conclusions!… or not.

It can be proven mathematically that your generated 500 samples are unbiased estimates for the other, non-observed samples. However, those 5 initial samples are based only on that 1 observed week. Maybe that week was the Xmas week. Even if you generate 500 samples from it, would you risk your money on non-Xmas weeks based on the 500 artificially generated samples?

No.

However, as the unbiased nature can be proven mathematically, you have a false sense of confidence in your method.
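A minimal sketch of the bootstrap idea (the 5 ‘daily returns’ are made-up numbers for illustration): we resample the 5 observations with replacement to fabricate 500 pseudo-samples, but every fabricated value is still one of the original 5 numbers.

```python
import random

random.seed(42)

# 5 made-up daily returns observed during one (possibly atypical) week
week = [0.012, -0.004, 0.008, -0.015, 0.021]

# bootstrap: draw 500 pseudo-samples of size 5, with replacement
boot_means = []
for _ in range(500):
    resample = [random.choice(week) for _ in week]
    boot_means.append(sum(resample) / len(resample))

orig_mean = sum(week) / len(week)
boot_avg = sum(boot_means) / len(boot_means)
print(orig_mean, boot_avg)
# the bootstrap distribution clusters around the original sample mean,
# but every resampled value is one of the 5 observed numbers: no new
# information about the non-observed weeks has been created
```

The 500 means look statistically impressive, yet they know nothing the original week didn’t know.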

1.2

Another suspicion: a typical thing in statistics is that chapters start with ‘let’s suppose that the results of each trial are independent’.

What?

Where in life can you find something that you measure many times where the measurements are independent? Yes, maybe rolling dice (maybe even that is not true), but what about the stock market? Can you assume that daily stock returns are independent from each other?

You can assume it; at your own peril! Because it is not true.

Then, if it is clear that statistics can work only in artificial, nonexistent scenarios, why should we use it at all?

1.3.

Another simplification: let’s say we make a statistic about the repair cost of crashed cars.

We take a sample of 10 crashed cars, for example. Can we assume that the repair cost samples are independent, normally distributed samples of a random variable in real life?

Absolutely not. Why would they be independent? Assume there was a heat wave last week in the country. Half of the cars crashed in relation to this event. Cars without air conditioning (more tired drivers) are overrepresented. Are the repair costs of air-conditioned and non-air-conditioned cars different? Yes, non-air-conditioned cars are usually cheaper, less costly to mend. These 10 sample cars may not be independent.

Is the other assumption, that the repair cost is normally distributed, true? Absolutely not. If the engine of the car is damaged, the repair cost is much higher.

Therefore the distribution probably has at least 2 peaks: one for the cases when the engine is not damaged, another for the engine-damaged cars.
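A quick sketch (with made-up cost figures) shows how badly such a two-peak mixture fits a normal model: almost no sample falls near the overall mean, exactly where a Gaussian would concentrate its mass.

```python
import random

random.seed(1)

# made-up repair costs: ~80% minor damage, ~20% engine damage
costs = []
for _ in range(1000):
    if random.random() < 0.8:
        costs.append(random.gauss(1000, 200))   # engine intact: cheap repair
    else:
        costs.append(random.gauss(8000, 800))   # engine damaged: expensive

mean = sum(costs) / len(costs)   # lands between the two peaks, around 2400
near_mean = sum(1 for c in costs if abs(c - mean) < 500)
print(mean, near_mean)
# a Gaussian with the same mean and SD would put roughly 14% of the samples
# within +/-500 of the mean; the two-peak mixture puts almost none there
```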

So, it is nice to use statistics for real life, but be aware that you are making many assumptions that are simply not true, and that the mathematical tools were not designed to be used with such samples.

At least, don’t expect a correct answer from your statistician.

1.4.

The book starts with a sentence:

**“Statistics quantifies uncertainty. To make categorical statements, with complete assurance about their level of uncertainty.”
Complete assurance…**

That is a joke.

Maybe, in the case of rolling dice, because a ‘perfect’, non-tampered die behaves nicely: each face is equally likely, and the sum of many rolls approaches the Gaussian distribution.

But what about real life? Like the stock market. Stock prices are very far from behaving nicely.

Someone may say: OK, don’t use Gaussian statistics, use a power-law distribution.

But do you honestly believe that a stock price behaves according to a power law? No.

My firm belief is that there is no mathematical formula that can describe that distribution.

The stock price distribution lives in its own world. It doesn’t obey the law of mathematics.

These are only 4 reasons why I have the feeling that statistics is only a childish toy. It is a tool that we use to trick ourselves into believing that we can understand and describe the world.

With all its delicate details and mathematical legerdemain, statistics is just a clever game for kids:

it has its rules, you can use it to amuse yourself, you can think that you are clever because you use it, but it is far from usable in real-world situations.

Then why should we bother with it at all?

Maybe the answer lies in a quote from Einstein:

**“One thing I have learned in a long life: that all our science (‘math’),
measured against reality is primitive and childlike
– and yet it is the most precious thing we have.”**

I couldn’t agree more.

Or equivalently the quote from Einstein: “God does not play dice.”

My interpretation of this quote is that the Universe (not God; Einstein was not religious) works in a way that cannot be described by the simple probabilities of rolling dice (the Gaussian distribution).

Some people share this kind of interpretation with me; somebody on a web forum interpreted it as “probability / statistics is wholly inadequate to explain/model real world quantum effects”.

You have the right to disagree with me on this interpretation. The most popular interpretation is that “The Universe is not random, but deterministic”, which is a viable rendition too.

**2. Bessel’s correction in SD**

It is universal that people who don’t understand statistics try to use it. (Me among them.)

**A typical misunderstanding** concerns the standard deviation (SD): whether to use N or N-1 in the denominator.

The rule of thumb is that we divide by N in the population (or model) standard deviation, and we **divide by N-1** in the sample SD formula.

When some (less mathematically educated) people see that N-1 is used in an equation in an article or a book, they even suggest, with great confidence, that it is wrong and the author made a mistake. (Because if there is only 1 sample, then we would divide by zero.)

However, they are wrong.

Define 2 statistics:

**A. ‘standard deviation of the sample’ (SDoS)**

This one uses N in the denominator. However, this estimator, when applied to a small or moderately sized sample, tends to be too low: it is a biased estimator.

**B. ‘sample standard deviation’ (SSD)**

This one uses N-1. This is the most commonly used, adjusted version.

This correction (the use of N-1 instead of N) is known as Bessel’s correction. The reason for this correction is that SSD^2 is an unbiased estimator of the variance of the underlying population. (Note that even SSD is not an unbiased estimator of the population SD; only SSD^2 is an unbiased estimator of the population variance.)

Bessel’s correction corrects the bias in the estimation of the population variance, and some (but not all) of the bias in the estimation of the population standard deviation.

So, the term standard deviation of the sample (SDoS) is used for the uncorrected estimator (using N) while the term sample standard deviation (SSD) is used for the corrected estimator (using N – 1). The denominator N – 1 is the number of degrees of freedom in the vector of residuals.

”

That is, when estimating the population variance and standard deviation from a sample when the population mean is unknown, the sample variance is a biased estimator of the population variance, and systematically underestimates it. Multiplying the standard sample variance by n/(n – 1) (equivalently, using 1/(n – 1) instead of 1/n) corrects for this, and gives an unbiased estimator of the population variance.

A subtle point is that, while the sample variance (using Bessel’s correction) is an unbiased estimate of the population variance, its square root, the sample standard deviation, is a biased estimate of the population standard deviation; because the square root is a concave function, the bias is downward, by Jensen’s inequality. There is no general formula for an unbiased estimator of the population standard deviation.

”

One can understand Bessel’s correction intuitively through the degrees of freedom in the residuals vector:

(X1-X_avg, X2-X_avg, …, Xn-X_avg)

where X_avg is the sample mean. While there are n independent samples, there are only n-1 independent residuals, as they sum to 0.

In intuitive terms, we are seeking the sum of squared distances from the population mean, but end up calculating the sum of squared differences from the sample mean, which is (in effect) defined as the position closest to all the data points, i.e. the point that minimizes that sum of squared distances.

This estimate will therefore always underestimate the population variance, because it is the result of a minimization. So, this is another way to see that the SDoS understates the population variance.
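A small simulation can make this visible (illustration only; the population is a standard normal, so the true variance is 1.0): dividing by N systematically underestimates the variance, while the N-1 version is unbiased on average.

```python
import random

random.seed(7)

N, TRIALS = 5, 20000   # many small samples from a population with variance 1.0

sum_sdos2 = sum_ssd2 = 0.0
for _ in range(TRIALS):
    xs = [random.gauss(0, 1) for _ in range(N)]
    m = sum(xs) / N                       # sample mean
    ss = sum((x - m) ** 2 for x in xs)    # squared residuals around the sample mean
    sum_sdos2 += ss / N                   # SDoS^2: divide by N
    sum_ssd2 += ss / (N - 1)              # SSD^2: divide by N-1 (Bessel)

avg_sdos2 = sum_sdos2 / TRIALS   # close to (N-1)/N = 0.8: biased low
avg_ssd2 = sum_ssd2 / TRIALS     # close to 1.0: unbiased
print(avg_sdos2, avg_ssd2)
```

The uncorrected estimator averages about (N-1)/N of the true variance, exactly the bias that Bessel’s correction removes.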

The necessity of using N-1 in the denominator is illustrated by the proof that SSD^2 is an unbiased estimate of the population variance.

If we put N instead of N-1, the estimator is not unbiased.

The problem, as you can see, is that the samples are also used to generate the sample mean (as an estimator of the population mean), and that subtracts one Sigma^2 from the expected sum of squares.

– Ok, so we see that if we want **unbiased estimation for the variance, we can have it (using N-1 in the denominator). However, this estimator is not unbiased for the SD.**

So, whether we use N or N-1 in the SD formula, neither will be unbiased (because the square root is concave).

But it is not too difficult to see that **using N-1 is more accurate, so we should use it not only for the variance, but for the SD too.**

What usually bothers people about using N-1 is that if we have only 1 sample, then it is not possible to estimate the SD of the sample or the SD of the population. Because: division by zero.

And then they think the problem is in the formula, and we should use N instead of N-1.

However, it is not true. The formula is correct.

The problem is in their thinking. If we have only 1 sample, it is really not possible to estimate the population mean and the population SD at the same time.

It is crazy to expect that it is possible at all.

To estimate the SD, we have to estimate the population mean first.

**So, it is really true: if we have only one sample, we can estimate one thing only: the mean.
With that estimation, we used all our 1 degree of freedom (all our data), and there is no extra information to use for estimating the variance.**

Let’s suppose you have only 1 sample and try to use the N version for calculating the variance.

What will be the result? Variance = (X1-Mean)^2. Because Mean = X1, it will be zero.

Do you really accept that this constant zero (independent of the sample) is a good estimate for the population variance?

Absolutely not. It is much better to say that we cannot calculate the variance/SD than to report a variance (the zero) that we know for sure is wrong.

**This should be treated in our mind as the division by zero case. Division of any number by zero is not defined.** Similarly,

**SD of any sample having only 1 observation is not defined. Get used to it.
Algorithms, programs should return NaN (Not a Number), instead of 0 in that case.**
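A sketch of that convention (a hypothetical `sample_sd` helper, for illustration): return NaN rather than 0 when fewer than 2 observations are available.

```python
import math

def sample_sd(xs):
    """Sample standard deviation with Bessel's correction (N-1).

    With fewer than 2 observations the SD is undefined, so return NaN
    instead of a misleading 0.
    """
    n = len(xs)
    if n < 2:
        return float("nan")
    m = sum(xs) / n
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))

print(sample_sd([3.14]))           # nan, not 0
print(sample_sd([1.0, 2.0, 3.0]))  # 1.0
```

Python’s own `statistics.stdev` follows the same spirit: it refuses to compute an SD from a single data point.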

**3. Confidence in our backtested CAGR**

Let’s use statistics for gaining some confidence in our Neural Network prediction system.

Our Neural Network algorithm is not deterministic, because of the random initialization of the NN weights. The backtest result is therefore a random variable. Its random results fluctuate around the population mean, the true expected value of the backtest. The backtests run over the last 23 years, using the ANN(T-0, T-1) version.

We **ran 13 backtest experiments and found the following annual % profits (gCAGR):
35.08%, 33.63%, 33.29%, 37.00%, 35.02%, 35.68%, 36.14%, 33.17%, 33.94%, 34.50%, 34.04%, 35.49%, 34.75%.**

**The arithmetic average is 34.75%. That is the sample mean.** However, it is not the population mean: the true expected value.

Is this 34.75% a good number? Should we be happy about it? Should we trust it enough to play it in real life?

**There are 2 ways to answer the question: is 34.75% gCAGR good enough or not?** (Both approaches are viable.)

**A. with confidence intervals:** e.g. with 95% confidence, we can say that the true gCAGR is between X, Y

**B. with hypothesis testing:** Assume gCAGR=0% (=H_0); what is the chance of having 34.75% as the sample mean? Can we disprove H_0?

Note that, intuitively, looking at these 13 numbers, we feel that the strategy is stable. Sensitivity analysis would prove it robust.

But let’s suppose another backtest gave these numbers:

0%, -42%, 0%, +42%, 173.75%

The mean of these tests is also 34.75%. Would you trust this algorithm with your own money?
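The contrast is easy to verify (a quick check in Python): both result sets share the 34.75% sample mean, but their sample standard deviations differ enormously.

```python
import statistics

stable = [35.08, 33.63, 33.29, 37.00, 35.02, 35.68, 36.14,
          33.17, 33.94, 34.50, 34.04, 35.49, 34.75]   # our 13 backtests
erratic = [0.0, -42.0, 0.0, 42.0, 173.75]             # the hypothetical backtest

print(statistics.mean(stable), statistics.mean(erratic))    # both ~34.75
print(statistics.stdev(stable), statistics.stdev(erratic))  # ~1.14 vs ~83
```

The mean alone hides the risk; the spread is what separates a playable strategy from a gamble.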

The point of the following 2 kinds of analysis is to gain enough confidence in the backtest results.

**A.
We try to estimate the NN true CAGR with some degree of confidence.**

The 34.75% is only the sample mean. Let’s calculate the sample standard deviation (SSD): SSD = 1.14%.

This is from the excel table that summarizes it:

In human language form:

-“We can say with 68% confidence that the true population mean CAGR is between 33.60% and 35.89%”

-“We can say with 95% confidence that the true population mean CAGR is between 32.46% and 37.04%”

-“We can say with 99% confidence that the true population mean CAGR is between 31.32% and 38.18%”

That looks good, because **even assuming the worst case = 31.32% (with 99% confidence), we have a positive CAGR; we don’t lose money.**

However, even with that, there is a 1% chance (once in 100 times) that the CAGR is not in that range, so we may not be as profitable as we expect.

And for the sake of completeness, a very useful statement:

-“We can say with 100% confidence that the true population mean CAGR is between -infinity and + infinity”
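As a cross-check, here is the textbook confidence-interval calculation in Python. One caveat flagged here as our own reading: the ranges quoted above correspond to mean ± z·SSD, while the standard interval for the *mean* uses the standard error SSD/sqrt(n) and is even tighter, so the positive lower bound holds either way.

```python
import math
import statistics

gcagr = [35.08, 33.63, 33.29, 37.00, 35.02, 35.68, 36.14,
         33.17, 33.94, 34.50, 34.04, 35.49, 34.75]  # the 13 backtest CAGRs, in %

mean = statistics.mean(gcagr)        # ~34.75
ssd = statistics.stdev(gcagr)        # ~1.14 (N-1 in the denominator)
se = ssd / math.sqrt(len(gcagr))     # standard error of the mean, ~0.32

for conf, z in [(68, 1.0), (95, 1.96), (99, 2.576)]:
    lo, hi = mean - z * se, mean + z * se
    print(f"{conf}% CI for the mean CAGR: {lo:.2f}% .. {hi:.2f}%")
# even the 99% lower bound stays far above 0%
```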

So, we can statistically prove with 95% confidence that the strategy is good, but who cares? It is only playing with numbers. Real life doesn’t bother obeying our proof.

In real life, you should expect much worse performance with much worse confidence.

I guess if you can prove with math tools that it works 95% of the time, in real life it works only 80% of the time.

That is because the synthetic mathematical rules and assumptions (like independent, Gaussian random variables) are so far from real life.

(And partly, another reason is the backtest bias, but the backtest bias makes the CAGR performance worse, not the confidence.)

So, one can ask: why bother at all with proving that the strategy works? We can prove it, but knowing how unusable the proof is, why should we care?

I attribute a quote to George Soros: “If it works, do it.” Don’t bother too much trying to prove it mathematically.

So, I guess it is better to spend our time on experiments and simulations than on trying to reason about and theoretically prove why a strategy works.

The conclusion of this way of thinking is that

**we can say with 99% confidence that the true population mean CAGR is between 31.32% and 38.18%. Since even the lower bound is a positive number, we are sure (99% sure) that our strategy has some alpha (profit edge), so we are confident to start this strategy in real life.**

**B.**

Let’s go another way. Instead of confidence intervals, **use hypothesis testing: **

Let’s form a **Null Hypothesis: H_0: assume gCAGR = 0%. (or gCAGR <= 0%)**

This means that our strategy is not better than random.

The **Alternative Hypothesis: H_A: gCAGR > 0%.** That our strategy has a genuine prediction power.

The question we try to answer is

**Assuming H_0 is true, what is the chance of having 34.75% as the sample mean? Could it occur by chance?**

If that chance is too low, we can reject the Null Hypothesis H_0 and accept the alternative hypothesis.

Formally:

Pr(X_avg > 34.75% | gCAGR= 0%) = ?

Let’s calculate the Z value of the statistic.

Assuming gCAGR = 0% for the population mean, the sample mean would have the same expected value.

Z_value = (sample mean – 0%) / (SSD / sqrt(13)) = (34.75% – 0%) / (1.14% / 3.6) = 109.7

Have you ever seen a Z score like that in your life?

**Pr(Z_value > 109.7 | gCAGR= 0%) = 0.**

Actually, I couldn’t find any software package that could calculate this number. They usually just say 0.

Beyond a Z score of 4 or 5, the chance is so minuscule that it is practically impossible.

We wanted a 1% significance level, but this probability is much smaller than 1% (it is virtually 0).

So, **there is virtually 0% chance that having true gCAGR= 0%, we observed these 13 backtest results.**

Therefore, **we reject the Null Hypothesis**, and we are happy that our strategy is genuinely profitable. (gCAGR > 0%).
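The test can be reproduced in a few lines of Python (normal approximation; `math.erfc` gives the one-sided tail probability, which indeed underflows to exactly 0 at such an extreme Z score, just as the software packages report):

```python
import math
import statistics

gcagr = [35.08, 33.63, 33.29, 37.00, 35.02, 35.68, 36.14,
         33.17, 33.94, 34.50, 34.04, 35.49, 34.75]  # 13 backtest CAGRs, in %

mean = statistics.mean(gcagr)
se = statistics.stdev(gcagr) / math.sqrt(len(gcagr))

z = (mean - 0.0) / se                   # H_0: true gCAGR = 0%
p = 0.5 * math.erfc(z / math.sqrt(2))   # one-sided P(Z > z)

print(z)   # around 110, matching the ~109.7 above
print(p)   # underflows to exactly 0.0 in double precision
```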

However, even if it is true, it only states that the true population mean (gCAGR) > 0%; it doesn’t prove anything about the future. Stock markets may close down. In that case: no profit. Armageddon: no profit. We contend something about the past only. Also, even if the past continues more or less in the same way in the future, we know nothing about what will happen next year.

Maybe the strategy will be profitable if we play it long enough (> 20 years); maybe that 34.75% profit is contributed mostly by a few unique years (like 2008).

As those years may not be repeated in the future, **we hardly know anything about the future potential of the strategy. We can only acquire information and derive conclusions about its past potential.**

This is partly why money management techniques have to be used with any strategy played in real life. Strategies proven successful in the past may stop working in the future. Techniques like ‘playing the equity curve’ will signal alerts to terminate a strategy when it has likely stopped working.

We reviewed 2 different ways of answering the question: Is this 34.75% a good enough number?

A. with confidence intervals, and

B. with hypothesis testing.

We prefer the confidence-interval way of thinking, because it expresses a concrete range with lower and upper bounds, while hypothesis testing gives only a binary answer: possible or not.


Parameters:

NensembleGroupMembers = 5;

int nTest = 5;

Note that we also ran the NensembleGroupMembers = 1 experiments (which took another week), but we don’t present them here in the blog. Those were inferior results (as expected).

We ran 11 different experiments with various ANN combinations; the results can be found in the next tables.

PV (portfolio value):

PvMAPEfrMul (PV Mean Average Percentage Error From Multiplicative Approximation, the smaller the better; the smoother the PV):

Note that the performance numbers in the next section are the average of 4 cells: the maxEpoch = 49, 99 and nNeurons = 1, 2 combos, because those are the parameters we would most likely play.

The strategies and — in a little chaotic manner — some performance numbers that led us to pick the winner. (Sorry, it is quite chaotic. Feel free to jump to the notes section.)

**A. only day and T-0 inputs (for clarity)**

– ANN(T-0) agree ANN(day), nEnsembleRepeat = 1 (PV:50.9, dStat:59.16%, maxDD:41.39%)*

– ANN(T-0) agree ANN(day), nEnsembleRepeat = 11 (PV:70.6, dStat:58.16%, maxDD:42.55%) (PV70: better than nEnsembleRepeat = 1 PV60; maxDD42.55% (worse than the repeat=1, but OK) the PV compensates; zeroFrq: 25.20%)

– ANN(T-0) agree ANN(T-0), nEnsembleRepeat = 1 (PV:427.5, dStat:57.02%, maxDD:50.88%)

– ANN(T-0) agree ANN(T-0) agree ANN(T-0), nEnsembleRepeat = 1 (PV:446.8, dStat:57.22%, maxDD:48.8%)

**B. T-1 input too**

– ANN(T-0) agree ANN(T-1) agree ANN(day), nEnsembleRepeat = 1 (PV:20.91, dStat:59.63%, maxDD:35.08%)**

– ANN(T-0) agree ANN(T-1) agree ANN(day), nEnsembleRepeat = 11 (PV:30.51, dStat:58.57%, maxDD:39.33%)

– ANN(T-0) agree ANN(T-0) agree ANN(T-1) agree ANN(day), nEnsembleRepeat = 1 (PV:23.6, dStat:60.14%, maxDD:32.79%)** (considering only the 1neuron case: PV: 29.22)

– ANN(T-0) agree ANN(T-0) agree ANN(T-1) agree ANN(day), nEnsembleRepeat = 11 (PV:30.92, dStat:58.91%, maxDD:40.68%)

**C. Can we improve our best 2D ANN?**

– ANN(T-0, T-1) with ANN(day), nEnsembleRepeat = 1 (PV:69.5, dStat:59.3%, maxDD:35.71%)** // Avg PvMAPEfrMul is good

– ANN(T-0, T-1) with ANN(T-0), nEnsembleRepeat = 1 (PV:591.2, dStat:57.95%, maxDD:41.01%)

– ANN(T-0, T-1) with ANN(T-1), nEnsembleRepeat = 1 (PV:53.25, dStat:57.07%, maxDD:38.29%)

**1.
Notes about nEnsembleRepeat = 11:**

– In general, nEnsembleRepeat = 11 (instead of 1) plays the market on more days. It reduces zeroFrq and dStat, but increases the PV and maxDD. Because it generally increases the maxDD, I would like to avoid it for this safe strategy. Currently we would like to obtain a very safe, smooth strategy with as little maxDD as possible. We don’t really care about PV; for that, we have other high-performing strategies.

– In our previous study, done in Matlab, we concluded that nEnsembleRepeat = 11 is helpful, based on a 4-year backtest.

At that time, we concluded that nEnsembleRepeat = 11 improves the performance; however, we hadn’t considered other effects, like maxDD. That conclusion is still valid in this 24-year backtest.

+ It improves the performance (PV), mostly because it is in the market much more. And because it is in the market more, and we have an edge of 60% dStat, after a while we can accumulate more profit.

+ However, it is detrimental to the maxDD and smoothness. Why? Mostly because of randomness. (See the excel table that compares the repeat=111 case for nNeurons 1 and 2: with 2 neurons, the 111 ensembles can give very random results; with 1 neuron, the randomness is acceptable. This randomness generates hectic trades and more volatility in the PV.)

So, in this safe strategy, we will omit using nEnsembleRepeat = 11.

– Probably, **with the All-Agree strategies, increasing nEnsembleRepeat to 11 or 111 is not a good idea, because among those 111 predictions, it can happen that most of them are ZERO, but many of them can be random, so the SUM(SIGN()) can be randomly positive or negative.**
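A hypothetical Python paraphrase of the aggregation idea (the real implementation is the C# `PlayOnlyIfAllGroupAgree_onGroupSumSign`; the exact semantics there may differ): each group votes with the sign of the sum of its members’ forecast signs, and we trade only if every group votes the same non-zero direction. It also shows how a pile of near-zero forecasts makes a group’s vote flip on noise.

```python
def group_vote(forecasts):
    """One group's vote: the sign of the sum of its members' forecast signs."""
    def sign(x):
        return (x > 0) - (x < 0)
    return sign(sum(sign(f) for f in forecasts))

def all_agree_position(groups):
    """Trade only if every group votes the same non-zero direction."""
    votes = [group_vote(g) for g in groups]
    if votes[0] != 0 and all(v == votes[0] for v in votes):
        return votes[0]   # +1 long, -1 short
    return 0              # otherwise stay in cash

# two groups of 5 member-forecasts each: unanimous -> go long
print(all_agree_position([[0.4, 0.1, 0.3, 0.2, 0.5],
                          [0.2, 0.6, 0.1, 0.3, 0.1]]))    # prints 1

# second group is mostly ZERO with one tiny random negative forecast:
# its SUM(SIGN()) vote flips on that noise, agreement breaks, we stay in cash
print(all_agree_position([[0.4, 0.1, 0.3, 0.2, 0.5],
                          [0.0, 0.0, 0.0, -0.01, 0.0]]))  # prints 0
```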

Also, what I found is that **nNeurons = 1 significantly decreased the randomness compared to nNeurons = 2.**

See the screenshot here:

Considering the screenshot, our conclusion is:

”

prediction for the endday 2011-05-27

Conclusion: 1 neuron significantly decreased randomness

Code:

public int maxEpoch = 49; // use 40-50

static int generalNensembleGroupMembers = 5;

public EnsembleGroupSetup[] ensembleGroups = new EnsembleGroupSetup[] {

new EnsembleGroupSetup() { Nneurons = 1, NNInputDesc = NNInputDesc.BarChange, BarChangeLookbackDaysInds= new int[] { 0, 1 }, NensembleGroupMembers = generalNensembleGroupMembers },

new EnsembleGroupSetup() { Nneurons = 1, NNInputDesc = NNInputDesc.WeekDays, NensembleGroupMembers = generalNensembleGroupMembers },

};

public EnsembleAggregationStrategy ensembleAggregation = EnsembleAggregationStrategy.PlayOnlyIfAllGroupAgree_onGroupSumSign;

public int nEnsembleRepeat = 111;

”

** 2.
Notes about selecting our safe strategy:**

– T-1 input is generally helpful in reducing maxDD. So, I would like to have it in the final (safe) strategy.

– weekDay input: we would also like to have this input in the strategy, because it gives an orthogonal opinion (not technical, but calendar factors) into our decision. And we like to aggregate different orthogonal opinions.

**– surprising to see such a good maxDD performance for ‘ANN(T-0) agree ANN(T-0) agree ANN(T-1) agree ANN(day), nEnsembleRepeat = 1’.
maxDD: 32.79%.
Why? Because it has the best dStat: 60.14%.
Why does it have the best dStat?
Because it is the combination of 4 ANNs. And it plays only if all 4 agree. So, it is on the market quite rarely. If we really want a safe strategy, pick this one.**

– Based on maxDD, we have 3 good candidates left. Compare them by PvMAPEfrMul:

– ANN(T-0) agree ANN(T-1) agree ANN(day), nEnsembleRepeat = 1 (PV:20.91, dStat:59.63%, maxDD:35.08%, PvMAPEfrMul: 65.74%)

– ANN(T-0) agree ANN(T-0) agree ANN(T-1) agree ANN(day), nEnsembleRepeat = 1 (PV:23.6, dStat:60.14%, maxDD:32.79%, PvMAPEfrMul: 57.72%)***

– ANN(T-0, T-1) with ANN(day), nEnsembleRepeat = 1 (PV:69.5, dStat:59.3%, maxDD:35.71%, PvMAPEfrMul: 57.4%)***

– It is very difficult to choose between the last two, but let’s compare the maxDD based only on the 1-neuron measurements (omitting the 2-neuron cases), because the 1-neuron case is what we would play, as it is the least affected by randomness.

– ANN(T-0) agree ANN(T-0) agree ANN(T-1) agree ANN(day), nEnsembleRepeat = 1 (PV:23.6, dStat:60.14%, maxDD:32.79%, PvMAPEfrMul: 57.72%, zeroFrq: about 30%, maxDD_1neuron:30.1%)****

– ANN(T-0, T-1) with ANN(day), nEnsembleRepeat = 1 (PV:69.5, dStat:59.3%, maxDD:35.71%, PvMAPEfrMul: 57.4%, zeroFrq: about 40%, maxDD_1neuron:34.01%)

**So, it seems the 4-ANN version is smoother, more stable. It is less time in the market, but even so, in the 1-neuron case it achieved PV: 29.22 (that is about 16% CAGR without any leverage).
We would like to play this one as our safe strategy. We expect about 30% maxDD from this strategy in the future, which is moderately good.**
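As a sanity check on that conversion (assuming the backtest spans roughly 23-24 years, as quoted elsewhere in the post):

```python
# convert a final portfolio value (per $1 invested) into CAGR
def cagr(final_pv, years):
    return final_pv ** (1.0 / years) - 1.0

print(f"{cagr(29.22, 23):.1%}")   # ~15.8%
print(f"{cagr(29.22, 24):.1%}")   # ~15.1%, roughly the 'about 16%' in the text
```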

– As visual proof, compare the PV charts of the 2 strategies:

**“ANN(T-0, T-1) with ANN(day), nEnsembleRepeat = 1”:**

**“ANN(T-0) agree ANN(T-0) agree ANN(T-1) agree ANN(day), nEnsembleRepeat = 1”:**

– Note the **amazing dStat: about 60.4% directional accuracy,** with a daily prediction! It means that we are significantly better than random (50%). That is our edge!

It means that for every 10 days we are in the market, we correctly predict 1 more day’s direction than a coin flip would. Isn’t that excellent? (**Other successful ANN applications are happy with 70% dStat on a monthly prediction. But note how much more profit you can accumulate with 60%-accurate daily rebalancing than with 70%-accurate monthly rebalancing.**)

With one caveat: 40% of the time we are in cash, because we are not confident enough about the prediction. But this is fine. 60% directional accuracy on a daily prediction is significant.

– **compare this strategy to the high performance strategy of the last post.
That was the ANN(T-0 %change, T-1 %change) with PV: $865 (investing $1 in 1987), that is 35% CAGR. MaxDD: 40%. PvMAPEfrMul: 130%.
Here, we developed a low performance strategy (CAGR 16%, maxDD: 30%, dStat: 60%, PvMAPEfrMul: 57.72%). The aim was stability here.**

As a conclusion: ‘play if all agree’ ensembling can decrease maxDD (by about 10 percentage points) and MAPE (good), but the cost is some performance (about 19 percentage points of CAGR).

However, we feel it is still good to play; the degraded performance is acceptable. It is a trade-off we accept.

We would like to play both of them concurrently. One would double our portfolio every 3 years; the other one would give us stability in bad times.

– somehow, we have the feeling that

‘ANN(T-0) agree ANN(T-0) agree ANN(T-1) agree ANN(day), nEnsembleRepeat = 1’ ensembling can be almost equivalent to

‘ANN(T-0) agree ANN(T-1) agree ANN(day), nEnsembleRepeat = 1’ when we double the ANN(T-0) NensembleGroupMembers.

If we doubled the ANN(T-0) NensembleGroupMembers, we could have achieved the same results. And it is a conceptually simpler strategy, so in the future we may compare the two and switch to the conceptually simpler one.

– Basically we got the same result as in the December 2010 study, in our previous post on All-Agree ensembling (in the 4-year backtest; a 24-year backtest is simply not possible in Matlab, as it takes months):

– the best is the pure currDayChange: 28% CAGR (with the T-1 input too)

– the 3-ANN version (or the 4-ANN version) (if all agree) could achieve 12% or 19% CAGR (or here, 16%)

– **the All-Agree combined ANN cannot be better in profit than the best ANN (than the best expert), but by sharing the decision among different ‘experts’, we can mitigate the risk (maxDD is 30% instead of 40%)**

– We can even mitigate very serious maxDDs with the All-Agree strategy: for example, the pure ANN(T-0) (without T-1 input) can give a 65% maxDD (nNeurons=2, ensemble5, repeat11 case); by combining it with another ANN via the All-Agree strategy, this 65% maxDD can be decreased significantly.


Our framework is best suited to backtesting the variation of 2 parameters. That suits the human visual system well, as it can spot ‘interesting’ areas mostly in 2D (as a 2D function projected onto the third dimension). We have chosen the number of neurons (nNeurons) and the maximum epoch runs (maxEpoch) in this test.

We present the performance measures PV (portfolio value), CAGR (annual growth), maxDD and PvMAPEfrMul (PV Mean Average Percentage Error From Multiplicative Approximation; the smaller, the better) here. We used a 200-day lookback period.

**ANN variations:
A. 1D inputs (weekDays), (T-0 %change), (T-1 %change)
B. 2D inputs: (weekDays, T-0 %change), (weekdays, T-1 %change), (T-0 %change, T-1 %change), (T-1 %change, T-0 %change)
C. 3D inputs: (weekDays, T-0 %change, T-1 %change)**

These are 8 ANN types.

Each measurement used the following redundancy (to decrease randomness):

nEnsembleMembers = 5

nEnsembleRepeat = 5;

nTest = 13;

It means that for each of the 8 tests, for each cell in the following excel tables, a PV chart was evaluated 13 times.

Considering that the excel table parameters (rows, columns) varied as

nNeurons: { 1, 2, 3, 4 };

maxEpoch: { 5, 13, 24, 49, 99 };

generating one excel table required the evaluation of 13*4*5 = 260 PV charts.

One portfolio chart is 24 years, that is 24*250 = 6000 days, and on every day, we train 5 * 5 = 25 ANNs, so overall we have to **train an ANN 260*6000*25= 39,000,000 times!**

No wonder, it** took about 8 hours to generate one excel table**.

**To run all the 8 experiments with different 1D, 2D, 3D inputs,** it took 8 times 8 hours = 64 hours. As the computer was running only 16 hours per day (not at night), **it took 4 days of constant running on a 4-core Intel i7 950 machine.**

Notes on backtesting method:

– Using an ANN alone (for only one day’s forecast) is not computation-intensive (1 second), but backtesting anything with varying parameters takes an awful lot of time.

– We used a lot of redundancy, evaluating a forecast 5*5*13 times, but we still couldn’t eliminate the randomness inherent in ANNs.

– At first sight, backtesting both the ANN(T-0 %change, T-1 %change) and ANN(T-1 %change, T-0 %change) versions appears to be a joke here; they should give equal results. Why did we backtest them? Because, again, we wanted to be absolutely sure that our implementation is correct. If they don’t produce the same result, there is a bug somewhere.

**PvMAPEfrMul** (PV Mean Average Percentage Error From Multiplicative Approximation, the smaller the better):

(for explanation: see the definition in one of the previous post)

As an illustration, we show one of the best portfolios here, generated by the 2D ANN(T-0, T-1) strategy.

Note that forecasts and portfolios are quite random.

So, don’t pay too much attention to this specific chart.


Notes:

– The **dayOfTheWeek input hasn’t got much forecasting power** (we expected more, so it is a little disappointing). However, note that even if **it is not an excellent strategy, it is not bad either. The dStat is about 54%,** which is much better than a random monkey predictor. It can also achieve a 49% maxDD, which is quite good, because the Buy&Hold strategy on the RUT had a 60% maxDD. So, even if its PV is the same as the 5.58 PV of B&H, we may opt to use this ANN, because it is less risky than B&H. So, we contend that the weekDay input has ‘some’ predictive power, but not enough to play it profitably as a standalone strategy. However, it can be a good addition to another profitable system.

– the ANN(T-0 %change, T-1 %change) and ANN(T-1 %change, T-0 %change) versions give similar results. It is consistent with our expectation.

**– 2D vs. 3D war:**

+ the ANN(weekDays, T-0 %change, T-1 %change) is a little bit better (really, only slightly, but it is not worse) than the ANN(weekDays, T-0 %change), so it seems that it was worth adding an additional input (T-1) to the 2 dimensional input. So, it seems that adding more inputs is better!!!

+ the ANN(weekDays, T-0 %change, T-1 %change) is worse than the ANN(T-0 %change, T-1 %change), so it seems that adding another additional input (weekDay) was not a good idea. So, it seems that adding more inputs is not better!!! (Note the contradiction with the previous point.)

+ it is interesting to see that **the 3D input version is not the winner. So, it is not always true that adding more inputs is better.**

– **1D vs. 2D war:** with **the T-0 input as a base, it seems that adding T-1 as an additional input is a good idea, but adding weekDays as an additional input is a bad idea** (it decreases the result). That is very interesting and difficult to explain. It seems that inputs form 'natural groups'. There are inputs (like T-0, T-1) that are connected (by nature); they strengthen the prediction (one depends on the other). However, there are inputs that are independent (like weekDay, T-0). Suppose you, as a trader, know that yesterday was a Down day. What helps you more?

+ if you additionally know that the day before yesterday was also a huge down day, you may think that the market is oversold, so you can forecast that tomorrow will be an Up bounce. The two inputs are dependent. Even if T-1, per se, has no predictive power, combining it with a related input makes it worth including. Yes, it is probably true that after two Down days, an Up day is more probable (because markets are not totally, independently random; days are not independent).

+ if you additionally know that today is Friday, then what? Your forecast is not much better, because the new input (weekDay) is independent of T-0, and unfortunately this independent input has no real predictive power.

– Based on this, we propose that we should **group inputs into dependent groups**. Probably the best way would be to use an ensemble of 2 ANNs: one uses the 2D(T-0, T-1) input, the other the 1D(weekDay) input. The ensemble forecast would be a combination of these 2 separate ANNs. This would add some stability. We will test this idea in the next post.

– We highlight our **best strategy so far: the ANN(T-0 %change, T-1 %change).**

**PV: $865 (investing $1 in 1987), that is 35% CAGR.** Imagine using a double or triple ETF or another kind of leverage.

For example,** playing an Ultra ETF (double leverage), plus playing the equity curve with 50% additional leverage, we may expect 3*35=105% CAGR.**

Even if you consider friction costs (commission, bid-ask spread), almost 100% CAGR is possible.

The directional accuracy is 56.51%. Remarkable. However, note the risk: in the backtest, we show a 35% maxDD, but looking at the maxDD table, it is more realistic to expect at least 40% maxDD. Using Ultra ETFs (double leverage), that maxDD becomes 1-0.6*0.6 = 64%. Note that playing the equity curve will not increase the maxDD, only the return.
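The leverage arithmetic above can be sketched as follows (a Python illustration of the numbers quoted in the text, not part of the original backtest code):

```python
# Sketch of the leverage arithmetic above; all numbers come from the text.
unleveraged_cagr = 0.35    # 35% CAGR of the ANN(T-0, T-1) strategy
unleveraged_maxdd = 0.40   # the more realistic 40% maxDD estimate

total_leverage = 2 * 1.5   # Ultra ETF (2x) plus 50% extra equity-curve leverage
approx_cagr = total_leverage * unleveraged_cagr  # crude linear scaling: 3 * 35%

# With 2x leverage every 1% loss becomes ~2%, so a 40% drawdown compounds to:
leveraged_maxdd = 1 - (1 - unleveraged_maxdd) ** 2

print(round(approx_cagr, 2))      # ~1.05, i.e. ~105% CAGR
print(round(leveraged_maxdd, 2))  # 0.64, i.e. 64% maxDD
```

The CAGR scaling is only a rough first-order estimate; compounding leveraged daily returns would deviate from it.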

**Can we bear the pain of 64% maxDD? If we have 105% CAGR, the answer is: yes.
Compare it to the RUT Buy&Hold that has 5.5% CAGR and 60% maxDD during that period.**

– nNeurons parameter: **it is a general trend (in almost all 8 backtests) that using fewer neurons is best.** The best performers are the nNeurons = 1, 2 cases. The 1 neuron case was almost always better than the 2 neuron case (which is strange; we expected that for high dimensional inputs (2D, 3D), the 2 neuron case would be the winner). The 4 neuron case was always the worst. Expected. It tends to overfit.

– maxEpoch parameter: it was **weird to see that increasing the maxEpoch towards infinity in general makes the prediction unstable.** We can't really explain that. In general, after more training, the NN should perform better. Considering the 3D input case, the 99 maxEpoch result is almost always worse than the 13 maxEpoch case. Hardly explainable. Overfitting again, but why is overfitting the problem?

It suggests that we shouldn't increase the maxEpoch.** If we want better prediction (with the same inputs), we should decrease randomness not by over-training, but by redundancy (putting more members into the ensemble).**

'Luckily', in our best performing ANN, the ANN(T-0 %change, T-1 %change), in the nNeurons 1, 2 cases, the maxEpoch 99 result is about the same as (a little better than) the maxEpoch 49 result.

– best parameters based on these backtests: we would use 2 neurons and 49 as maxEpoch (albeit the 1 neuron case is very slightly better).

**Conclusion:
The test shows us that combining inputs sometimes improves the result, sometimes not. It is difficult to know in advance, but we proposed a theory of dependent/independent inputs. We found that adding the T-1 %change to the T-0 %change input (the 2D input case) gives the best performance (unleveraged: 35% CAGR, leveraged: 100% CAGR).**


Let's suppose a simple case: in the last 200 days, the average nextDay%Change of both the Up and the Down days was positive. For example, because **we are in a generally bullish period.** That can easily happen. Just take a look at how long the RUT/SPX can stay above the SMA(200). How will the deterministic 2bins Naive Learner decide? It should vote positively both for Up and for Down days. There is no daily MR, no daily FT.

In that case, it is very likely (not assured, because the NN is not a 2bins forecaster, but continuous) that the **NN will cast an Up vote irrespective of the input, the currDayChange. Is the success (30% CAGR) of the NN only because it correctly says that it is a generally bullish market?
The answer:
No.**

We prove here that if we cast a general Up vote for the next day, irrespective of the input currDayChange, we will have no phenomenal results.

To prove it, instead of using the 2bins Naive Learner, or the 4bins NL that we used earlier, **we use a 1bin Naive Learner.** (What a great idea! How could we miss testing it earlier?) This aggregates all the outputs of the previous 200 days simply into one average number.

The sign of that average will determine the next day forecast.
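As a sketch, the 1bin and 2bins Naive Learners described here might look like this in Python (the function names and the list-based interface are our own illustration; the original implementation is C# and is not shown in this post):

```python
def naive_learner_1bin(next_day_changes):
    """1bin: aggregate all next-day %changes of the lookback window into one
    average; the sign of that average is tomorrow's forecast."""
    avg = sum(next_day_changes) / len(next_day_changes)
    return 1 if avg > 0 else -1

def naive_learner_2bins(today_changes, next_day_changes, today_change):
    """2bins: average the next-day %changes separately after Up and after Down
    days, then forecast with the average of the bin matching today's sign."""
    up_bin = [n for t, n in zip(today_changes, next_day_changes) if t > 0]
    down_bin = [n for t, n in zip(today_changes, next_day_changes) if t <= 0]
    chosen = up_bin if today_change > 0 else down_bin
    avg = sum(chosen) / len(chosen)
    return 1 if avg > 0 else -1
```

In a generally bullish period both bins can average positive, so the 2bins learner votes Up after both Up and Down days, exactly like the 1bin learner.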

Let's see the PV, CAGR and dStat values comparing the

– Buy&Hold,

– daily Mean Reversion

– daily Follow Through

– 1bin Naive Learner

– 2bins Naive Learner

– 4bins Naive Learner

– Stochastic Neural Network:

Conclusion:

As you can see, the **adaptive Naive Learner gained almost nothing (5% CAGR) if it only considered the average nextDayChange during those 200 days (the 1bin case).
However, it has genuine prediction power (27% CAGR) if it distinguishes current Up days from current Down days (the 2bins case).**

So, we can contend that the currDayChange is a very good input.

**The goodness of the currDayChange input really comes from its nature (either MR or FT days), and not from a general bullish/bearish trend in the previous 200 days.**

In general, we expect that each of our inputs (per se) has only weak correlation to the output, and that the aggregation of many weak-prediction-power inputs will give us an edge in the form of the 'input-combined' Neural Network.


In a practical case, we use 200 samples in the training set. The arithmetic mean is always around zero. Let the SD of those 200 samples be 1.3%. Using thresholdMultiplier = 2.0, we will clip all the values below -2.6% to -2.6% and all the values over 2.6% to 2.6% (supposing the mean is zero).
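A minimal sketch of this input clipping (a Python illustration of our own; the production code is C#):

```python
def clip_outliers(samples, threshold_multiplier):
    """Clip every sample into [mean - k*SD, mean + k*SD].
    With mean ~ 0, SD = 1.3% and k = 2.0, values are clipped at +/-2.6%."""
    n = len(samples)
    mean = sum(samples) / n
    sd = (sum((x - mean) ** 2 for x in samples) / n) ** 0.5
    lo, hi = mean - threshold_multiplier * sd, mean + threshold_multiplier * sd
    return [min(max(x, lo), hi) for x in samples]
```

With thresholdMultiplier = Double.PositiveInfinity (the winning setting by dStat, as the tables below show), the samples pass through unchanged.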

In our backtest, we tested the following thresholdMultiplier values: 0.5, 1, 2, 3, Double.PositiveInfinity.

Note that **we use this outlier clipping only for the input. For the output, we have always used outlier elimination with a fixed 4% threshold.**

The **vertical columns represent the number of neurons** used. We don't intend to give too much importance to that dependence, so just **ignore it. The focus is on the clipping multiplier.**

ensembleMembers = 5;

nTestsPerCell = 5;

We would ignore the PV table, because it carries too much randomness (the daily %change variation on the next day).

We would regard the dStat as the measurement of choice in this test.

**Based on dStat,
it seems that Double.PositiveInfinity is generally the winner.** Even when it is not the winner, it is not far from it.

(We may later test the same clipping for the output. For the output, it definitely works in some cases: the proof is that output outlier elimination is essential and it works.)


In some sense it measures the risk.


**Methods for measuring risk:**

1.

Usually, a highly variable PV has a high **SD (standard deviation)**. That is one measure of the inconsistent profit.

**The drawbacks of the SD** as a measure of risk:

– **it measures volatility around the Mean. Blah. That mean is static.** In a 23 year backtest, measuring the variance around a static mean is silly. The mean should move (as with an SMA or EMA).

– **it is an absolute value, not a percentage of the PV. So, a strategy that runs for 20 years naturally achieves higher PVs at the end than a strategy backtested for only 1 year.**

This high PV at the end means that the SD will be high in every case.

– Symmetric around the mean. The SD considers the PV above the average equally bad as under the average. (Usually investors are happy if they are above the mean, and very unhappy under the mean curve.)

2.

Another widely used risk measure is the **maximum drawdown.**

It measures risk only under the mean; in theory it measures only the bad risk, not the good risk (the good risk is risk to the upside). That is good.

**It measures the deviation as a percentage** (not the value of the underlying). That is good too.

However some drawbacks:

– it says nothing about how long the maximum drawdown lasted. We call the number of days spent inside the MaxDD valley the Maximum Drawdown Suffering days. It matters greatly whether the suffering days number 10 days or 10 years.

– also, what about the second, third, etc. largest DD? What if the MaxDD was -50% and its Suffering Days were 10 days, but the second largest DD was -45% and its suffering days were 10 years? The MaxDD doesn't tell us about this behaviour along the PV curve.
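Both quantities are easy to compute from the PV series; a sketch (our own Python helper, with 'suffering days' measured from the peak down to the trough of the deepest valley):

```python
def max_dd_and_suffering_days(pv):
    """Return (maxDD as a fraction, suffering days) of a PV series, where the
    suffering days count the days from the peak to the MaxDD trough."""
    peak, peak_day = pv[0], 0
    max_dd, suffering = 0.0, 0
    for day, value in enumerate(pv):
        if value > peak:
            peak, peak_day = value, day      # new all-time high: reset the peak
        dd = 1 - value / peak                # drawdown from the last peak
        if dd > max_dd:
            max_dd, suffering = dd, day - peak_day
    return max_dd, suffering
```

Reporting the second and third largest drawdowns with their own suffering days would address the second drawback, but even this single pair is more informative than the bare MaxDD number.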

3.

We need something new.

First we realise, we need to work on percentages.

We suggest using 2 measures here:

– The average percentage deviation from the linear approximation, and

– The average percentage deviation from the multiplicative approximation.

The MAPE stands for Mean Absolute Percentage Error: see wiki here.

**The Linear PV approximation should be a simple linear function connecting the first point to the last point. **

(note: **it is not the linear regression line. It could be**, but this linear PV is much simpler to compute)

However, there is one caveat here:

LinearApprox is not a perfect choice in long term backtests. In a 23 year backtest, the daily 'deltaInc' (see source code) is $0.05. That is 5% daily profit on the first day, but the same $0.05 is only 0.0001% profit on the last day. The problem is that it is an additive profit model.

So, let's **introduce a multiplicative approximation.** It **calculates the geometric mean of the daily profits, and uses that to estimate the smooth PV
that could have been generated by our strategy if it had been perfectly consistent, generating the same profit% every day for 23 years.**

Here is the code of the two error metrics:
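The original metric code appears as an image in the post; a Python sketch of the two metrics as defined above (the function names are ours):

```python
def mape_from_linear_approx(pv):
    """MAPE of the PV from the straight line connecting the first point to the
    last point (not a regression line): additive daily increment 'deltaInc'."""
    n = len(pv)
    delta_inc = (pv[-1] - pv[0]) / (n - 1)
    approx = [pv[0] + delta_inc * i for i in range(n)]
    return sum(abs(p - a) / a for p, a in zip(pv, approx)) / n

def mape_from_mul_approx(pv):
    """MAPE of the PV from the smooth curve generated by compounding the
    geometric-mean daily profit every day."""
    n = len(pv)
    daily_growth = (pv[-1] / pv[0]) ** (1.0 / (n - 1))  # geometric mean gain
    approx = [pv[0] * daily_growth ** i for i in range(n)]
    return sum(abs(p - a) / a for p, a in zip(pv, approx)) / n
```

A perfectly compounding PV (e.g. doubling every day) scores ~0 on the multiplicative metric but a large error on the linear one, which is exactly the point of the change.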

and here is how the linear and multiplicative approximations look (on a linear scaled chart!!!)

Note that it would look very different on a log scaled chart. **On a log chart, the multiplicative approximation (green) would look linear.**

It is surprising to see that the best, **most consistent approximation of the PV line is not the linear one, but the multiplicative one.**

In the future, we would like to **use the MAPE_from_MulAppr. metric to express the riskiness (the volatility) of the strategy**‘s return.


When we test these artificially created patterns, there is no point in using the gCAGR as a performance measure (as it can go to infinity in the good case), so we use the D_stat, the directional accuracy, to check whether the ANN learns the pattern or not. (We could use the RMS error too, but let's stick to the dStat in this post. We despise the RMS anyway.)

Just for comparison, predicting the RUT (Russell 2000) index, we achieved 56% directional accuracy.


We know that, by its nature, a **completely random time series cannot be predicted.**

From **logic, we use 'Modus tollens'** to express why we do this experiment. (**We have a falsifiable case; let's try to prove it wrong.**)

**If "A->B" is a true rule, then "^B->^A" is also a true rule.**

Let’s fill A, B with concrete statements:

"if (Encog NN works properly) -> (it cannot predict the random time series)". (That is the statement A->B.)

so, it induces that: (“^B->^A”)

"if not(it cannot predict the random time series) -> not(Encog NN works properly)."

Not(not(f)) equals f.

So, **“if it can predict the random time series -> Encog NN doesn’t work properly.”**

That is why it is so important to check this: whether it can predict the random time series or not.

If it can, that would be evidence that there is a bug somewhere in our implementation or in the Encog framework.

Note also that **even if we find that it cannot predict the random time series, that is not a proof that it works properly.
Proving that it works properly is impossible. Only proving that it doesn't work properly is possible.**
Proving that it works properly is impossible. Only proving that it doesn’t work properly is possible.**

And we will try to prove this in this experiment. (Wiki 'falsification' if you wish.)

This chart shows all our measured directional accuracy (dStat). One cell represents 5 tests average. (Click for crisp image)

On the vertical axis, we increased the number of NN neurons from 1 to 5.

As we expected, we see that in general, increasing the number of neurons increases the prediction accuracy.

**0. RUT prediction, nNeurons dependence.**

Apart from the dStat, we measured the Portfolio Value too. PV conveys no meaning for the artificial patterns, but it is worth looking at it in the RUT index prediction case.

At first sight,** it seems that using only 1 neuron is better for the PV; however, note that the STDev for the 1 neuron case is the highest of all.**

So, this PV = 586 (composed of the average of 5 experiments) contains an outlier, and in practice the situation is not as rosy.

**The same can be seen on the RUT dStat SD chart (we don't show it here). Therefore, on second look, we doubt that the nNeurons=1 case would be the one we would use.
Let's say that this war is not yet decided.** We will repeat this trial&error experiment later, but for now, let's suppose we use the nNeurons=2 case in the future for RUT prediction,

even though nNeurons=1 seems to be better (but we reckon that it is again due to randomness).

**1. Artificial patterns: random**

Let's try random artificial input patterns: Gaussian and uniform distribution.

**– uniform(-1%..1%)
– Gaussian(0% mean, 1%STD)**

We are happy to see that the **ANN couldn’t predict this random pattern.** So, we cannot prove that the ANN doesn’t work properly.

(However, we didn’t prove either that the ANN works properly.)

**2. Artificial patterns: deterministic**

We need some patterns that are deterministic.

We picked 3 different versions:

**-2 period: 1%, 2%, -1%, -2%, (repeat this pattern)**

(The '2 period' here means that, in theory, an ANN that can look back to the last 1 period could synthesize the rules. So, these are 2 period patterns: F(t-1) always determines F(t).)

The pattern the ANN has to learn is this. The ideal function is the green one,

however it is evident that with 1 neuron it is impossible to learn the ideal function.

The question is what kind of ANN surface we got after training this pattern.

Here we show one example each for the nNeurons = 1, 2, 3, 4, 5 cases consecutively.

We observe that **with 1 neuron the surface is quite linear.** This is **also a suggestion that, even though the RUT prediction case gave a better result with 1 neuron than with more,
we shouldn't use the 1 neuron ANN in real life. It is just too simple to have much predictive power.**


**-3 period (with 2 confusions): 1%, 0.5%, 1%, -1%, -0.5%, -1% (repeat this pattern)** // 2 confusions in it: for -1, +1

**Note that this is an impossible task for this kind of ANN.** We have a 6 day pattern that repeats, and we feed the ANN only the previous day's change%.

However, without the confusions we placed into it, the ANN would be able to learn it completely, with 100% accuracy.

**But even in this impossible-to-solve case, the ANN fares reasonably well.**

This is similar to how humans operate in the world. We make predictions based on only part of the information that would be necessary to make a correct forecast.

We are humans; we cannot have all the information that affects a dynamic system.

**-3 period (with 3 confusions): 1%, 0%, 1%, -1%, 0%, -1% (repeat this pattern)** // 3 confusions in it: for -1, 1, 0

It is the same as the previous, but we complicated it even more with **another extra confusion.**

Interestingly, **the ANN could learn this one better than the 2 confusions case.
That is understandable if you compare the green lines, the ideal functions. In this case, the green line is smoother and simpler than in the 2 confusions case.**

So, we learnt that the ANN is more successful if the function to estimate is simpler.

Unfortunately, in real life that is rarely the case. Financial markets have complex relationships and complex dynamics, very far from simple.

Note that if the ANN trains a surface that is positive at 0, it will predict +1%, but it will fail on -1%.

– the 2 confusions and 3 confusions D_stat difference is exactly 1 out of 6 different values (75-58.3=16.6, which is 1/6 of 100). That is understandable: because of the extra confusion, we miss another value.

It is also interesting that 75% = 4.5 * (1/6 of 100), so the 3 confusions case misses 1.5 out of 6 times.

**3. Artificial patterns: semi-random**

Random, but predictable

this is the code:

This pattern has 10 period length, in which there are 2 period patterns.

It repeats the next pattern: **1%, 2%, 3%, 4%, 5%, -1%, -2%, -3%, -4%, -5% (repeat)**

**At each step a uniformly generated random number is added to this deterministic pattern.
We generate the randomness with various extents: 0%, 0.5%, 1%, 2%, 4%.**
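The generator code is shown as an image in the post; a Python sketch of the pattern as described (the names are ours):

```python
import random

BASE_PATTERN = [1, 2, 3, 4, 5, -1, -2, -3, -4, -5]  # %changes, repeated forever

def make_semi_random_series(n, noise_pct, seed=None):
    """Repeat the 10-period deterministic pattern and add uniform noise of the
    given extent (0%, 0.5%, 1%, 2% or 4% in the experiment) at each step."""
    rng = random.Random(seed)
    return [BASE_PATTERN[i % 10] + rng.uniform(-noise_pct, noise_pct)
            for i in range(n)]
```

With noise_pct = 0 the series is fully deterministic (the SemiRan_r0 column); with noise_pct = 4 the +/-4% noise swamps most of the small pattern steps, which is why predictability collapses.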

Note in the dStat table that when we increase the randomness to 4%, the series becomes hardly predictable. That is expected too.

The dStat table shows encouraging results: **as we increase the randomness, the prediction accuracy diminishes. Well done!**

Discussions:

– overall we like the outcome of the experiment: **the ANN couldn’t predict completely random time series**, but

– **it could predict the deterministic and semi-deterministic patterns.**

**– as we increased the number of neurons, the ANN could predict the deterministic patterns better (see the SemiRan_r0 column), but
– as we increased the number of neurons, it predicted the randomized versions worse (the SemiRan_r4 and RUT columns).**
Therefore,

– even if it is a deterministic pattern that we want to train our system on, and humans in theory could find a deterministic rule, a 100% sure and accurate algorithm (D_stat=100%), the NN approach is still better and advised. Because it is like human thinking: it is stochastic and non-deterministic, as life is.

When you are an amateur chess player or amateur trader, you like rules; you like a 100% deterministic algorithm.

**The randomness that we experience with the ANN is also present in the human mind.** Sometimes we miss an important piece (a random piece) on the chessboard when we match the input pattern. If the grandmaster chess player runs another experiment with the same input, he may decide differently. Maybe he (by pure chance) remembers something about a previous event in his memory.

The human chess player runs different random experiments concurrently in his mind, in the background. One of the processors (threads) wins and gains the focus of the conscious mind. That solution will be selected by the grandmaster, but when there are many parallel PUs (Processing Units) running in the background, it is quite random which one will be selected. (This is the ensembling mechanism of the ANN.)

**Most of the time, the global minimum (best solution) is not found, but the solution that is found is good enough that the grandmaster wins in the long term. Human experts work exactly like parallel ANN processors with an aggregating, ensembling mechanism that picks one solution from the candidate function approximators.**

So **we shouldn't worry about the fact that the ANN is non-deterministic** and random, or that the aggregation of the ensembles poses a problem. The human mind does the same. In the long term, if enough experience is learned, all candidate PUs will cast reasonably good estimates.


We also show in the chart the Buy&Hold and the Naive Learners (both the 2 bins and 4 bins version) as we introduced them earlier.

For this experiment we set the

OutputBoost = 1;

maxEpoch=49;

We are also curious how much changing these parameters weakens our previous good results.

We used the following **aggregation strategies:**

**1. Return the most frequent sign:**

forecasts.Select(r => Math.Sign(r)).Sum();

Note that in our previous Matlab study on aggregation, we made a note that:

“//SumSign() has lowest STD than Avg() (we measured it)”

Therefore, we used the Sum(Sign) in all our previous experiments.

We prefer the number of ensemble members to be odd rather than even, to avoid the case when the positive forecasts equal the negative forecasts and cancel each other out.

**2. The average:**

forecasts.Average();

The obvious aggregation is the average. **Its disadvantage is that very badly trained members can forecast outliers, very high values. Only one such outlier is enough to completely distort the aggregated decision of the ensemble.** **For example, we have 21 members; 20 forecasts say +1%, and 1 forecast says -21%. The average is negative.**

Obviously, we should select the positive direction, but averaging the forecasts will give a negative prediction of the ensemble.

No wonder **we don't expect this aggregation method to be the winner.** But let's test it.

**3. The best trained ANN:**

The idea is that after training 21 NNs, **we keep the one with the smallest training error as a forecaster:**

forecasts[trainErrors.IndexOf(trainErrors.Min())];

The smallest training error means that **this NN gave the smallest MSE on the training set (the 200 samples minus the excluded outliers)**. Note that **MSE says nothing about profitability or drawdown**, so another potential method would be to keep the ANN with the highest profit or the lowest DD on the training set.

However, we haven’t implemented this strategy yet.
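The three aggregation strategies can be sketched in a few lines (our own Python illustration of the C# one-liners above):

```python
def most_frequent_sign(forecasts):
    """Strategy 1: sum the signs; the sign of the sum is the majority vote
    (use an odd number of members so the vote cannot tie)."""
    s = sum((f > 0) - (f < 0) for f in forecasts)
    return 1 if s > 0 else -1 if s < 0 else 0

def average_forecast(forecasts):
    """Strategy 2: plain average; a single badly trained outlier member can
    distort the whole ensemble decision."""
    return sum(forecasts) / len(forecasts)

def best_trained(forecasts, train_errors):
    """Strategy 3: keep only the forecast of the member with the smallest
    training error (MSE on the training set)."""
    return forecasts[train_errors.index(min(train_errors))]
```

Replaying the outlier example above (20 members forecasting +1% and one forecasting -21%): the majority-sign vote is Up, while the average is negative.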

For each cell in the next tables, we ran 5 tests and averaged that performance metric. **For each ensemble strategy we ran 3 tests, so we have 3 cells with the same strategy, to check consistency.**

**A. The target function is the next day %change**

The performance if nEnsembleMembers = 5:

Portfolio Value and directional accuracy after 23 years:

The performance if nEnsembleMembers = 21:

Portfolio Value and directional accuracy after 23 years:

**B. The target function is the next day %change direction** (-1, +1, but normalized by STD normalization)

We were obliged to test this, *because our ensemble aggregations treat the forecasted output as a sign only*. So, we thought to **test what happens when the target output is the sign only too.**

The performance if nEnsembleMembers = 5:

Portfolio Value and directional accuracy after 23 years:

The performance if nEnsembleMembers = 21:

Portfolio Value and directional accuracy after 23 years:

Notes:

– the Naive Learners are deterministic.

Conclusions:

–** the 4 bins Naive Learner is superior to the 2 bins version. Expected.**

– **some stochastic NN versions can beat the deterministic 4 bins Naive Learner.** At first, it is good to see that this happens. **Without this result, we might contend that we had better not complicate the forecast by training a complex NN,** but should only use the deterministic NL. However, note that even if the NL gave better performance than the NN in this case, we would still choose the NN. The reason is that in this simple case, **with only a 1 dimensional input and a non-sparse input space, it is easy to create a deterministic NL forecaster. However, increase the input dimension to 10. Try to create a 4 bins NL for those 10 dimensions. Each dimension is divided into 4 ranges, so for 10 dimensions the number of hypercubes is 4^10 = 1,048,576. And consider that we have only 200 samples in the training set. 99.9% of the NL statistics matrix would have no sampled value. What do you think about the prediction power of that discretization?**

How would you predict for a testInput that belongs to a hypercube with no training sample at all? Choose the neighbours? But which neighbours? It is easy to see that we are stuck with this approach; it is too local.
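The curse-of-dimensionality arithmetic from the paragraph above, spelled out:

```python
# Sparsity of a 4-bins Naive Learner statistics matrix in 10 dimensions.
n_bins, n_dims, n_samples = 4, 10, 200

n_hypercubes = n_bins ** n_dims                  # 4^10 = 1,048,576 cells
# Even if every one of the 200 samples fell into its own cell,
# more than 99.9% of the cells would stay empty:
empty_fraction = 1 - n_samples / n_hypercubes

print(n_hypercubes)     # 1048576
print(empty_fraction)   # ~0.9998
```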

**The good thing about NN is that it can operate even in this complex (sparse and 10 dimensional) case.**

– following **the previous post, we made the NN learner less fine-tuned by changing the outputBoost from 30 to 1 and the maxEpoch from 20 to 49; but even with these less effective (non fine-tuned, non parameter-overfitted) settings, the NN is better than the deterministic NL (both the 2 and 4 bins versions).**

– **we had high hopes for the 'best trained ANN' ensemble strategy, but it didn't work fabulously. In fact, it shows the worst performance.**

–** ‘the most frequent sign’ aggregation is still the winner.** This is what we have used so far. We keep it.

–** it is preferable to learn targets as the %change and not as sign(%change). We give the NN more information that it can use. In this case, more information for the NN means better prediction.**


**1. Initialize weights to constant zero.**

We had an idea that, **to eliminate the stochastic** nature of the NN, we would initialize the weight and bias matrices to deterministic values. The obvious choice would be to initialize all of them to zero. In Encog, the network.Reset() function sets the initial weights randomly.

However, if the initial weights are left at zero (network.Reset() is not used), the **NN will still have zero weights after the Resilient or the Backprop training is finished. Oops!!** Only the last bias weight will be non-zero after the training. This means the output will be the same regardless of the input. That is bad, meaningless training. This is logical: with all-zero weights the NN surface is constant, the gradient of that function is zero, so the training cannot move in any direction.
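A minimal toy network reproduces this observation (our own Python sketch, not Encog: 1 input, 2 tanh hidden neurons, 1 linear output, squared error; only the output-bias gradient is non-zero when every weight starts at zero):

```python
import math

x, target = 0.5, 1.0
w1, b1 = [0.0, 0.0], [0.0, 0.0]    # input -> hidden weights and biases
w2, b2 = [0.0, 0.0], 0.0           # hidden -> output weights and bias

h = [math.tanh(w1[i] * x + b1[i]) for i in range(2)]   # hidden outputs: all zero
y = sum(w2[i] * h[i] for i in range(2)) + b2           # network output: zero
err = y - target                                       # -1.0

grad_b2 = err                                          # non-zero: only this learns
grad_w2 = [err * h[i] for i in range(2)]               # zero, because h is zero
grad_b1 = [err * w2[i] * (1 - h[i] ** 2) for i in range(2)]  # zero: w2 is zero
grad_w1 = [g * x for g in grad_b1]                     # zero as well
```

Every gradient except grad_b2 stays zero on every iteration, so training can only ever move the output bias, matching the behaviour described above.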

– **We may initialize the weights to some other constant (non-zero) non-random value**, like 1 (or -1?), but it is easy to see that **this would be arbitrary and would induce bias** into the whole process. (The result would depend on that initially chosen value.) We prefer unbiased solutions.

– **It would be another parameter** of the training and we don’t want to add another parameter. Keep it simple: Occam’s razor. Another parameter would only complicate the process.

– Not to mention that starting the search in the weight space from a single deterministic point hugely increases the chance that we won't find the global minimum, only a local minimum, during the optimization process.

This suggests that, honestly, **we cannot eliminate this source of randomness.**

**2. maxEpoch dependence.**

**So far this year, we have used maxEpoch=20** as a parameter. This value was determined by trial & error. We ran a couple of backtests and thought it was a good value. (See previous posts.) Here we study how the NN surface changes as we increase the maxEpoch.

We use the first backtest day's forecast (that is day 201 in this 23 years of data, because we use a 200 day lookback) for plotting the charts.

Averaging the outputs of the test set on day 201:

– when the input is under 0: -9.81

– when the input is over 0: 8.54

This forms the target function the NN wish to approximate.

Note that these output values are the boosted outputs, so it is futile to try to interpret them as daily percent changes.

Based on these targets, we expect **the target function T(x)** to be:

T(x) < 0, if x < 0; T(x) > 0, if x > 0

That is, T is a monotonically increasing function.

**Let's see the NN function, the NN surface that tries to approximate this.** In the next charts, the X axis range spans from -10% to +10% as the current daily %change. At least, it is equivalent to this range, but the normalization distorts the X values. (Just ignore that.)

**MaxEpoch = 10:**

**MaxEpoch = 500:**

Note that the **MaxEpoch=10 plot is not as smooth as the other one. It seems that the training process hasn't converged yet.** Also note the output range distribution. The smoother, more converged NN fills the Y range (from -2 to +2) more evenly. We like the maxEpoch=500 case better. It suggests that the NN has finally converged.

Let’s see some performance measurements as a function of the maxEpoch.

Portfolio Value$ after 23 years:

The PV shows that the optimum is maxEpoch=21; however, we think it is only the winner because of some randomness (overfitting).

Directional Accuracy%:

Take a look at the D_stat chart. In the 9-10-11 cases and in the 14-15-16 cases, it is very unstable. We don't like that. It means the network is far from being converged. The 19-20-21 cases seem acceptable, not too volatile, but to be sure, we prefer even greater values for maxEpoch.

Avg Training Error (in % as defined by Encog):

For us, the answer is given by the Training Error chart. If we had ample computational resources, we would train until at least 100 epochs; that is a good trade-off. Training even further, until 500 epochs, gives less error, but the training time is multiplied by 5. Taking into account that increasing the maxEpoch from 50 to 100 requires double the computational effort, we **suggest using the maxEpoch=50 or maxEpoch=40 case in the future.**

**3. Observe 40 different trainings.**

Another proof against using only maxEpoch=20 (and for preferring a greater value).

Take the first forecasting day (day 201).

To repeat ourselves:

‘We use the first backtest day forecast (that is the day 201, because we use 200 days lookback) for plotting charts.

Averaging the outputs of the test set:

– when the input is under 0: -9.81

– when the input is over 0: 8.54

‘

**On day 201, the current daily %change is negative (that is the test input), so T(x) should be negative, as x is negative. Therefore, we expect the NN to give a negative output.**

**We ran 40 different trainings and measured the forecast value of the NN, both with maxEpoch=20 and with maxEpoch=2000.**

**29 out of 40 are negative (in the maxEpoch=20 case).
40 out of 40 are negative (in the converged maxEpoch=2000 case).**

This also suggests that the maxEpoch=20 trained network has not yet converged.

**So, as a conclusion: in spite of the fact that the maxEpoch=20 case produces the greatest profit, our reasons to use a higher, more converged value are:
– forecast instability (in these 40 different random trainings)
– the training error is high and unstable (see chart)
– the D_stat is unstable (see chart)**

Note that increasing the maxEpoch is an O(n) operation, so we cannot increase it too much. We would love to increase maxEpoch to 20,000 instead of 20, but that would mean 1000 times the computational time. So, we suggest using maxEpoch=40 or 50, and we also suggest using several different NNs in the ensemble. This will also ensure that even if one training is stuck in a local minimum, other NN trainings may find the global minimum.

**4. Tricky NN surfaces.**

Just as an example, note a somewhat extreme NN function. **Sometimes, even after maxEpoch=500, we got this function.**

The training set was:

Average under 0: -0.048

Average above 0: -0.040

We guess that it is a local minimum in the NN weight space; we doubt it is the global minimum. However, for a set of inputs, it gives a reasonable, correct response.

It is interesting that this kind of complex function can emerge even with a relatively low number of parameters: 2 neurons, 2 biases in the hidden layer, 1 bias in the output layer. This is only 5 weight values.

or another one:

Average under 0: 0.056

Average above 0: 0.023

We reckon that **even though the average value under 0 is positive, the far negative values produce very negative results.**

Easy to imagine: if today we have a huge -4% loss, that suggests tomorrow will again bring a negative %change (because this is probably a bearish environment). Maybe not as big a loss as today, but a small loss.

However, this will also occur if we train the NN not on the nextDay %change but on the nextDay direction (binary training: +1 or -1 in the training set), because even in that context, today's negative values suggest that tomorrow we get another negative number (-1), because of the bearish environment.

So, because of this real-life behaviour, we cannot expect the NN surface to be a decreasing function.

An example of the NN surface for this binary output training (+1/-1 outputs; the input is not binary) can be seen here:

Or another here

Average under 0: 0.010

Average above 0: 0.001

Note that both are positive, but honestly, don’t expect that the NN function will be positive everywhere:

**This is an amazing image.** **This behaviour (small negative is bullish, but big negative is bearish; small positive is bullish, but big positive is bearish) is what we expect most of the time**, and we explained it in many of our previous posts. This is an amazing image, because this continuous, smooth behaviour cannot be approximated by a simple 2-bin Naive Learner method. We need non-linear functions (like the NN tansig()) to represent this kind of behaviour. If you want to remember only one chart of what function the NN represents, remember this one.

**5. The outputBoost dependence.**

**So far we have used outputBoost = 30.** It effectively meant that the output was scaled so that one standard deviation was mapped to -30 and +30 (mean = zero). **The usual NN literature suggests scaling the output SD to -1..+1. However, by trial and error, we found that outputBoost=30 gives the best profit.**

**This was probably overfitting again.**

Even though it produces better performance, **we will not use it in the future. It distorts the training. We will use outputBoost =** 1, that is, no boost at all.

See the LayerOutput values in the debugging process:

**The outputs of the layers very quickly reach such high values that the TanSig() function produces either +1 or -1 (its boundary values),** and no in-between values are produced by the tansig(). This is a meaningless result again, because **for all input values, the output is the same (the NN cannot discriminate). Note that this occurred because of the very high values in the weight matrix: -1154 and -994.** For such high bias or neuron weights, the summation will be a very high value, and **the tansig() virtually maps every input value less than -4 or greater than +4 to its boundary value (+1 or -1).** That happens very quickly as we increase the outputBoost. We want to avoid the weight matrix (neuron or bias) having such high values and the NN being unable to discriminate.
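The saturation effect is easy to verify numerically. A minimal sketch (the weights -1154 and -994 are the values quoted above; the 1% daily-change input is illustrative):

```python
import math

def tansig(x):
    # Matlab's tansig() is mathematically the hyperbolic tangent
    return math.tanh(x)

# beyond roughly |x| > 4, tansig is effectively flat at +/-1
small, big = tansig(0.5), tansig(4.0)

# with weights like -1154 and -994, even a 1% daily-change input saturates
saturated = [tansig(w * 0.01) for w in (-1154.0, -994.0)]
```

Both saturated values are indistinguishable from -1 to many decimal places, which is exactly the "cannot discriminate" failure described above.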

Because of that, look at the NN surface chart

**It is not smooth; it zigzags like a staircase.**

We learned another important lesson here.

Even though outputBoost=30 gave the highest profit, we decline to use it in the future. We will not boost the output any more (boost = 1).

**These are insights that can only be gained by rigorously debugging, plotting, inspecting and understanding all the little details of this very simple NN.** And that is what is usually missing from other NN users' work. They are just too lazy to spend time on understanding it; they only want to use it. However, the NN is such a complex and delicate automaton that we don't think great, reliable and stable results can be reached without understanding it.

**In brief, we gained three insights here:
– we cannot eliminate randomness by initializing the weight matrix to zero;
– use at least maxEpoch=40 (or 50 or 100) instead of 20 (even though 20 produced the best profit);
– use outputBoost = 1 (no boost) instead of 30 (even though 30 produced the best profit).**

]]>

– RUT is less popular than SPX or DJI in the trader community. Therefore, it is less likely that clever traders/computer bots have optimized away the inconsistencies of that market.

– RUT has higher beta (higher variance) than the other indices (RUT is comparable to the HSI beta), therefore if our method can produce alpha (profit), that alpha is more expressed in RUT than in the other, slower moving indices.

In this study we tested the following 5 indices: RUT, SPX, QQQ, DJI, and HSI.

Luckily all of them are available from 1987-09-10, so the comparison is fair, because all of them are based on the same period.

However, note that the number of days is not perfectly equal. For example, HSI has fewer days, because the Hong Kong market has different bank holidays than the US market.

And there are also slight differences in the number of days even among the US indices, but they are hardly worth mentioning.

We tested different strategies.

-Buy&Hold,

-daily Mean Reversion,

-daily Follow Through,

-Naive Learner with 2 bins and 4 bins and the

-continuous NN prediction.

For geeks, here is the code we used for the different strategies.

And some performance charts.

Portfolio Value at the end, assuming $1 invested:

Geometric Cumulative Annual Growth Rate %:

We highlighted the cells that were discussed in our previous posts. Note that they are not exactly equal to our previous measurements. The reason is that another month has passed: in this study we used price quotes until 2011-03-29, while the previous studies used quotes until 23rd February.
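For reference, the Portfolio Value and CAGR tables are linked by a simple relation; a minimal sketch with hypothetical numbers:

```python
def cagr(pv_end, years, pv_start=1.0):
    # Geometric Cumulative Annual Growth Rate from start and end portfolio value
    return (pv_end / pv_start) ** (1.0 / years) - 1.0

# hypothetical example: $1 grown to $4 over 20 years is about 7.2% per year
growth = cagr(4.0, 20)
```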

Notes:

– The most important conclusion is that the **RUT performance is better in all the strategies. We are not surprised.**

– looking at the **Buy&Hold** CAGR table, we read that **the Hong Kong HSI gave the most return: 10.27% annually**, QQQ: 9.28%, DJI: 8.31%, RUT: 8.1%, SPX: 7.47%. To be frank, we are surprised by the Buy&Hold performance of the DJI; we expected it to be the least profitable. However, note that in 2011 we are still in the aftershock of the 2008 financial crisis, and we assume that DJI fell less in those years than the other indices. That gives DJI a relative advantage, but we don't expect this advantage to be kept in the near future.

– We bolded the Naive Learner (2 bins) strategy results. We would like to emphasize their importance. This is what is very easily achievable by almost any adaptive (trained) strategy. Instead of playing rigid pre-determined rules (like MR, FT), we should adapt our rules to the last X days (200 days in our backtests).

– in the CAGR table, **compare the Buy&Hold annual profits against the Naive Learner profit. The adaptive NL was always better than the B&H** strategy. It was only slightly better on low-beta indices:

DJI: 8.31% B&H, 8.55% NL2bin

SPX: 7.47% B&H, 9.5% NL2bin

but inspect how significant the gain was in high beta indices:

HSI: 10.27% B&H, 17.81% NL2bin

QQQ: 9.28% B&H, 25.03% NL2bin

RUT: 8.1% B&H, 27.89% NL2bin

Note, however, that playing this in real life may result in daily rebalancing with significant commission and bid-ask spread losses if zero-cost funds are not used (but we suggest using them).

– in the CAGR table, **compare the NN learning strategy against B&H: NN is better everywhere** (albeit in DJI the difference is negligible)

Note that the backtest used 51 NN ensembleMembers for voting in the NN strategy case.

– in the CAGR table, **compare NN learning against the Naive Learner 2 bins: 3 out of 5 times the NN is better than the deterministic Naive Learner.** So, we can say that **the complex and difficult NN strategy is better than the simple Naive Learner. This is something we have strived for. Without it, there would be no point in doing complex NN learning and our efforts would not be well rewarded.**

– **the directional accuracy is highest in the RUT case, probably because of our optimization.**

– **we are not concerned that these backtest performances with the other indices (SPX, QQQ, DJI, HSI) are lower or slightly lower than in the RUT case.** There are **two reasons for this:**

1. among the indices, the RUT gives the highest return. No wonder: **RUT has the highest volatility of all of them**. Note that the more volatile the index, the more profit the training algorithms produce (DJI is the least volatile; that is where we have hardly any extra profit compared to B&H).

2. don’t expect as good a performance as for the RUT, because **we optimized our method on RUT**:

Optimal outlierThreshold: 4% (for SPX, another would be better)

optimal inputBoost and outputBoost;

optimal lookbackdays for learning: 200

– So, **it is not bad that we got poorer results on non-RUT indices.** However, **it would have been a big warning sign if we had got negative profit, a loss, on other indices.**

Indices have different characteristics; we would optimize the parameters to different values according to the character of the underlying index.

Note that parameters optimized on the past don’t guarantee the same performance in the future.

In fact, they guarantee that the performance will be worse.

However, we cannot do better than using the past, using recency, to optimize the parameters. By betting that in the future similar parameters will be optimal as in the past, we assume the optimal parameters will not change too much. That is the correct strategy.

We don’t expect that the stellar past performance will be repeated again, but we expect a little bit smaller, but similar performance in the future.

– The whole study **suggests another strategy: a self-optimizing NN. If the main cause of this underperformance in (SPX, DJI, HSI, QQQ) is the parameter optimization (and not the low beta), we have a solution: do the parameter optimization ‘on-the-fly’**, based on the last 10 years of data (or all the available past data). For example, this is how to determine the optimal lookback days. Currently, on every day, we train 1 NN with the past 200 days of values. Instead of this, on every day, do a backtest with a fixed set of 50, 60, 70, 80, .. 380, 390, 400 lookback days. That means training one NN per candidate lookback for EVERY day. Do a running backtest in which you keep track of the performance of each of these NNs. On a given day, the self-optimizing NN strategy would select the NN that is currently the winner and play that. You can imagine the computational requirements of this backtesting.
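The selection step of this hypothetical self-optimizing strategy could be sketched as follows (names are ours; the per-variant NNs and their running backtests are assumed to exist elsewhere):

```python
# one NN variant per candidate lookback: 50, 60, ..., 400 days
lookbacks = list(range(50, 401, 10))

def pick_best_lookback(running_pv):
    """running_pv: dict mapping lookback days -> that variant's running
    portfolio value in the walk-forward backtest. Each day we play
    tomorrow's signal from the variant that is currently the winner."""
    return max(running_pv, key=running_pv.get)
```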

]]>

See the details there; for example there we explained what those different MMType values mean in this test.

That strategy (the Naive Learner) had some losing streaks and some winning periods. So, it was a good candidate for the ‘playing the equity curve’ method.

**Playing the equity curve means that we time that strategy. It doesn’t mean that we time the market! That is a very, very important observation here. To repeat: we time the strategy!**

Even in a bear market, we can have a strategy that is a winning strategy. In that case, we should amplify the strategy and not stop loss it.

**Buy&Hold is another strategy:** it means we buy $1 of the Russell 2000 index in 1987 and hold it until today. Let’s see what we achieve in 23 years with and without the money management timing.

Portfolio Value charts. Starts at $1.

**Buy&Hold** (MMType=2, leverage = 1), without money management:

Let’s see if we have **50% reduced position under MA: **(MMType=6)

MA(50):

Note how nicely the maximum loss, the DD is reduced by this technique. (Compared to the original, leverage = 1 case)

Let’s see if we have **50% reduced position under MA and 50% increased position above MA**: (MMType=7)

MA(50):

Note that with about the same maxDD, the PPV increased significantly. (Compared to the original, leverage = 1 case)

And the whole performance measurements in tables: (PV, maxDD and CAGR charts)

Some notes:

– Between the MAs, the **50 days MA is the winner.**

– the MA200 and the MA1000 actually decreased the performance (albeit improved the maxDD).

**The MA200 is a technique commonly used by Buy&Hold investors.**

Note that it does not work very profitably here (but we cannot say that it is too bad).

Using MMType=5 **(playing 0 leverage under the MA200), the PV is 3.33. That seems bad compared to the Buy&Hold PV of 5.38.
However, note that the maxDD improved from -59% to -44%. So the PV got worse, but the maxDD improved.
We can usually improve one thing: profit or volatility. It is very rare that we can improve both at the same time.**

– if we have to use the MA200, we would use MMType=6; that is a good balance that improves the maxDD but doesn’t significantly decrease the PV (it went from 5.38 to 4.55)

– **it is interesting that MMType=8 (50% increase over the MA) was worse in PV than MMType=7 (50% decrease under the MA, 50% increase above it). This is not what we got in the previous post, in the Naive Learner case. Actually, this is what we hoped for: MMType=7 (by decreasing leverage under the MA) could improve both the PV and the maxDD compared to MMType=8!** That is amazing! We decreased the leverage and improved both PV and maxDD. We usually don’t expect this to happen; we only hope for it.

– an observation on MMType=5 (0% leverage under the MA; a stop loss): based on the PV charts of the previous and the current post, it is not proven to us that we should use MMType=5 and cut the investment to zero under the MA. In the Naive Learner case (previous post), all the PVs for MMType=5 were lower than for MMType=6, and the maxDD doesn’t improve significantly. This is a trade-off, but we don’t think the small additional maxDD improvement is worth being left out of so much profit. Therefore, we don’t suggest MMType=5 in general.

– **Using the MA(50)**, and comparing it to the original (leverage=1) version, we can contend:

**1. If we target the same PV=5.38 (as the original) but want to decrease the maxDD, use MMType=6, MA(50): similar PV (7.76, even better!), but the maxDD went from -59.89% to -47%.
2. If we target the same maxDD=-59% (as the original) but want more profit, use MMType=7, MA(50): similar maxDD (-57.72%), but the PV went from 5.38 to 21.31 (about 4 times more).**

Conclusion:

– **Question: Does money management work not only for the Naive Learner strategy, but for the Buy&Hold strategy too?
Answer: Yes, it works very similarly** to the Naive Learner case. It works well.

– We expect that there is no general MA (either 50 or 200) that works for every strategy. We reckon that **every strategy has its own optimal MA**. So it was strange to find that the same MA, the MA50, seems to work best for the Naive Learner (short & long) strategy as well as for the Buy&Hold (long only) strategy. Probably if we refined these numbers (by backtesting MA40, MA60, MA180, etc.) we could find differences in the optimal MA, but it is interesting to **observe that trends in both strategies (losing or winning) last for about the same 50 days**. 50 trading days corresponds to about 2.5 calendar months.

– **for RUT Buy&Hold, we would not suggest using the MA200 as a guide for a general Stop Loss strategy. However, this is common among investors in real life. This test shows that MA50 is better, but we haven’t optimized this parameter value.** We haven’t tried other MA values. However, we reckon that other indices (SPX, Hang Seng) will have other optimal MA values. **Even though this test showed that MA50 is best for this case, in general we tend to suggest using MA100 for Buy&Hold strategies. This minimizes whipsaws (fewer failed signals), and it should still give a PV similar to the MA50 case. So, for a general Buy&Hold strategy (long only), we would suggest using MA100 and MMType=7 (50% decrease under MA100, 50% increase above MA100). This should assure at least triple the PV, with about the same maxDD as the original Buy&Hold strategy.** And because we switched to MA100, very few extra trades (negligible commissions and friction costs) will occur.

]]>

To study position sizing, we omitted the NN, because the NN introduces some randomness into the study. We wanted to show some results without that stochastic nature, so **we use the 2-bin Naive Learner** in this post (introduced in the first post this year). This agent has a lookback of 200 days; it **builds statistics on what the average ‘next day profit’ was on Up days and on Down days, and it gives a forecast for the next day’s %gain based on these statistics. Note that this algorithm is adaptive, it has a noticeable profit (29% CAGR), and most of all it is deterministic.**
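Based on this description, a minimal sketch of the 2-bin Naive Learner (our reading of the agent; the function name is ours):

```python
def naive_learner_2bin(returns, lookback=200):
    """Forecast tomorrow's %change as the lookback-window average of the
    next-day return, conditioned on whether today was an Up or a Down day."""
    window = returns[-(lookback + 1):]          # need (today, tomorrow) pairs
    up_next, down_next = [], []
    for today, tomorrow in zip(window[:-1], window[1:]):
        (up_next if today > 0 else down_next).append(tomorrow)
    bucket = up_next if returns[-1] > 0 else down_next
    return sum(bucket) / len(bucket) if bucket else 0.0
```

The forecast's sign gives the next day's direction; it is deterministic by construction, which is exactly why it is used for this study instead of the NN.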

We usually don’t like using a stop loss strategy, because in every short-term strategy it induces a loss in the backtest.

We haven’t seen any backtest that showed a profit gain from it. Most of the time, the advantage of a stop loss is as a hedge to reduce the big drawdowns.

Recently **Jez Liberty** showed us that the stop loss method is very important. **Playing random entry points (blind monkey entries) but using trailing stop losses, he achieved 18% CAGR.**

His main motto came from David Harding:

**”
If you put in stops and run your profits and trade randomly you make money; and if you put in targets and no stops, and you trade randomly you lose money. So the old saw about cutting losses and running profits has some truth to it.
“**

See the details here:

http://www.automated-trading-system.com/trend-following-monkey-style/

And here:

http://www.automated-trading-system.com/further-musings-on-randomness/

**Stop loss is a kind of money management. There are other techniques to manage positions as well. The main point here is to avoid big losses**. For example, another money management technique is playing the equity curve.

**The basic idea is that we reduce the position to half or zero if we are under an MA (e.g. the 200-day MA) of the equity curve (= portfolio value).**

We insert another relevant quote from this site

http://cortex.snowcron.com/forex_nn.htm

”

**We can see, that winning and loosing trades are going in series. This certainly can be used to improve out results in one of two ways.
First: we can use the profit curve to adjust the size of the lot. way we will use larger lots during winning periods, and smaller lots during times, when our system does not perform well.
Of course, this approach will not work, if we have series of loosing trades, with single winning trade between them, so it is necessary to do a careful study of the profit curve.**

This approach is an example of money management strategy, and it can improve some trading systems dramatically.

”

For this we have to **track the theoretical portfolio Value (TPV) separately from the Played Portfolio Value (PPV).**

**The MA should be calculated based on the original, ‘theoretical’ PV.** If not, imagine a situation where the Played PV dives under the MA(200). Now it is under the MA(200), so its position is reduced to half or zero. **If the position is reduced to half, it will take a long time to gain enough profit to move back above the MA(200), since all positions, even the winning ones, are halved.**

In the case where the position is reduced to zero, the strategy can never go back above the MA(200), since it stops accumulating profit or loss; it stays constant. After a while, the MA(200) will come down to this constant PPV, but this is not how we want to play this strategy.

We made a mistake in our first backtest implementation and didn’t notice how important it is that the ‘playing the equity curve’ strategy keeps two versions of the PV: the TPV and the PPV.

We defined different money management types (MMType). This code illustrates the possibilities:

Note:

**– MMType=5 is a virtual stop-loss strategy: exiting the position fully under the MA.
– MMType=6 is a semi-stop-loss strategy, keeping 50% in the position under the MA.
– some MMType methods can not only decrease the position when we are in a losing period, but also increase it by 50% when we are in a winning period (above the MA).
– MMTypes 5, 6, 7 and 8 have another important parameter: the Moving Average lookback period, which we varied as 1, 5, 50, 200, 1000.**
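The TPV/PPV bookkeeping can be sketched like this (a hedged sketch; the leverage mapping is illustrative, not the actual MMType code). The key point from above is that the MA is computed on the theoretical curve, never on the played one:

```python
def play_equity_curve(daily_returns, ma_days=200, lev_below=0.5, lev_above=1.0):
    """Returns (TPV, PPV). The MA is taken on the Theoretical PV (TPV);
    the Played PV (PPV) applies the reduced/boosted leverage."""
    tpv = ppv = 1.0
    tpv_history = [tpv]
    for r in daily_returns:
        window = tpv_history[-ma_days:]
        ma = sum(window) / len(window)
        lev = lev_below if tpv < ma else lev_above
        ppv *= 1.0 + lev * r
        tpv *= 1.0 + r                 # the theoretical curve is never scaled
        tpv_history.append(tpv)
    return tpv, ppv
```

With lev_below=0.5 this corresponds to the halved position under the MA; lev_below=0.0 would be the full stop-loss variant, which can still re-enter later precisely because the TPV keeps moving while the PPV is flat.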

The original naive learner gives the following equity curve: (leverage = 1 (always))

This chart shows the Portfolio Value, so the chart starts from $1.

We use a **red line to illustrate the MA(200)**, to give an intuitive feeling for those periods when the money management methods reduced the position.

The original Naive Learner gives the following equity curve when we **boost it with leverage = 2** (with double, or Ultra, ETFs, 2x daily leverage can be played in a simple (even IRA) account; no margin account is needed):

That is something that can make you a millionaire. **It gives an amazing 60% CAGR, but the pain is significant: a -78% DD. And that DD lasted for 7 years. Nobody can bear that. Imagine that you started the strategy exactly at the top in 2000: in 2007, you have a -78% loss. That is a pity, because in 23 years this strategy gives $34,000 for every $1 invested. On another scale: you invest $1K, and you get $34M in 23 years. You are more than a millionaire. But let’s forget this fact.
It is only a theoretical profit, in hindsight, with some parameter optimization (200-day lookback); and, as we said, there is no investor in this world who could bear the pain of a -78% loss over 7 years and continue this strategy.**

Let’s see **when we have 50% reduced position under MA:** (MMType=6)

MA(50):

MA(200):

Note how nicely the maximum loss, the DD is reduced by this technique. (Compared to the original, leverage = 1 case)

Let’s see when we **have 50% reduced position under MA and 50% increased position above MA**: (MMType=7)

MA(50):

MA(200):

Note that with about the same maxDD, the PPV increased significantly. (Compared to the original, leverage = 1 case)

And the whole performance measurements in tables: (PV, maxDD and CAGR charts)

Some notes:

**-Between the MAs, the 50 days MA is the winner.**

**– Using the MA(50), and comparing it to the original (leverage=1) version, we contend:
1. If we target the same PV=295 (as the original) but want to decrease the maxDD, use MMType=6, MA(50): same PV, but the maxDD went from -47% to -34%.
2. If we target the same maxDD=-47% (as the original) but want more profit, use MMType=7, MA(50): same maxDD, but the PV went from 295 to 3,521 (more than 10 times more).**

Conclusion:

– **Question: Does money management work?
Answer: Yes, it works;** the only problem is that it is difficult to define the ‘optimal’ parameter: which MA to use? (For this task and this period it was the 50-day MA, but for other tasks and other periods the magic number will differ.)

– **Which MMType should we play?**

**Risk taker traders should prefer MMType=7 to ‘maximize’ gain; conservative investors should prefer MMType=6 to minimize the maximum loss (DD).
**

]]>

– **Two of the most useful ways to standardize inputs** are:

o **Mean 0 and standard deviation 1** (we call it StD normalization)

o **Midrange 0 and range 2** (i.e., minimum -1 and maximum 1) (we call it MinMax normalization)

-NN FAQ

We highly recommend that the reader study this link from the **Neural Network FAQ**, in which **a statistician answers the question:
“Should I normalize/standardize/rescale the data?”**

http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html

and here are the illustration images:

http://www.filewatcher.com/b/ftp/ftp.sas.com/pub/neural.0.0.html

“… must be in the interval [0,1]. There is in fact no such requirement, although there often are benefits to standardizing the inputs as discussed below. But it is better to have the input values centered around zero, so scaling the inputs to the interval [0,1] is usually a bad choice.”

– we have to mention that we don’t want to subtract the mean, because we would lose important information. (We tested this half a year ago in Matlab; we posted an article about normalization then.)

from NN FAQ:

”

Standardization of cases should be approached with caution because it discards information. If that information is irrelevant, then standardizing cases can be quite helpful. If that information is important, then standardizing cases can be disastrous. ”

We reckon that subtracting the mean from the inputs changes a very important aspect: even a tiny shift can change an Up day into a Down day. And we think people treat yesterday’s Up days emotionally differently, even if they were up only a tiny amount. This is the opinion of other bloggers too (Michael Stokes). Therefore, we never want to convert an Up day into the Down day category. At least, our backtest in Matlab (half a year ago) showed that we shouldn’t do that.

-Inputs in the range -0.1..0.1 are bad for you.

Note this quote from the NN FAQ, which applies to our case (e.g. our currDayChange input of 1% was represented as the real number 0.01):

“It is also bad to have the data confined to a very narrow range such as [-0.1,0.1], as shown at lines-0.1to0.1.gif, since most of the initial hyperplanes will miss such a small region.”

To illustrate why, see a couple of images:

Good distribution of initial hyperplanes:

ftp://ftp.sas.com/pub/neural/lines-1to1.gif

Bad distribution of initial hyperplanes (the training will be slow, with more chance of getting stuck in a local minimum):

ftp://ftp.sas.com/pub/neural/lines-0.1to0.1.gif

“Thus it is easy to see that you will get better initializations if the data are centered near zero and if most of the data are distributed over an interval of roughly [-1,1] or [-2,2].”

“The connection weights from the inputs to a hidden unit determine the orientation of the hyperplane. The bias determines the distance of the hyperplane from the origin. If the bias terms are all small random numbers, then all the hyperplanes will pass close to the origin. Hence, if the data are not centered at the origin, the hyperplane may fail to pass through the data cloud. If all the inputs have a small coefficient of variation, it is quite possible that all the initial hyperplanes will miss the data entirely. With such a poor initialization, local minima are very likely to occur. It is therefore important to center the inputs to get good random initializations. In particular, scaling the inputs to [-1,1] will work better than [0,1], although any scaling that sets to zero the mean or median or other measure of central tendency is likely to be as good, and robust estimators of location and scale (Iglewicz, 1983) will be even better for input variables with extreme outliers. ”

– in this post we compare MinMax and StD normalization. Note that in this study, we didn’t re-center the data; only scaling (multiplication) was applied.

We have 2 parameters: inputBooster and outputBooster.

In the MinMax normalization case, inputBooster = 10 means that every input was multiplied by 10. We wanted an almost equivalent case for the StD normalization. By trial, we found that the almost equivalent multiplier in the StD normalization was 10 divided by 7. This gave the basis for comparing the MinMax and StD normalizations.
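The two scale-only normalizations, as we understand them (no mean subtraction, so an Up day can never become a Down day; the function names are ours):

```python
def std_normalize(xs, booster=1.0):
    """Divide by the standard deviation (computed around the mean),
    but do NOT subtract the mean: the signs of the inputs are preserved."""
    mean = sum(xs) / len(xs)
    sd = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [booster * x / sd for x in xs]

def minmax_normalize(xs, booster=1.0):
    """Scale so the largest magnitude maps to +/-1; again, no re-centering."""
    m = max(abs(x) for x in xs)
    return [booster * x / m for x in xs]
```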

– Some parameters of the backtest:

int nRebalanceDays = 1;

int lookbackWindowSize = 200;

double outlierThreshold = 0.04;

int nNeurons = 2;

int maxEpoch = 20;

int[] nEnsembleMembers = new int[] { 11 };

int nTest = 7; // per cell

In the figures, the maximum values are bolded. The averages of the rows and columns are also presented.

See the TR (Total Return) chart for:

for the MinMax:

for the Std case:

We may compare the D_stat (Directional accuracy) for

for the MinMax:

for the Std case:

– if you look at the MinMax versions in the dStat table, you never find values above 57.0%. However, in the StD normalization case, there are 3 places where it is above 57%. That is very weak evidence that StD normalization is better than MinMax.

– Another reason why we like the STD normalization:

Our inputs and outputs are not ranged variables, but (roughly) normally distributed random variables. Because our input is currDayChangePercent, which is a random variable, we cannot determine its max and min values. Therefore, it is better to use standardization (normalizing by the standard deviation). In another prediction task, where we would use dayOfWeek as an input (that input is not an unbounded random variable; its max and min values are clear before running the process: 1 for Monday, 5 for Friday), we would use MinMax normalization, not StD normalization.

– The output is always a random variable, so we had better use StD normalization.

– TR table, InputBooster:

For MinMax normalization, the best input booster is 10 (minmax2)

For StD normalization, the best is the inputbooster=1.4 (std3)

– TR table, OutputBooster:

For MinMax normalization, the best output booster is 50 (minmax2)

For StD normalization, the best is the outputbooster=28.57 (std3)

– One problem we didn’t understand at first: the output has to be scaled up by 50 to be optimal. Weird, isn’t it?

The question was raised. **Can the NN generate outputs above 200?**

For example, with outputBoost=100 almost all training outputs will be bigger than 1; our activation function, however, can generate values only in -1..+1.

– At first, we thought we could equivalently convert outputs to +1/-1; maybe it is equivalent **to learn the sign as output. Backtested: No! Learning the Sign() output results in only TR = 25,000 and 11,000 (in 2 backtests).**

– Our problem is solved: **the NN can generate output = 200, because our output layer activation function is linear. (We have 2 neurons.) So the output = ActivationFunction(bias_output * weight_output + neuron1_output * W_neuron1 + neuron2_output * W_neuron2); our activation function is linear and our weights can be anything, so our output layer can generate values above 200.** Note that the neuron1 and neuron2 outputs are confined to -1..+1 (because of the tanh() activation function). With proper training, the output scaling shouldn’t matter. Still, the optimal outputBooster is 30 times the STD; why that is the case is still a mystery.
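To see why a linear output activation is unbounded, here is a toy 1-2-1 forward pass with hypothetical weights (not the trained values):

```python
import math

def forward(x, w_in=(1.0, -1.0), b_hid=(0.0, 0.0),
            w_out=(150.0, -150.0), b_out=10.0):
    # hidden layer: tanh-bounded to -1..+1
    h1 = math.tanh(w_in[0] * x + b_hid[0])
    h2 = math.tanh(w_in[1] * x + b_hid[1])
    # output layer: LINEAR (identity) activation, so no bounding at all
    return w_out[0] * h1 + w_out[1] * h2 + b_out

big_output = forward(3.0)   # well above 200 despite the bounded hidden units
```

The hidden activations can never exceed 1 in magnitude, but the linear output layer multiplies them by arbitrarily large weights, so boosted training targets remain reachable.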

From the NN FAQ:

“If the target variable does not have known upper and lower bounds, it is not advisable to use an output activation function with a bounded range. You can use an identity output activation function or other unbounded output activation function instead; see Why use activation functions? ”

This is exactly what we used: having no activation function, our output is not bounded.

**– We should not make too much of the fact that a TR=87,000 performance means multiplying the initial $1 by 870 in 23 years (35.46% CAGR). The reason is that these kinds of backtest results depend heavily on parameter fine-tuning, and finely tuned, successful past parameters can never give the same ‘best’ result in the future.** So, this very successful result is only theoretical.

– As a curiosity, we also tried **capping (clipping, clamping) input values larger than 1 STD away to 1 STD.** Our backtests show that it didn’t perform as well: **it showed -5% to -20% TR performance compared to the case in which we only did StD normalization without clipping the inputs to 1 STD.** It shows that by capping the inputs, we lose precious data that can be useful in the prediction. So, even though this is suggested in a couple of places, we don’t advise using it.

– Conclusion:

**we would like to use StD normalization instead of minMax normalization, because
– we find its performance a little bit better,
– it is better suited to our input (if our input is a normally distributed random variable, its max and min values cannot be determined),
– it is generally suggested by the literature.
– we will use inputBooster = 1 (for the StD normalization).
– we will use StD normalization for the output too, with outputBooster = 30.** We also note that this looks like parameter back-fitting (in hindsight); this value performed best in backtests. The reason for its surprisingly high value could be numerical precision issues in the special training process.
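A minimal Python sketch of this boosted StD normalization (the names inputBooster/outputBooster follow the text above; the factoring into a normalize/denormalize pair is ours, not an Encog API):

```python
import statistics

def make_std_normalizer(samples, boost=1.0):
    """Map a value v to boost * (v - mean) / std; the inverse undoes it."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)
    def normalize(v):
        return boost * (v - mean) / std
    def denormalize(n):
        return n / boost * std + mean
    return normalize, denormalize

daily_changes = [0.4, -1.2, 0.7, 0.1, -0.3, 2.5, -0.9]  # % changes, illustrative
norm_in, _ = make_std_normalizer(daily_changes, boost=1.0)            # inputBooster = 1
norm_out, denorm_out = make_std_normalizer(daily_changes, boost=30.0)  # outputBooster = 30

raw_prediction = norm_out(0.7)              # what the net is trained to emit
print(round(denorm_out(raw_prediction), 6))  # 0.7 -- round-trips back to percent space
```

Whatever booster is used, the denormalization step undoes it exactly, so the booster only changes the numeric range the network trains in.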

********************

As the activation function, we cannot use the sigmoid function, because it is limited to the 0..1 output range. A commonly used alternative is the hyperbolic tangent. Its derivative is easy to compute, which comes in handy in the training algorithm.
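The "easy derivative" claim can be checked numerically: if y = tanh(x), then dy/dx = 1 - y², so the value already computed in the forward pass can be reused (a small Python check of ours):

```python
import math

def tanh_derivative_from_output(y):
    # If y = tanh(x) was already computed in the forward pass,
    # the derivative is simply 1 - y*y -- no extra transcendental call.
    return 1.0 - y * y

x = 0.37
y = math.tanh(x)
numeric = (math.tanh(x + 1e-6) - math.tanh(x - 1e-6)) / 2e-6  # central difference
print(abs(tanh_derivative_from_output(y) - numeric) < 1e-8)   # True
```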

We target 3-layer networks. The number of neurons in the input and output layers is fixed by our problem. As we work with the daily currentChangePercent as input and target nextChangePercent as output, the numbers of input and output neurons are fixed: 1 and 1. (Note: we do regression and not classification, so our output is also 1-dimensional. In classification, we can easily have multidimensional outputs.) We have some freedom in choosing the number of neurons for the middle layer.

For this study, it is also fixed: 2.

Other parameters of the algorithm:

```csharp
int lookbackWindowSize = 200;
double outlierThreshold = 0.04;
int nNeurons = 2;
int maxEpoch = 20;
int[] nEnsembleMembers = new int[] { 11 };
```

We have the freedom to choose the activation functions and the bias in each layer. We experiment with 3 different NN structures, i.e. **3 different activation function choices:
1. Encog FeedForwardPattern(); (Tanh, Tanh, Tanh) layers, bias in input layer
2. Matlab emulation; (Lin, Tanh, Lin) layers, no bias in input layer
3. Jeff network; (Tanh, Tanh, Lin) layers, no bias in input layer**

For robustness, we also vary these normalization boost parameters as 0.1, 1, 10 and 100:

```csharp
double inputNormalizationBoost = 0.1;
double outputNormalizationBoost = 0.1;
```

The code that handles this:

```csharp
if (p_nnStructure == 1)
{
    // version 1: most popularly used in Encog examples
    // the default Generate() gives:
    // - all 3 layers have TanH() activation (stupid; only the middle or maybe the output layer should)
    // - the last layer has no bias, but the first and second have one
    // - the middle layer has 2 biases (the biasWeight is 2-dimensional)
    // - the biasWeights are initialized randomly
    FeedForwardPattern pattern = new FeedForwardPattern();
    pattern.InputNeurons = 1;
    pattern.OutputNeurons = 1;
    pattern.ActivationFunction = new ActivationTANH(); // the Sigmoid cannot represent negative values
    pattern.AddHiddenLayer(p_nNeurons);
    network = pattern.Generate();
}
else if (p_nnStructure == 2)
{
    // version 2: Matlab emulation;
    // consider the newFF() in Matlab:
    // 1. The input is not a layer; no activation function, no bias
    // 2. The middle layer has a bias and a tansig transfer function
    // 3. The output is a layer; it has a bias (we checked) but Linear activation (in the default
    //    case); in the Matlab book, there are examples with tansig output layers too
    network = new BasicNetwork();
    network.AddLayer(new BasicLayer(new ActivationLinear(), false, 1));
    network.AddLayer(new BasicLayer(new ActivationTANH(), true, p_nNeurons));
    network.AddLayer(new BasicLayer(new ActivationLinear(), true, 1));
    network.Structure.FinalizeStructure();
}
else if (p_nnStructure == 3)
{
    // version 3: Jeff uses it
    // "I've been using a linear activation function on the output layer, and sigmoid or htan
    //  on the input and hidden lately for my prediction nets, and getting lower error rates
    //  than a uniform activation function." (uniform: using the same on every layer)
    network = new BasicNetwork();
    network.AddLayer(new BasicLayer(new ActivationTANH(), false, 1));
    network.AddLayer(new BasicLayer(new ActivationTANH(), true, p_nNeurons));
    network.AddLayer(new BasicLayer(new ActivationLinear(), true, 1));
    network.Structure.FinalizeStructure();
}
```

In the previous blog we mentioned these differences between Matlab and Encog:

-Encog's default 3-layer Backprop network has extra biases and neurons. It differs from Matlab. More specifically:

A.

The newFF() in Matlab network is:

1. The input is not a layer; no activation function, no bias. It has only 2 layers (the middle and the output)

2. The middle layer has a bias, and tansig transfer function

3. The output is a layer; having a bias (we checked); but it has Linear activation (in the default case); in the Matlab book, there are examples with tansig output layers too

B.

The default FeedForwardPattern().Generate() in Encog gives:

1. All 3 layers have TanH() activation. We found it weird. Only the middle or maybe the output layer should have an activation function.

2. the last layer has no bias, but the first and second have

3. the middle layer has 2 biases (the biasWeight is 2-dimensional) in the 2-neuron case

4. the biasWeights are initialized randomly. Correct; that is expected.

Here are the measured results. Note we also measured the average network training error (avgNerr column), but there was no significant difference between the different NN structures.

We take the average gCAGR (Geometric Cumulative Annual Growth Rate) as the performance measure, but the TR (Total Return) can be used as well.

Here are our measurements: (click them for a non-blurry, non-scaled version)

For normalizationBooster = 0.1:

For normalizationBooster = 10:

For normalizationBooster = 100:

**Conclusions:**

– we made 4 tests for robustness (for different normalization boost levels).

– Using the average gCAGR values, we can say nnStructure1 was the winner 2 times, nnStructure2 was the winner once and nnStructure3 was the winner once.

– Using the average TR as a performance measure, nnStructure1 was the winner once, nnStructure2 was the winner 2 times and nnStructure3 was the winner once.

– if we take the SD (standard deviations) into account, the decision boundary is even blurrier

– **we conclude that neither gCAGR, nor TR, nor even PvMaxDD can differentiate between the 3 network structures**

– the avgNetworkTrainingError also cannot differentiate between the network structures

– overall, **it doesn't matter which structure we use. They are almost equivalent. So, this study doesn't have much of a result: we wanted to find the best NN structure and we couldn't, because all of them are equally good.
– However, as a decision has to be made, we prefer structure 2, because linear input and output layers make the most sense to us and assure simplicity. As Einstein said: make everything as simple as possible, but not simpler. (Also Occam's razor: a principle that generally recommends selecting the competing hypothesis that makes the fewest new assumptions, when the hypotheses are equal in other respects.) We also like to omit the bias in the input layer.**

– also, I am not sure Encog uses the activation function of the first layer at all, in which case it doesn't matter which we choose. (We made a separate study of this in the postscript of this blog post. Geeks should see that too.)

**– note a fallout of this study: the optimal value of the normalization boost parameter is around 10.** This tells us 2 things.

1. The optimal boost is bigger than 1. This tells us that the max and min input values shouldn't be crammed exactly into the -1..+1 range. So, this regression task prefers not to use the usual minMax normalization (where the min value is mapped to -1 and the max value to +1). This problem prefers the stdDev kind of normalization, where values one standard deviation away from the mean are mapped to -1 and +1. So, values outside of one SD can be mapped to values larger than +1.

2. The optimal boost is lower than 100, so after a point, numerical problems occur. The NN's preferred output range is -1..+1, not -100..+100. After stretching the range too much, it cannot forecast precisely into it. A solution may be to increase the input boost but keep the output boost small (but that is another study).
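The contrast in point 1 can be made concrete (our Python illustration with made-up numbers):

```python
def min_max_normalize(v, lo, hi):
    # Classic minMax: lo -> -1, hi -> +1; everything observed is crammed into [-1, +1].
    return 2.0 * (v - lo) / (hi - lo) - 1.0

def std_normalize(v, mean, std):
    # StdDev style: +/-1 standard deviation -> +/-1; outliers may exceed +/-1.
    return (v - mean) / std

# A one-standard-deviation move (mean 0, std 1, observed range [-3, +3]):
print(min_max_normalize(1.0, -3.0, 3.0))  # ~0.333: compressed by the range of the outliers
print(std_normalize(1.0, 0.0, 1.0))       # 1.0: one SD maps to one unit
```

If the observed range spans about 3 standard deviations, minMax under-scales typical values by roughly a factor of 3, which is consistent with the optimal boost being larger than 1.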

**– as we now see that normalization boost = 10 is better than boost = 100, we are happy to conclude that we got even better results than in the last blog post. In the last blog post we only tested boost = 1 and boost = 100.
– The better result achieved by boost = 10 is:
an average TR = 60,000% (Total Return). (It multiplied your initial 1 dollar by 600! In 23 years.) That is about 33.4% CAGR (per year). The directional accuracy is 56.9%. We conclude again that, over such a long period (23 years), this is our best prediction; the best result so far.
– Compare it to the Buy&Hold approach. Buy&Hold multiplied the initial capital by 5 in 23 years (7.9% annual CAGR). The Encog NN multiplied it by 600. That is 120 times more.**

Next time, we will investigate a little money management technique to possibly improve the profit potential. We will increase or decrease the position size based on our conviction.

********************

********************

********************

********************

**Bonus postscript for Encog programmers:** (this chapter is only for geeks)

We wanted to check whether the activation function and bias of the input layer are used in Encog at all. **We traced the XOR example** application that can be found among the Encog framework examples.

**1. Forward propagation, the Compute()**

In Compute(), the activation function of the first layer is never used.

```csharp
public virtual void Compute(double[] input, double[] output)
{
    int sourceIndex = this.layerOutput.Length
        - this.layerCounts[this.layerCounts.Length - 1];

    EngineArray.ArrayCopy(input, 0, this.layerOutput, sourceIndex,
        this.inputCount);

    for (int i = this.layerIndex.Length - 1; i > 0; i--) // process only 2 layers
    {
        ComputeLayer(i);
    }

    EngineArray.ArrayCopy(this.layerOutput, 0, output, 0, this.outputCount);
}
```

Note that with this instruction

`EngineArray.ArrayCopy(input, 0, this.layerOutput, sourceIndex, this.inputCount);`

the **input is simply copied to the layerOutput, without using any activation function.**

So, it says that the output of the first layer (the target array is this.layerOutput) equals exactly the input of the first layer (without any modification, any weight usage, or any activation function usage).

That may be a bug in Encog, or it may be on purpose. (Probably it is intentional.)

**For all other layers, the this.layerOutput array really contains the output; that is, layerOutput = activationFunction(Sum(weights * input) + bias) for that layer. The input layer is the only exception: in effect, the input layer doesn't exist.**

Also, we noticed that the code

`EngineArray.ArrayCopy(input, 0, this.layerOutput, sourceIndex, this.inputCount);`

**copies only the input values (= 2 inputs in the XOR problem), regardless of whether we specified a bias in the input layer or not. Therefore, even the bias of the input layer is never used in Compute().**

**2. Training**

However, the derivative of the activation function of the first layer is used in Train() in

```csharp
private void ProcessLevel(int currentLevel)
{
    ...
    this.layerDelta[yi] = sum
        * activation.DerivativeFunction(this.layerOutput[yi]);
}
```

We also made an experiment: we replaced the derivative with Double.NaN.

```csharp
public virtual double DerivativeFunction(double x)
{
    //return (1.0d - x * x); // orig
    return Double.NaN; // Agy
}
```

The result is that **even though the derivative was called, the NaN was not propagated into the calculations**, and the XOR example returned a proper, flawless result (as if those NaN values were never used). So, we are convinced that even if the derivative is called, its value is not used later. This means Encog does unnecessary calculations (not very efficient). It also means that although we have to specify the input layer as a proper layer (in contrast to Matlab), with a bias and an activation function, these are not used. So Encog effectively has only 2 layers: the middle layer and the output layer (as Matlab has).

Therefore, because in Encog it is irrelevant what the activation function and bias of the input layer are, it is no wonder that our performance study found no difference between structure version 2 and version 3: the only difference between the two structures is that version 2 used linear activation while version 3 used tanh activation in the input layer.

**The disturbing fact is that the Encog programmer has to specify the activation and bias for the input layer, so the user believes he is making a cardinal decision that can greatly affect the output or the prediction power. However, his choice makes no difference.**

We guess Encog misleads the programmer because it is simpler to implement the input as a general layer; in practice, however, it is an empty layer. Beware of the delusion!

********************

Before answering that question, we amend our previous post. In the previous post, we stated that a test in Matlab runs in 35 minutes. That was true, but we also mentioned that that Matlab program does a lot of extra things like outlier elimination, handling VIX and EUR data, etc. For this post, we weeded out the unnecessary Matlab code parts, so now the Matlab and Encog code are exactly the same in functionality.

For the speed comparison, we ran exactly the same task in Matlab and Encog this time: the 2bins input and 2bins output case. This is a different task than the one previously tested; it is much simpler and quicker to execute. That is why even the current Encog measurements differ from the previous Encog measurements. The comparison is more faithful now. The numbers below are for this simpler prediction task, using 1 random sample in the ensemble:

– Matlab: 11 minutes (instead of 35 minutes as stated earlier) = 660 seconds (it uses only 1 core (inspected by the TaskManager))

– Encog (single thread): 8 seconds (in theory, the training algorithm is multithreaded)

– Encog (days run parallel on 4 core PC): 3 seconds

**In practice, we always run the Encog backtests in parallel mode, so the improvement from 660 seconds to 3 seconds is still about a 200-fold speed increase. Thank you, Encog!**

Let's see the different strategies we **ran for the last 23 years (lookback days: 200).**

– The Buy&hold strategy forecasts an up day for all days.

– The Deterministic MR (Mean Reversion) strategy forecasts an up day if the current day is a down day (and vice versa).

– The Deterministic FT (Follow Through) strategy forecasts a down day if the current day is a down day (and vice versa).

– The Naive Learner uses the last 200 days as input. It calculates the probability distribution of the input by binning it into 2 or 4 bins, and gives back the average next-day return of the matching bin.
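A sketch of the 2-bin Naive Learner idea (our Python illustration with made-up numbers, not the actual backtest code):

```python
def naive_learner_forecast(history, today_change):
    """Naive Learner sketch (2-bin case): bin every past day by its sign
    (down vs up) and return the average next-day change observed in the
    bin that today's change falls into."""
    def bin_of(change):
        return 0 if change < 0 else 1  # bin 0: down day, bin 1: up day

    sums, counts = [0.0, 0.0], [0, 0]
    for today, tomorrow in zip(history, history[1:]):
        sums[bin_of(today)] += tomorrow
        counts[bin_of(today)] += 1
    b = bin_of(today_change)
    return sums[b] / counts[b] if counts[b] else 0.0

# Toy history where down days tend to be followed by up days (mean reversion):
history = [-1.0, 0.8, -0.5, 0.6, 0.9, -0.7, 0.4]
print(naive_learner_forecast(history, today_change=-0.3))  # positive (~0.6) -> forecast an up day
```

Because the bin statistics are recomputed over a rolling window, even this simple learner is adaptive.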

For Encog and Matlab, we tested 3 different cases:

– the unnormalized input, output case

– 2bins: the Sign() discrete inputs; this basically converts the input and output to +1 or -1. That is some kind of basic 'normalization'.

– the normalization of the input and output to the [-1..+1] continuous range

(Sometimes we boosted the normalization by a constant boost value. In case of boost = 100, the range becomes [-100..+100].)

We show some portfolio value charts.

See the RUT Buy&Hold for the last 23 years period.

Note the last period from 1998:

The deterministic FT chart here. Note the -90% drawdown:

Note the last period from 2010:

The naive learner, 2bins input case:

The naive learner, 4bins input case:

The Encog NN learner with 2bins input 2bins output:

The Encog NN learner with normalization boost 100:

And we show some performance numbers. The meaning of the columns:

-CAGR: Cumulative Annual Growth Rate

-TR: Total Return

-D_Stat: directional accuracy of the forecasts

You have to click the images to zoom in and see them nicely.

For the non NN (Neural Network) strategies:

During our quest we learnt some **differences between Matlab and Encog:**

–**Encog doesn’t auto-normalize input or output, while Matlab does (that is why Matlab prediction is quite good, even without any normalization)**

-there is no tansig() activation function in Encog (that is the default in Matlab), but there is a similar one, TANH, in Encog. (Note from the Matlab documentation: 'tansig() is mathematically equivalent to tanh(N). It differs in that it runs faster than the MATLAB® implementation of tanh, but the results can have very small numerical differences.' For us, it means the two are the same.)
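The equivalence is easy to verify: per the Matlab documentation, tansig(n) = 2/(1+exp(-2n)) - 1, which is algebraically identical to tanh(n) (a quick Python check of ours):

```python
import math

def tansig(n):
    # Matlab's tansig, per its documentation: 2/(1+exp(-2n)) - 1,
    # a faster rearrangement that is mathematically equal to tanh(n).
    return 2.0 / (1.0 + math.exp(-2.0 * n)) - 1.0

for n in (-2.0, -0.3, 0.0, 0.7, 3.0):
    assert abs(tansig(n) - math.tanh(n)) < 1e-12
print("tansig matches tanh to machine precision")
```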

-There are different learning algorithms, Matlab uses trainLM (Levenberg-Marquardt) by default and Encog uses ResilientBackpropagation by default. Encog doesn’t have trainLM.

-It seems that Encog gives back the correct value of the estimation. For example, if directions (+1,-1) were learnt, Encog gives +1 or -1 (or very close to that) as a prediction. Matlab gives the average, like 0.04 and -0.04.

**-Encog's default 3-layer Backprop network has extra biases and neurons. It differs from Matlab. More specifically:**

A.

The **newFF() in Matlab network is:**

1. The input is not a layer; no activation function, no bias. It **has only 2 layers (the middle and the output)**

2. The middle layer has a bias, and tansig transfer function

3. The output is a layer; having a bias (we checked); but it has Linear activation (in the default case); in the Matlab book, there are examples with tansig output layers too

B.

The default **FeedForwardPattern().Generate() in Encog gives:**

1. **All 3 layers have TanH() activation.** We found it weird. Only the middle or maybe the output layer should have an activation function.

2. the last layer has no bias, but the first and second have

3. the middle layer has 2 biases (the biasWeight is 2-dimensional) in the 2-neuron case

4. the biasWeights are initialized randomly. Correct; that is expected.

**Conclusions after studying the performance numbers:**

– the Buy&Hold gives 430% profit in 23 years. That is about 7.96% CAGR. It multiplied your initial 1 dollar by 5 in 23 years.

**– the deterministic MR is a disaster. It worked only from 1998.
– the deterministic FT gives 18% CAGR (TR = 3,657%), but the maximum DD of -90% makes this method a disaster (it lost -90% from 1998)
– the 2bins naive learner is quite good: 29.5% CAGR (TR = 30,000%). It multiplied your initial 1 dollar by 300 in 23 years. It is also a learning algorithm: adaptive. So, even without the power of NNs (neural networks), it is worth considering.
– the 4bins naive learner (TR = 20,000%) was worse than the 2bins naive learner (TR = 30,000%).
– the NN based machine learning algorithm (TR = 45,000%) could beat the naive learner. That is a very good message for us. So, it is worth using the NN.
**

**– the non-normalized, non-binned Encog NN is a disaster, because the input values are very small: a 0.1% change of the RUT induces an input of 0.001. These lead to numerical errors during training.
– the 2bins NN achieved on average about TR = 8,246% (it multiplied your initial 1 dollar by about 83). That is about 20.94% CAGR.
– however, the properly normalized and normalization-boosted Encog NN shines. It was even better than the Matlab NN. Note that the average TR = 45,000% (it multiplied your initial 1 dollar by 450). That is about 32% CAGR. The directional accuracy is 56.5%. Over such a long period (23 years), this is our best prediction; the best result so far.
– Compare it to the Buy&Hold approach. Buy&Hold multiplied the initial capital by 5 in 23 years. The Encog NN multiplied it by 450. That is 90 times more.
**

– To refresh our minds: these charts show again why it is worth using an adaptive learning algorithm instead of a static one (like DV2, RSI2, overbought signals, etc.). **The deterministic algorithms (daily MR, FT) cannot cope with the change in the world.** Any fixed, static algorithm (daily FT) that was a winner before 1998 had a 12-year losing period after 1998. Something happened in 1998. We don't know what, but the financial world changed. The daily follow-through was substituted by daily mean reversion. A rigid, static strategy would fail. In contrast, **any learning algorithm (naive, NN, SVM, etc.) could adapt very well to the change in the world** and could be a winner strategy even after 1998. That is why we have to continue using machine learning: to adapt to the change of the world.

– Note also that this input (currDayChange) per se is not sufficient for us. It shows 32% CAGR, which is very high, but note that too large a portion of the gains comes from 2008, a black swan year. This is the year in which, because of the financial turmoil, any daily MR strategy beat the market. Because of this, we wouldn't use this input, this strategy, alone. We would like to combine it with other inputs to have a more reliable (not more profitable) strategy.

**Most important conclusion of this post:**

**Encog and Matlab behaved about the same.** When one had predictive power, the other did too, and the magnitude was about the same (in the 2bins case, Matlab gave 8,842% TR on average, Encog gave 8,246% TR). Therefore, we can trust the prediction power of the Encog NN and its training algorithm, and **the whole Encog framework is reliable. We will use Encog in the future. One warning only: be very careful and normalize the data.** We learnt that small-range input is unacceptable and Encog doesn't do automatic normalization.

********************

Happy new year. This is our first post in 2011.

A serious obstacle we faced during our research last year was the backtesting time. All the code we wrote was in Matlab. Consider that backtesting only 10 years requires approx. 10 minutes in Matlab with only 1 random sample. Note that instead of using 1 random forecast, we have to use about 5 or 10 to have a reliable projection. So every single backtest run takes about 100 minutes. When we are interested in how a parameter change affects the performance, we want to test 10 different parameter values: 10 different backtests. That can take 10 times 100 minutes. 1000 minutes is 16.6 hours for fine-tuning one parameter. It took me days and nights to run the backtests. And usually I couldn't backtest 20 years, only 5 years. Not to mention that many times I had to wait 1 or 2 days to evaluate the result of a new idea. That is simply annoying.

I stumbled upon a promising Machine Learning framework called Encog that has native Java and C# implementations.

here

**This post is about comparing the speed of Matlab and Encog**.

I believe Matlab is one of the best platforms for mathematicians. Compared to other packages, Matlab is quite fast (see here). At least it is 2 times faster than the Matlab-compatible open source Octave.

Matlab speaks the language of mathematicians. However, it is an interpreted language, and it is inherently slow compared to more native computer languages like C++, C# or Java.

The advantages of Matlab:

– very well written. It is a paid product, so the authors have some responsibility to keep their code fast and bug-free.

– it contains many useful math functions, ready to use, tested, reliable

– it has in-the-box chart making capabilities

– the final source code is very concise, easier to read

Disadvantages of Matlab:

– a significant drawback of using Matlab is the execution speed.

– debugging (watching the variables in real time) is not very sophisticated

However, I am a natural-born programmer, so I shouldn't be concerned about using C# instead of the Matlab script language, even if expressing the same thought takes many more lines of source code in C# than in Matlab.

Here are the times required for a backtest that used only 1 random sample per day, backtesting from 1987 to 2011 (about 23 years).

Lookback period is 200 days. The MaxEpoch is 5 in Matlab and 20 in C#.

Time measurements (for the last 23 years, which equals 5,700 forecasts):

– **Matlab: 35 minutes = 2100 seconds**

– **Encog (single thread): 12 seconds** (in theory, the training algorithm is multithreaded)

– **Encog (days run parallel on 4 core PC): 6 seconds**

Note that the Matlab version calculates CAGR and TR too, and does some minor extra calculations (outlier elimination) that are not implemented in the C# version.

The C# version is really very simple. However, I don’t think this seriously distorts the result.

**The timing measurements say that Matlab neural network training runs about 200 times slower.
Let's just assume a 100x increase in case the Matlab and Encog versions calculated exactly the same things.** It would mean that a backtest that previously took 1 day in Matlab would run in only 15 minutes in C#. What a relief.

Note the possibility of using Encog in the Rackspace cloud.

here

The RackSpace 'small machine' costs only 1 cent per hour to rent. It is very cheap to run long backtests in the cloud. It is definitely worth considering in the future.

**Conclusion:
In speed, Encog easily beats Matlab to death. Encog is faster by about 100x-200x.** But this post is only about speed. The neural training efficiency and correctness is another issue. We haven't yet checked the CAGR, TR or other financial performance of the Encog predictions. We have a feeling that Matlab is written much better than the open source Encog, so Matlab may be more correct in training the neural network. We have to make some tests to ensure that Encog makes the same good forecasts as Matlab did.

********************

We didn't aim to find the most profitable ANN (Artificial Neural Network) approach in the first year, but to learn the nasty little tricks that help in understanding how the ANN black box works. Trading in real life based on the decisions of a black box is emotionally very hard. Especially after a serious drawdown, it is easy to deem the predictions of the ANN silly and stop trading the strategy. After one year of scrutinizing different ANN trainings, reading academic articles and books, tweaking parameters, and visualizing inputs and outputs, we no longer believe that the ANN is a black box. Of course, its decisions are not disclosed in terms of rules, but we got familiar with them to the point that we can even predict their predictions.

Let's roughly reminisce about the history of this blog. We started with the simplest input of all, using only the previous day's return for next-day prediction. After that, we tried 2 Moving Average inputs (short term, long term). Later we tried the VIX as an input. That didn't work. We later changed the daily prediction to weekly, hoping the lower volatility would be better for us, but that didn't work either. From the second half of the year, we started to investigate the day of the week as an input. This seemed to work. Later we tried the current day's return as input, and at last we moved on to studying the effect of the Euro (FXE) daily return on the next day's index return.

This post is also unusual in that it is the culmination of the last 3 main topics. **The different inputs: day of the week, current day return and Euro daily return are finally combined.** So, we can say this post is not only a chronological milestone, but a milestone in content too.

As a preliminary to this article, you may visit another post here in which we combined the dayOfTheWeek and currDayChange inputs for the last 12 years. As you can see from that chart (repeated here):

That combination was successful (130%-400% TR) in the 12-year backtest, but it treaded water in the last 4 years (0% TR).

As the FXE started in December 2005, our prediction in this post starts in September 2006, because we use a 200-day lookback period for ANN training. So, **we cover about 4 years in these backtests**.

In these tests we used:

**2 bins for the discretization of the currDayChange and eurChange inputs (to keep it simple)
1-dimensional dayOfTheWeek input**

maxEpoch=100,

nNeurons=2 for the single ANN cases and nNeurons=3 for the combined ANN cases,

lookbackperiod=200

target outlier limit = 4%

1.

**ANN(dayOfTheWeek)**

In the previous post we showed how this input worked successfully with 10% CAGR in the last 12 years, but stopped working (0% CAGR) in the last 4 years.

An example Total Return chart:

We emphasize that it is only an 'example' chart. The ANN learning is random (if it is not random, it is wrong or sub-optimal). Repeated backtests gave slightly different results. This is why we should average.

In another backtest, it **gave 51.34% D_stat, 2.16% CAGR and -7.86% TR.** We conclude it didn't work in the last 4 years. Probably, this is due to the 2008 market storms.

2.

**ANN(currDayChange)**

**This is the best standalone strategy. Its result is 52.86% D_stat, 28.61% CAGR, 119.88% TR** in the last 4 years. An example performance chart is presented here:

You may visit our previous post that shows its performance with different discretization levels for the last 12 years. However, note that those performances are for the 12 years. For example, in the 2 bins case there you can find **13.8% CAGR for the last 12 years, but here we show 28.61% CAGR for the last 4-year part. This shows how incredibly lucky this approach was now, because the 2008 crash** and the volatility increase improved its return.

3.

**ANN(eurChange)**

Its result is **52.29% D_stat, 7.84% CAGR, 8.04% TR** in the last 4 years. An example performance chart is presented here:

See our first study about it here and second study

Our conclusion is that there were sub-periods when it worked, but overall in the 4 years there was a transition that turned the correlation upside down. This affected the performance negatively, to the point that overall it was not profitable.

4.

**ANN (use all 3 inputs)**

**A blind combination** of the 3 inputs gave the following results on average: **53.1% D_stat, 12.94% CAGR, 37.2% TR.** (see the results of the different experiments in the table at the end of the post).

An example performance chart:

Note how **the D_stat really improved; it is better than that of any of the standalone ANNs.** However, its TR is not better than the TR of the best standalone ANN. The reason is that the best standalone ANN was very successful only in about 1 year, the 2008 period, and it was not too successful in the other 3 years. **The combined ANN is slightly better overall (in terms of D_stat) over the 4 years, but it is not especially good in that 2008 period, when most of the returns of the best standalone ANN were earned.** So, **even if it predicts better overall, the profit is not increased.**

From another point of view, one can say that the **combined ANN is a 'combination' of the very good currDayChange ANN and the two other mediocre ANNs.** So, it is natural to **expect that the very good performance of the best ANN will be muted and the very bad performance of the other two ANNs will be lifted up.** So, the combined performance should be between the best and the worst performances. **The advantage of the combination is that the strategy is more robust; it is averagely good in every period.**

The 12.94% CAGR is quite decent considering how robust it is. In the long term, we should be happy to play it in real life without hesitation. However, we will show that we can do even better.

5.

**3ANN, trade if all agree, ensemble: 1**

The idea is that **we trade only if all the standalone ANNs agree on the outcome. If all three predict +1, we play long; if all predict -1, we play short; and we are in cash otherwise.** For measuring its performance we introduce other metrics (see the table at the end). The WinFrequency is 19.75%, which means we are right on 19.75% of the days. The LossFrequency is 16.00%, meaning we are wrong on 16% of the days. On 64.25% of the days our profit is zero. Since there is not a single day in the 4-year history of the RUT when its daily change was zero, it essentially means **we decided to be in cash 64.25% of the time.** Note how the **D_stat increased to a whopping 55.24%** (= 19.75 / (19.75 + 16)). The numbers show **12.18% CAGR and 47.26% TR.** However, what we like about this strategy compared to the other strategies is the **low drawdown.** See the performance chart:

6.

**3ANN, trade if all agree, ensemble: 11**

We simply repeated the previous experiment with 11 ensemble members instead of 1. Note, however, that conceptually it is not a simple ensemble decision any more. The collective decision is not what you would expect at first sight.

We use the next rule for the ensemble forecast:

`ensembleForecast = sum(sign(memberForecasts)) `

The sign of zero is zero, so those ensemble members do not participate. The ensembleForecast result is a positive or negative integer (not a float): the sum of the signs of the member forecasts.

Later, the Trader subsystem uses the ‘ensembleForecast’: if it is positive, it plays long; if it is negative, it plays short. So, this 11-member ensembling result can be positive even if only 1 of the 11 members agrees that all 3 standalone ANNs have an Up vote.
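A sketch of this aggregation rule (our own illustration; `member_forecasts` holds each member's all-agree vote of +1, -1 or 0):

```python
import numpy as np

def ensemble_forecast(member_forecasts):
    """Sum of the signs of the member votes; a 0 vote (the member's three
    standalone ANNs disagreed) simply doesn't participate."""
    return int(np.sum(np.sign(member_forecasts)))

votes = [1, 0, 0, -1, 0, 1, 1, 0, 0, 0, 0]  # a hypothetical 11-member round
agg = ensemble_forecast(votes)              # 3 Up votes vs. 1 Down vote -> 2
direction = int(np.sign(agg))               # +1 -> the Trader plays long
```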

Suppose that, because of randomness, in the 1 member ensemble case (Case 5), instead of the 3 standalone ANNs giving 3 Up votes, there were only 2 Up votes (by pure chance). In that case the 1 member version gave a Cash signal. However, this 11 member version will likely give an Up signal (if Up is the true case), because we repeat the same process 11 times.

**When we repeat the test 11 times, the chances are good that we get 3 Up votes more frequently. ** You can see this effect in the performance table at the end. **With the 11 member ensemble, the zeroFreq (the time we are in Cash) was reduced from about 65% to 55% because of this. So, this strategy is in cash 55% of the time, nearly every second day. It trades only when it has high conviction about the direction. **We really like this feature. A consequence is that we take less risk, because we are less exposed to the market.

So, we can interpret increasing the members to 11 not as a way to decrease randomness (as usual with other ANN ensembles), but as a way to slightly lessen the required confidence and allow more days to participate in the forecast. Instead of trading 35% of the days as in the 1 member case, we trade 45% of the days. This is important to notice about this ensembling.

Note how the **D_stat increased to 55.96%** (= 24.80 / (24.80 + 19.51)). The performance numbers show **19.03% CAGR and 84.79% TR. **However, **what we like about this strategy is the very low drawdown and the consistency. Higher highs and higher lows.** It is very pretty. With almost 20% CAGR, you cannot wish for more.

Finally, see the performance measurements table for all 6 strategies we talked about.

Discussion:

– Do we believe that in these 4 years the currDayChange input was especially lucky and its exceptional performance will not continue in the future? Yes. Volatility is likely to decrease, and that will dampen its performance.

– Do we believe that these 4 years were unlucky for the dayOfTheWeek input, but its shine will return? Yes. It didn’t work because of the 2008 crash. (It worked in the previous 12 years.)

– Do we believe that these 4 years were unlucky for the EUR input, but it will perform better in the future? Yes. These 4 years contained a 180 degree transition period. That was unlucky. We expect the correlation between the EUR and the US market to last longer than this (or the turn from positive to negative correlation will not be as sharp as it was now), so even this approach will be profitable in the future.

– As you can see,** for the future we are pessimistic about our so far successful currDayChange ANN, and optimistic about the non-performing dayOfTheWeek and Euro ANNs. But to be honest, it is not too important what we believe. The market will decide. If the approach works (there is negative or positive correlation between any of those 3 inputs and the target), the ANN will adapt to it. **Even if there is no correlation, and we have no edge in some ANNs, it is enough that we have an edge in only one of the three. If any of the 3 approaches works, the combination of them will bring out its strength in the long run.

– The **56% directional accuracy, 19% CAGR and 85% TR during the last 4 years is currently our best and most consistent result. It is in the market only 45% of the days. Playing it with Ultra ETFs would be ideal, because we don’t hold them for the long term. So, entertain the thought of doubling that return to 38% per annum (not to mention the future improvements we plan).** Pretty.

– However, **the best thing about it is not the profitability. By far the best thing about the ‘trade if all agree’ combination is that it gives a very smooth return without significant drawdowns. Higher highs, higher lows. In real life, it is not emotionally difficult to play this strategy.**

There are of course many varieties and many improvements we can do in the future (next year):

– Tweaking the number of bins of the discretization (the one presented here is only the 2 bins discretization. Mainly for illustration.)

– Tweaking the weights of the different ANN outputs based on their last 50(?) days’ performance. Recently successful ANNs should get higher weights. This is essentially a linear weighted combination of the ANNs at the meta level.

– Introducing a Meta-ANN as an overmind. This would make possible a nonlinear combination of the standalone ANNs.
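The weighted-combination idea in the second bullet could be sketched like this; the names, the 50 day window and the zero floor for losing ANNs are our assumptions, not the authors' design:

```python
import numpy as np

def weighted_vote(forecasts, trailing_pnl):
    """Combine per-ANN direction votes (+1/-1) at the meta level, weighting
    each ANN by its trailing (e.g. 50-day) P&L; ANNs with negative trailing
    P&L get weight 0."""
    w = np.clip(np.asarray(trailing_pnl, dtype=float), 0.0, None)
    if w.sum() == 0.0:
        w = np.ones_like(w)        # degenerate case: fall back to a plain vote
    return int(np.sign(np.dot(w, forecasts)))
```

With this rule, a single recently hot ANN can outvote two cold ones, which is exactly the ‘recently successful ANNs get higher weights’ behaviour described above.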


The D_stat/CAGR/TR numbers were 51.43%/11.97%/126%TR.

Therefore **we were surprised when we backtested it for the last 4.5 years** and the tests gave the following results for 2 experiments:

D_stat: 51.34%, CAGR: 5.86%, TR: 0.29%

D_stat: 51.34%, CAGR: 0.74%, TR: -17.86%

**Essentially 0% TR. **What happened?

Note that the FXE (EUR) ETF only appeared on the stock exchange in December 2005. Using a 200 day lookback window, we can start the backtest only in September 2006. That means we can roughly backtest only the last 4.5 years.

This is the reason why we backtested the dayOfTheWeek input for only the last 4.5 years.

At first, we suspected we had made a mistake somewhere, so we repeated the tests for the 12 years. We ran 2 experiments (maxEpoch=200, ensembleMembers = 1) and got this result:

D_stat: 51.87%, CAGR: 9.12%, TR: 86.46%

D_stat: 52.17%, CAGR: 13.27%, TR: 189.96%

So,** the 12 years backtest is indeed profitable. Great.** (Neglect the variance for now; it is because of using only 1 member.)

And let’s see a **chart for the last 4.5 years** backtest:

and **for the last 12 years** backtest:

– We can be relieved that it is not a bug in the program. **It is clearly visible in the 12 years backtest chart that the last 4.5 years (of the 12) were not profitable.** We could expect about 0-40% TR, so our first 2 experiments’ results are validated.

– Unfortunately, it means that **the dayOfTheWeek strategy didn’t work in the last 4.5 years. **That is a good warning to us. **Even if we have a good, sensible strategy that is profitable over 12 years, we can have a 5 year period when it goes nowhere. Emotionally, it is very difficult to play.** After 5 years of going nowhere, we are very tempted to stop the strategy. **Probably, we would have stopped the strategy even earlier.**


The same in chart form:

Notes:

– The dayOfTheWeek 1 dimensional case is a discretized input (1 dim, 5 bins)

– **Previously, we generally used maxEpoch=5, sometimes maxEpoch=10. That was quite bad. It doesn’t modify the average of the prediction (our previous results), but it introduced unnecessary randomness we wanted to avoid.**

– **Increasing maxEpoch is a good way of attacking randomness**

– **Even if we set maxEpoch to 400, most of the time the learning exits much earlier (after about 10 epochs). The exit message is ‘Minimum gradient reached’.**

– For the combined (2 bins currDayChange, 2 bins EurChange) case, **it stops after about 7 epochs in 90% of the cases** with the message ‘Minimum gradient reached’.

– **Instead of using ensembles with 11 members, it may be less time consuming and better to increase maxEpoch to 400 and use an ensemble with only 5 members. If we increase maxEpoch, the computation takes longer only when needed**: after the ANN training has converged, it stops performing unnecessary epochs.

– **There is no point increasing maxEpoch from 400 to 1000. **Although none of the forecasted numbers agree beyond 8 digits, the std is the same.

– **use 200 or 400 as maxEpoch for 1 dimensional input**

– **use 200, if speed is an issue **(you want backtest quicker)

– **The optimal maxEpoch needs to be defined a priori. Different target functions require different maxEpoch. We expect a non-discrete, continuous case to take longer to learn (so it requires a greater maxEpoch).**


**This is the continuation of a previous post** (see here), in which we studied the discretization of the **FXE input. Our previous backtest was performed without ensembles**, so our backtests were very volatile.

We have only **1 input now, the FXE current day return**. We try to discretize the input into equal sample size bins of varying counts. By presenting the data in a nice, simple way, we increase the chance that the ANN can learn it. **We backtest 6 years now.**
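Equal-sample-size binning of a return series can be done with quantile edges; a small sketch of the idea (synthetic data stands in for FXE returns, and the function name is ours):

```python
import numpy as np

def discretize_equal_freq(x, n_bins):
    """Map a continuous series into n_bins bins of (approximately) equal
    sample size; returns bin ids 0..n_bins-1."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, x, side="right")

rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, 200)   # stand-in for 200 days of FXE daily returns
bins = discretize_equal_freq(returns, 6)
counts = np.bincount(bins)           # each bin holds ~200/6 ≈ 33 samples
```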

The parameters of the experiments:

`nNeurons = 2;`

`nEpoch = 10;`

`lookbackWindowSize = 200;`

`outlierFixLimit = 0.04;`

`nEnsembleMembers = [11, 0, 0];`

Note that **we increased the nEpoch from 5 to 10**, because a previous test showed that it is advantageous.

We made 8 different random experiments.

the non-discretized experiments:

Note the STD here. The 2bins case is very stable. Almost all the experiments give the same result. This is the beauty of discretization. We made the objective function simple. The ANN can learn it easily and consistently. Compare that STD with the STD of the continuous case.

The performance numbers in one table:

And we make a plot of these performance numbers.

We present a plot of TR for the continuous case. The ANN has to learn the positive and negative correlations between the FXE and the RUT over the years, as discussed in the previous post.

Conclusions:

– **This is a surprising result. This test revealed that it is not advantageous to discretize the input.** **This is the opposite of the result we got for the CurrDayChange discretization case. For that input, even the simplest discretization (2 bins) was better than the continuous version.**

– **The highest D_stat (53.09%) and the highest TR (71.43%) are in the continuous case** (however, note that the STD increases as we increase the bins, and it shows high randomness). This shows that **there is some important information that we lose if we discretize**, and the ANN could have used that information. At first, we think that **there may be some outliers in the FXE return input and those outliers are the important information that we erased.** However, this needs to be proved.

– **compare this performance to the performance of the 1 member ensemble case:**

In this **11 member ensemble version, the performance measures are by and large better than in the 1 member ensemble case.** For example, in **the continuous version, the CAGR improved from 13.7% to 20.75% and the TR improved from 39% to 71%. (The backtest is performed for 6 years.)**

– **Personally, I would love to use the 6 bins version** (maybe with more nNeurons). Intuitively it feels right to discretize today’s change as very oversold, oversold, slightly oversold, slightly overbought, overbought, very overbought. I like these 6 categories. However, **the 6 bins case gave the worst result.**

– The 2 bins case has the lowest randomness. If stability (for backtests) and a low nEnsembleMembers are important, use the 2 bins case. And **the 2 bins case gave a decent result too. If I don’t count the continuous case (its result may be some aberration, outliers, etc.), the 2 bins case gives the best result. I guess we will use that in the future. **


We have

The parameters of the experiments:

`nNeurons = 2;`

`nEpoch = 5;`

`lookbackWindowSize = 200;`

`outlierFixLimit = 0.04;`

`nEnsembleMembers = [10, 0, 0];`

We made 8 different random experiments.

the non-discretized experiments:

Note the STD here. The 2bins case is very stable. Almost all the experiments give the same result. This is the beauty of discretization. We made the objective function simple. The ANN can learn it easily and consistently. Compare that STD with the STD of the continuous case.

The performance numbers in one table:

And we make a plot of these performance numbers.

Conclusions:

– **The 10.84% CAGR of the continuous case can be nearly doubled to 20.48% CAGR in the 10 bins case**. So, this test revealed that **it is advantageous to discretize the input.** We should decide whether we want to use the 2 bins, 4 bins, 6 bins or 10 bins version.

– The **highest D_stat (52.82%) and the highest TR (536%) are in the 10 bins case** (however, note that the STD increases as we increase the bins, which shows high randomness).

– **Even the most primitive discretization, the 2 bins case, is better than the continuous, non-discretized input.**

– **compare this performance to the performance of the 1 member ensemble case:**

In this **10 member ensemble version, all the performance measures are better than in the 1 member ensemble case.** For example, **in the 10 bins version, the CAGR improved from 12.3% to 20.48% and the TR improved from 174% to 536%. (The backtest is performed for 12 years.)**

– **Personally, I would love to use the 6 bins version** (maybe with more nNeurons). Intuitively it feels right to discretize the today change as very oversold, oversold, slightly oversold, slightly overbought, overbought, very overbought. I like these 6 categories. However, the tests show unusually high STD in the 6 bins case. So, I hesitate.

– On the other hand, it can be perfectly sensible that the ANN is better than a human observer. So instead of categorizing the input into 6 bins (which is reasonable to a human),** it may be best for the machine to categorize it into 10 bins.** Machines are omnipotent. We should rely on them. So this test showed me that **for the ANN the 10 bins version is the best. We should use this one in the future (in the 1 dimensional discretization case)**. Note also that we have only 200 inputs. So,** in the case of 10 bins, there are 20 samples in each bin** (it seems this is the minimum we need). No wonder that increasing the bins to 20 doesn’t work. After a while there are too many bins, too much randomness. This has some consequences for the future. If we have 2 input dimensions and we discretize both input dimensions to 10 bins per dimension, that will result in 100 bins **for the 2 dimensional space. Maybe** that will be too much. **Better to stick to 6 bins** in the 1 dim case, **which gives 36 bins on the 2 dim surface.**

– The 2 bins case has the lowest randomness. If that is important: stability (for backtest), low nEnsembleMembers, use the 2 bins case.

– With this backtest, **we again set our record: this is our best strategy so far: 20% CAGR, 536% TR in 12 years. And the only input is the RUT index current day change. This is basically a strategy that learns whether the market is in MR (mean reversion) or FT (follow through) mode and acts accordingly. It should be better than the simple DV2, DVB and other fixed rules.**
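The samples-per-bin arithmetic from the bins discussion above can be written out directly (200 is the lookback window used throughout these tests; the function name is ours):

```python
lookback = 200  # training samples per window, as in these experiments

def samples_per_bin(bins_per_dim, dims):
    """Average number of training samples landing in each bin when the
    input space is split into bins_per_dim bins along each of dims axes."""
    return lookback / bins_per_dim ** dims

# 1-D, 10 bins -> 20 samples per bin (the stated minimum)
# 1-D, 20 bins -> 10 samples per bin: too few, as observed
# 2-D, 10 bins per dim (100 bins) -> only 2 samples per bin
# 2-D,  6 bins per dim (36 bins)  -> ~5.6 samples per bin
```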


In spite of the

1.

Let’s suppose that our random monkey (with a uniform random distribution) would predict the Up direction 50% of the time and the Down direction 50% of the time. Nothing witty here, we only predict blindly with pseudorandom numbers. That would be random, and our directional accuracy (D_stat) would be no better than 50%. The reason is that Up days are more frequent than 50%. **In the long term, Up days occur about 56% of the time. So a random predictor predicting Up days with 50% frequency is worse than the naive always-Up baseline.**

**Do you see now how difficult it is to reach even 50% D_stat? It is very rarely considered. One can easily think that a random predictor (like the toss of a coin) is a fair baseline at 50% accuracy, but it is a misleadingly low bar.**

2.

**We can achieve a 50% D_stat easily with this blind random monkey predictor if we make it predict Up days with 56% probability.** Because real life also has 56% Up days, our predictor and real life match more often, and we achieve (slightly above) 50% D_stat.

3.

**One can say that we can easily achieve even 56% D_stat. How? Easy: the predictor should forecast every day as an Up day, deterministically. ** This will surely achieve 56% D_stat, but **it is fundamentally a buy&hold approach**, and we know how bad its performance sometimes is. (The lost decade after 2000.)
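Working out the expectation (a small sketch; the P(Up) = 0.56 base rate is taken from the text, and independence of the predictor and the market is assumed):

```python
p_up = 0.56   # long-term frequency of Up days, as stated above

def expected_dstat(q_up):
    """Expected directional accuracy of a market-independent predictor
    that calls Up with probability q_up."""
    return q_up * p_up + (1 - q_up) * (1 - p_up)

coin    = expected_dstat(0.5)    # 0.50 -- the coin-toss monkey
matched = expected_dstat(0.56)   # ~0.507 -- predictor matched to the base rate
always  = expected_dstat(1.0)    # 0.56 -- deterministic always-Up (buy & hold)
```

The always-Up predictor tops the blind field at 56%, which is exactly the buy&hold baseline discussed above.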

And **this exactly reveals why the D_stat is only partly a good performance measurement.** We can easily have a 56% D_stat, but that is not the aim. **The aim is not to be right on the direction on most of the days. The aim is to be right on those days when we expect high Up or Down movements. Catching only those high moving days, we can achieve 5 times more CAGR than buy&hold, while our D_stat can be even less than 40%. Like George Soros, we don’t have to be right many times; but when we are right, we should gain hugely, and when we are wrong, we should lose little.**

I hope this post reveals why we should always consider the CAGR and D_stat measurements together when we judge that one strategy is better than another.

– **A high D_stat alone doesn’t mean it is a profitable strategy (buy&hold has 56% D_stat) **

– **A high CAGR alone doesn’t mean that it is a consistently good strategy. The Varadi DV2 has a huge CAGR in 2008 (90%?), but was mediocre in other years.** A high CAGR means the regime was favourable to the strategy, but **it is only a lucky period, a subperiod, and there is no guarantee that the lucky period will last.** (MR stopped working in 2009, 2010.)

4.

We were curious what the D_stat of **the successful DV2 (DVB) strategy from Varadi** is. **In the last 5 years backtest, the CAGR was 29.57%, while the D_stat is 51.88%**. We were a little bit disappointed about this D_stat. **So, the DV2 is not that good after all.** **Most of its CAGR gain was contributed only in the 2008 period. Without the 2008 period, we guess its CAGR would be about 18%.**

**In spite of what we thought about our petty D_stat performance, the DV2’s 51.88% D_stat tells us that when our backtest shows 52% D_stat and 15% CAGR, we are quite good.** And we shouldn’t forget that our ANN is adaptive, which cannot be said of most of those rule based strategies like DV2. So even if we achieve exactly the same performance numbers as a rule based strategy, I would still sleep better with an adaptive approach.


1.

**Varadi achieved 49% CAGR and 56% D_stat in 2010!** Quite remarkable. But **with fixed rules and in hindsight, it is easy.**

2.

Varadi is right with his strategy in 2010, but **we show that over the last 5 years, the relationship was exactly the opposite. So, in the long term, we cannot use a fixed rule.** There are periods (2006) when there was negative correlation between the FXE and the RUT, but in 2010 there was positive correlation. Luckily, we use an adaptive approach and we train the ANN only on the last 200 days of data.

Let’s start from **2005-12-12. That was the FXE inception date.**

If we discretize the input into 2 bins, we have the following function to approximate.

**There is a negative correlation between the FXE and nextDayRUTChange.** If FXE is down, next day the RUT is likely up; if FXE is up, next day the RUT is likely down.

However, note the same chart **using only the last 200 days from today** (2010-12-01)

Varadi was right. **There is a high positive correlation between current day FXE change and next day RUT change.** If FXE is down, next day the RUT is down; if FXE is up, next day the RUT is up.

If we discretize the FXE input change into 4, 6, 10, 20 equal sized bins (starting from 2005-12-12) we get the following charts:

3.

Let’s see a TR (Total Return) chart of the continuous case from 2006 to 2010. (5 years)

At the start (**in 2006), the inverse correlation prevailed and the ANN profited from it. Later, in 2008, it changed to a positive correlation. In the transition period, when the regime changed, it is natural that the ANN is very bad at prediction**. It uses the last 200 days for learning, but those samples are from the previous regime. However, **once the new regime’s behaviour is learned, the ANN shines again.** This is typical ANN behaviour. The regimes should prevail long enough that we can leverage the last 200 days’ info. Of course, the 200 days is a parameter, and that parameter was found by trial and error during the ANN development phase.
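The rolling 200 day retraining described above can be sketched with a trivial stand-in learner (our own illustration with synthetic data; the real system trains an ANN in each window):

```python
import numpy as np

def walk_forward_predict(x, y, lookback=200):
    """Each day t, 'train' only on the trailing `lookback` samples, then
    forecast day t. The learner here is a trivial 2-bin stand-in for the ANN:
    the mean next-day target conditioned on the sign of today's input.
    Because only the trailing window is seen, a regime flip (like the FXE/RUT
    correlation turning positive) is forgotten after roughly `lookback` days."""
    preds = np.zeros(len(x))
    for t in range(lookback, len(x)):
        xw, yw = x[t - lookback:t], y[t - lookback:t]
        mask = (xw > 0) if x[t] > 0 else (xw <= 0)
        preds[t] = np.sign(yw[mask].mean()) if mask.any() else 0.0
    return preds

# Synthetic check: with a perfectly inverse x -> next-day relation,
# the windowed learner recovers the inverse rule.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
p = walk_forward_predict(x, -x, lookback=100)
```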

**D_stat: 53.44%, TR: 42%, CAGR: 15%**,

Note that **in the last 250 days (the last 1 year), it achieved about a 70% return. We even beat Varadi at his own game (with his 49% CAGR).**

And I bet that in that period we also achieved the 56% D_stat Varadi published.

4.

We ran 8 different random experiments. (No ensemble, standalone FF predictors.)

The performance of them can be seen for the 2, 4, 6, 10, 20 bins case as well as the continuous case:

5.

The performance plots are here

Notes:

– It suggests that **the 6 bins version is best for D_stat, but no discretization is best for CAGR**. However, **the difference is not significant.**

– As before, **if ANN randomness is a concern, use the 2 bins version.** It is very stable, **even if it doesn’t give the best prediction**, and backtests can be reproduced anytime.

– We were pleasantly **surprised how well the non-discretized (continuous) case performed. It may be that we won’t use discretization in the final predictor.** Note also that **the 6 bins case STD (high) is almost the same as the continuous STD. We observed the same in the CurrDayRutChange version. If that is true, there is no point in discretizing to 6 bins. (However, there is a point to 2 bins, if we require stability.)** We will get back to this after we see the ensemble versions of these backtests.

– The 10 bins case performed very poorly in all 8 experiments! It is scary, but we have no explanation yet.

– We also conclude that **the FXE input is a little bit better than the RutCurrDayChange input in prediction power. **

See this RUT currDayChange input chart for comparison:

For example, for the **D_stat in the 2 bins case, the CurrDayRutChange version gives 51.7% (in a 12 years backtest) while the FXE input gives 52.28% (in 5 years).** The other measurements (CAGR, TR) are not really comparable, because we performed the FXE test for 5 years instead of 12, and 2008 was a very bad year for almost any adaptive strategy that learns from the past.

– We had about 52% D_stat in the currDayRutChange input case, and about 52.5% D_stat in the FXE case. However, **it would be foolish to expect that after we combine the two inputs, we will have an additive 54% D_stat. The reason is that the FXE and currDayChange have some correlation. They are likely to move in tandem.** So, when we aggregate them into a combined input, sometimes we don’t add extra info to the ANN.

– It is the same as with deterministic rules. Varadi in the aforementioned link achieved 49% CAGR with the deterministic FXE (in hindsight). He also achieved 30% CAGR with the MR player DV2 strategy in 2010.

See link here

However, it doesn’t mean that if we deterministically combine his FXE and DV2 strategy, we will have 49%+30%=79% CAGR.

– **According to Varadi’s experiment, FXE was better than DV2 in prediction (46% vs. 30% CAGR). Based on our ANN experiments, we have the same opinion: the FXE input is a little bit better than the RutCurrDayChange input for prediction power.**

– Note that our **ANN predictor** is not a rule based deterministic strategy. Rules are easily formed in hindsight. Ours is adaptive: **if things change, it requires some time (100-200 days) to adapt to the new environment, but it adapts after a while.** Therefore, my realistic maximum expectation for our FXE ANN strategy was half of his results: 25% CAGR and 53% D_stat. But as you can see, with a 70% CAGR we even beat his numbers in 2010.


This is the function if we discretize it to 4 bins (but using the same 10 bins visualization as before)

The parameters of the experiments:

`nNeurons = 2;`

`nEpoch = 5;`

`lookbackWindowSize = 200;`

`outlierFixLimit = 0.04;`

`nEnsembleMembers = [1, 0, 0];`

We made 8 different random experiments per bin version:

the non-discretized experiment:

**Note the STD here. The 2bins case is very stable. Almost all the experiments give the same result. This is the beauty of discretization. We made the objective function simple. The ANN can learn it easily and consistently. Compare that STD with the STD of the continuous case.**

And we make a plot of these performance numbers.

Conclusions:

– **The 8.7% CAGR of the continuous case can be nearly doubled to 15.38% CAGR in the 4 bins case. So, this test revealed that it is advantageous to discretize the input. We should decide whether we want to use the 2 bins, 4 bins or 6 bins version.**

– **Our most important measure is the D_stat, and the highest D_stat (52.24%) is in the 6 bins case. (However, note the high 0.58% STD, which shows high randomness.)**

– Personally, I would like to use the 6 bins version (maybe with more nNeurons). Intuitively it feels right to discretize the today change as very oversold, oversold, slightly oversold, slightly overbought, overbought, very overbought. I like these 6 categories. However, the tests show unusually high STD in the 6 bins case. So, I hesitate.

– **The 2 bins case has the lowest randomness.** If stability and a low nEnsembleMembers are important, use the 2 bins case. It is very stable even with only 1 ensemble member.

– Note that the volatility (= randomness) can be attacked by increasing the members of the ensemble. So our decision should be:

**if we have 1 ensembleMembers, use 2 bins.
if we have 2-5 ensembleMembers, use 4 bins
if we have 6+ ensembleMembers, use 6 bins.**

– We may **repeat the same experiment with 10 ensembleMembers. If the volatility can be decreased, we would like to keep using 6 bins.**


– We contend that **increasing the number of epochs in the training from 5 to 10 was good for both the performance and the randomness.** We only expected to see less randomness, but even the performance improved.

– **the D_stat increased from 51.4% to 51.9% and the CAGR increased from 8.84% to 12.65%.**

– **by doubling the number of epoch, the STD, the randomness decreased to half.**

– **In backtests, we will use 5 epochs (to save precious computation power) , but in production environment, in which case we need to train the network only once (for today), we will use at least 10 epochs.**

– note however that increasing the epoch further was not always advantageous in the continuous input case. I share a weird thing that I don’t fully understand yet.

When I manually inspected both the histogram of the input and the predicted output function (the ANN surface) I was surprised to find that

– nEpoch = 5: there are many different random versions of the ANN function (no constant function). we don’t overtrain.

– nEpoch = 250: the NNSurface is always the same useless constant function.

So, as we increase nEpoch to 250, we actually get worse Function approximation. That is something to investigate later in the Synthetic Random Function case.

That is very weird. Another interesting thing: even if I set nEpoch to 250, very frequently the system stops after about 100 epochs with the message trainLm.stop = “Minimum gradient reached.”. The reason why it happens:

”The purpose of training is to minimize the objective function, not achieving zero error (or some other specified low value). When minimization occurs, theoretically, the gradient is zero. When a computer taking finite steps gets sufficiently near a local min, the gradient will be less than some small value. The program suggests 1e-10 is sufficiently small for believing that you are sufficiently near a local min. I agree, assuming inputs, initial weights and other learning parameters are properly scaled.”

So, the system is stuck in another local minimum. Should I try online (not batch) training?

Batch training is not proven to find the minimum. Use adapt() instead.

Tried Adapt(): the system doesn’t get stuck in that stupid constant function solution. The result is different every time; even with 2500 nEpoch, it doesn’t converge.

With batch training, after 250 nEpoch it always gives the same constant function. What I expect is that predicting the Gaussian is the problem. There are some outliers that still kill the learning.

We suggest testing it with the discretized case (2 bins) instead of the continuous case. If after 250 epochs the discretized case doesn’t stop prematurely and doesn’t give the useless constant function, we pronounce the discrete case the winner.

To sum up the current post: **Can we decrease volatility and randomness by increasing nEpoch? Yes, and the good news is that both randomness and performance improve. The bad news is that we can increase nEpoch from 5 to 10, but we cannot increase it too much (to 250)**, because of some weirdness in the learning of the Gaussian random function.


”

I would like to know what the differences are between the ADAPT and TRAIN functions.

Solution:

The difference between these training methods is that** ADAPT is optimized for situations where the order in which the data is presented matters.** An example application would be filtering a time-based signal. **TRAIN disregards the sequential order of the data, and treats the error of the entire set in each training epoch. If you use ADAPT with an input sequence of [1 10 2], you’ll see poor performance when passing in the input sequence [2 1 10].** If you use TRAIN, then you should get the same result for an input of 2 no matter its location in the sequence of inputs.

”

Our measurements, based on these parameters.

Input: only the continuous currDayChange. (no dayOfTheWeek input)

Target: next day change

Parameters:

`nNeurons = 2;`

`nEpoch = 5;`

`lookbackWindowSize = 200;`

`outlierFixLimit = 0.04;`

`nEnsembleMembers = [1, 0, 0];`

We made 8 random experiments. The performance measurements are here:

When **using Adapt()**, our most important measurement, the **D_stat, drops from 51.74% to 50.15%. 50% directional accuracy means it has no predictive power at all.** (A significant part of this underperformance is due to the fact that we use continuous, non-discrete input.) We have 200 samples in the training set, because our lookback is 200 days, and **it seems that Adapt() over-emphasizes the last samples, so in our case it is not advantageous to use. We learnt a new thing today.**


In this post we only visualize the data and try to draw some conclusions. **The previously studied dayOfTheWeek input has two very good properties that do not hold for the currDayChange input:**

**– dayOfTheWeek is discrete data. Its values are 1, 2, 3, 4, 5.**

**– by and large, each discrete value has the same number of samples. The number of Mondays and the number of Tuesdays are about equal.**

**These two properties are very handy and facilitate training the ANN. The currDayChange is not this kind of data. Therefore we have very serious doubts that we can achieve the same kind of prediction power as in the dayOfTheWeek case. The currDayChange is continuous data with a Gaussian distribution, so moving away from the mean, the number of samples decreases.**

If forecasting a continuous, Gaussian distributed function is not possible (we will try), or if we get a very poor result, we may try to convert the continuous, Gaussian case into a discrete case. That is an idea to try: we sort the inputs into a fixed number of bins, so we get a discrete input and we can ensure that the bins contain approximately the same number of samples. As a last resort, we can try this prediction technique later.

In this post, we show 2 sequences of plots (actually 3). In the first sequence, we just average the nextDayChange values in each bin. The second sequence shows the same, but the outliers, the nextDayChanges (the targets) greater than 4%, are eliminated. The reason is that unfortunately the system is sensitive to outliers (the ANN approximates the mean, and the mean is very sensitive to outliers). So, we get a better picture if we eliminate the outliers. And as we studied, eliminating outliers improves the prediction power. Our ANN will learn the outlier-free data, so it is sensible to plot the outlier-free data now.

F(currDayChange) = nextDayChange history (3200 days, 12 years).

The bins contain an approximately equal number of samples. There are 2 bins, 4 bins, 6 bins, 10 bins and 20 bins versions.

Sequence 1: with outliers

Sequence 2: without Target Outliers;

Notes:

– the 2 bins case shows that the **Russell 2000 is an FT (rather than an MR) index**. Up days are followed by larger Up days and Down days are followed by smaller Up days. OK, it is not strictly FT (in FT, Down days are followed by Down days), but it is not MR either. It is more FT than MR.

– at first sight, the 2 bins case charts can suggest that we cannot be clever at all, because our best forecast is always to vote for an Up next day, since even Down days are followed by Up days. So, based on this data, a perfect predictor would always vote for an Up day. Note that this is only true when we regard the whole 12 years period. However, our lookbackDays window is only 200 days. That was the optimum for the dayOfTheWeek case; we may find that another lookback works better for this input. We contend that when we look only at the last 200-day periods, we will find many times (especially in bear markets) that there is a real FT behaviour: Down days are followed by more Down days.

– **As we increase the number of bins, the randomness of the data is revealed gradually. That makes our life a misery.** While it is not too difficult to fit a combination of continuous sigm() or atan() functions to approximate the 2 bins, 4 bins and 6 bins versions, it is very difficult to approximate the 20 bins case with a smooth continuous function. In the end, randomness trumps everything. We see why we are not very optimistic that predicting random continuous functions is possible at all with good accuracy. However, discretization of the input is one solution that may help in prediction. We are free to select the best data representation that we can find that helps the predictor do its work. Using an ANN is not a science, it is an art. Finding a good data representation needs many creative ideas. One of the ideas is discretization (others: dimension reduction, discretization modulo rotation, normalization, detrending, calculating higher order functions instead of raw data, etc.)

– To us, **predicting the 4 bins or 6 bins version seems to be most promising. They reveal the nature of the function, but they don’t look too random to be impossible to approximate with a smooth continuous function.**

**What is the nature of the function to be predicted?** (Based on the 6 bins case chart, both with and without outliers.)

– the discretization bins represent: very oversold, medium oversold, slightly oversold, slightly overbought, medium overbought and highly overbought conditions. The borders between these categories are -1.47%, -0.65%, 0%, +0.65% and +1.47%. So, everything over a 1.47% daily gain means overbought.

– when currDayChange is very overbought, next day seems to mean revert (down day). Investors reckon the previous day was too much.

– when currDayChange is mildly overbought, that is the best for the next day return. This can be explained by the general bullishness of the market on that day. Investors wanted to buy stocks on the previous day, but because the market was up ‘mildly’, they thought the market was too extended on that day. So, they postponed their buying to the next day, hoping that prices would mean revert. However, prices only mean revert if the previous day was up more than +1.47%. So, the investors wait in vain, and they drive up the prices big on the second day, because they are impatient.

– usually when prices are down (under 0%), the next day is not so good. It is not really FT in the long term 12 years case, but close to it.

– The difference between the outlier eliminated and the non-eliminated case is strongest in the very oversold case. If we don’t eliminate outliers, we have a very bullish Up day next day. This can be justified by a mean reverting explanation. If something is much oversold on a panic day, the next day its price may be lifted up. But note that this doesn’t happen in the outlier eliminated chart. When we eliminate outliers, we eliminate the non-normality of the market; we just have non-panic and non-happiness-madness days. On normal days, the rationale is that there is a real fundamental reason (and not a panic reason) why the previous day was a very oversold Down day, and this fundamental reason probably lasts for a long time, implying FT and saying that the Down days should continue.

– The 20 bin case is very chaotic with the outliers, but without the outliers one can imagine the true nature of the underlying function. We try to show it with a red ‘smooth’ line in the following image. This is how we would approximate it.

– by outlier elimination we mean that when the TargetDailyChange is bigger than a 4% change, we exclude the sample from the training. We call this TargetOutlierElimination. One can argue that we can similarly do an InputOutlierElimination.

That is exactly what we do in the 3rd Sequence. These charts are made by excluding all samples in which either the input or the target dailyChange is bigger than 4%.

Excluding only the targetOutliers eliminated 80 samples from 3267. (2.5%)

Excluding both the input and target outliers eliminated 142 samples from 3267. (4.3%)
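The two elimination schemes amount to two boolean masks. Below is a Python sketch (the blog's code is MATLAB); the 4% limit is the post's outlierFixLimit, but the data here is synthetic, so the dropped-sample counts will not match the 80 and 142 above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3267
curr = rng.normal(0.0, 0.015, n)  # input: day T %change (synthetic)
nxt = rng.normal(0.0, 0.015, n)   # target: day T+1 %change (synthetic)
LIMIT = 0.04                      # the post's 4% outlier threshold

# TargetOutlierElimination: drop a sample only if its *target* is an outlier
keep_target_only = np.abs(nxt) <= LIMIT
# Input + Target elimination (Sequence 3): drop if *either* side is an outlier
keep_both = keep_target_only & (np.abs(curr) <= LIMIT)

print(n - keep_target_only.sum(), "dropped by target-only elimination")
print(n - keep_both.sum(), "dropped by input+target elimination")
```

By construction the input+target scheme can only drop more samples than the target-only scheme, matching the 80 vs. 142 counts reported above.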

Sequence 3: without Input and Target Outliers:

It is interesting that the second Sequence (with only Target outlier elimination) looks less random in the 20 bins case than the third sequence (with Target and Input outlier elimination). We should develop some theory about which version we want to predict: eliminating outliers from both the input and the target, or only from the target. We will try to predict with all 3 methods.


Let’s do some quick backtests for the different dayOfTheWeek representations.

Use 10 ensemble members. Apply the FF(%) method, that is, the FeedForward network (not the GRNN), where the % means that we predict the next day %changes, and not the sign(change). We train for 5 epochs and we use 2 neurons if not stated otherwise. The lookback window size is 200 days, and we used a fixed 4% output outlier elimination threshold.

However, note that as we increase the input dimension with a new currDayChange input, it would be sensible to add at least another neuron to the network. But the number of neurons has to be determined by a rigorous optimization algorithm and not by this ‘sensible’ hunch.

**Backtest 1: dayOfTheWeek is 5 dimensional.**

**Without currDayChange input:**

D_stat: 51.56%, projectedCAGR: 11.25%, TR: 133.81%

**With currDayChange input:** (2 different experiments. The results are random of course.)

D_stat: 51.59%, projectedCAGR: 11.54%, TR: 147.10%

D_stat: 52.09%, projectedCAGR: 15.81%, TR: 283.79%

On average, every measurement is improved. Some notes:

– The 2 experiments give very different results. This suggests that it is a very volatile method. That is expected: as we increased the input dimension space, the ANN training algorithm can vary more between experiments. To decrease volatility we can increase the number of epochs and/or the number of ensemble members.

– It is a beautiful result, considering we used only 2 neurons while the input dimension is 5+1=6.

**Backtest 2: dayOfTheWeek is 1 dimensional.**

**Without currDayChange input:**

D_stat: 52.45%, projectedCAGR: 11.77%, TR: 148.93%

**With currDayChange input:** (3 different experiments.)

D_stat: 52.68%, projectedCAGR: 18.67%, TR: 410.62%

D_stat: 52.06%, projectedCAGR: 11.30%, TR: 139.74%

D_stat: 52.32%, projectedCAGR: 11.10%, TR: 134.43%

With currDayChange input, but with 3 neurons instead of 2: (2 different experiments.)

D_stat: 53.09%, projectedCAGR: 18.72%, TR: 413.51%

D_stat: 52.32%, projectedCAGR: 15.93%, TR: 289.30%

A TR% chart for the new method can be seen here

We like the chart. It is quite monotonic (except in the 2008 earthquake, but that is acceptable).

Overall, **it seems promising to use the new currDayChange input. For the 1 dimensional case, it can improve the CAGR from 12% to 16% and the TR from 150% to 300%. These are our best results so far!!!** However, we shouldn’t rush to these conclusions yet. We need to learn and optimize many things; we want to understand the behaviour of the components. Overall, I reckon we have to spend at least 1 full month before concluding that it is really a promising strategy and before announcing real backtested results. The issues we want to solve don’t take too long to code, but it takes very long to backtest, to run the actual experiments. **A backtest for an ensemble of 10 members for 3000 days** (12 years) takes about 2 hours. Because of the randomness, we have to repeat every test at least 2-4 times. **That is 8 hours to try even a simple thing.** And if we make a mistake in the code, or when we fine tune parameters (nNeurons, nEpoch, lookbackDays), we have to repeat these 4 experiments many times. And that is even if there are no surprises on the road ahead.

However, there is one thing that is very important to learn. With this new currDayChange input, we introduced a continuous, non-discrete and very random (almost Gaussian random) input. We don’t fully understand yet how ANN predictions work for this kind of input. When it is successful, when it is not. Can it predict at all? Note that because of the Gaussian randomness, the samples at the edges are thinly represented. Can the ANN learn this kind of data well? These questions have to be studied. And exactly these very important questions are the ones that are neglected by other ANN practitioners (even in academic articles). Newbie ANN users hope that it is enough to feed the ANN whatever input we have and it will predict well. Ouch. The truth is very far from this, as we have proved in this blog many times.

This is the **roadmap** of how we imagine making headway.

–** study the 1 dim. case with only the currDayChange** input. **Test daily MR (Mean Reversion), FT (Follow Through) strategy.** It didn’t work in our previous tests 6 months ago. Why should it work now? We hope that many things have changed since: we have output normalization, input normalization and there is outlier elimination now; to mention some.

– Write a deterministic predictor with different input data representation (continuous, 2 bins, 6 bins)

– test a continuous input representation

– test with discrete input representation: 2 bins,

– test with discrete input representation: 6 bins with equal number of samples

– train it for optimizing (nNeurons, nEpoch),

– optimize lookbackDays

– **study the 2 dimensional input**

– Write a deterministic predictor with different input data representation

– continuous case

– test with discrete input representation: 2 bins,

– test with discrete input representation: 6 bins with equal number of samples

– train it for optimizing (nNeurons, nEpoch),

– optimize lookbackDays

– **study the 6 dimensional input**

– Write a deterministic predictor

– continuous case

– test with discrete input representation: 2 bins,

– test with discrete input representation: 6 bins with equal number of samples

– train it for optimizing (nNeurons, nEpoch),

– optimize lookbackDays

– heterogeneous ensemble

– make the prediction live on the Internet as NeuralSniffer Predictor version 2.

Overall, these premature backtests of this post show that **we can hope that the new currDayChange input improves the prediction performance by at least 50%. Instead of 12% CAGR, we target 16% CAGR.**


Testing the new input, the currDayChange, in backtests, we learned another thing about the range of inputs: in the GRNN case it does matter if the range is very different across dimensions. In our first test, the first input dimension, **the dayOfTheWeek, had a range of [1..5]. The second input dimension, the currDayChange, had a range of about [-0.05..+0.05]. The difference is about 100x.** Guess what happened. **The GRNN that in theory used both the dayOfTheWeek and the currDayChange inputs completely ignored the currDayChange. The version that used the extra currDayChange input gave exactly the same prediction as the one without this input.**

It completely neglected the new input with its tiny range. We were surprised, but in hindsight, it is perfectly understandable. **The GRNN works as an RBF network: it stores the input samples in the network, optimizes (learns) the radius r for every input sample, and forms spheres in the hyperspace. However, the radius of the sphere is the same across all the input dimensions.** (It is not a hyper-ellipse.) So, **in the GRNN case, it is very important that the input dimensions have the same (or similar) range.**
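A tiny numeric illustration of this point in Python (not the GRNN itself, just the distance computation a shared-radius kernel relies on; the sample values are ours): the [1..5] dayOfTheWeek axis drowns out the [-0.05..+0.05] currDayChange axis.

```python
import numpy as np

# Two stored samples and one query point: (dayOfTheWeek, currDayChange)
a = np.array([1.0, 0.010])    # Monday, +1.0%
b = np.array([2.0, -0.010])   # Tuesday, -1.0%
q = np.array([1.0, -0.009])   # query: Monday, -0.9%

# A shared radius means the kernel sees only plain Euclidean distance,
# which is dominated by the wide-range dayOfTheWeek axis:
d_a, d_b = np.linalg.norm(q - a), np.linalg.norm(q - b)
print(d_a, d_b)  # ~0.019 vs ~1.0: currDayChange is effectively invisible

# Rescale each dimension to a comparable range and currDayChange matters again:
scale = np.array([2.0, 0.01])  # illustrative half-ranges of the two dimensions
d_a2 = np.linalg.norm((q - a) / scale)
d_b2 = np.linalg.norm((q - b) / scale)
print(d_a2, d_b2)  # now b, with the similar currDayChange, is the closer neighbour
```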

And in the FeedForward ANN case, it doesn’t hurt either.

As a start, we used this normalization function, which maps the maximum absolute value of the input to 2 (by dividing by half of the absolute max):

```matlab
function [nnInput, multiplier] = NormalizeInput(nnInput, inputVectorDim, isUseCurrBarChanges, isDirectionCurrBarChange)
    indexCurrBarChangesMin = min(nnInput(inputVectorDim, :));
    indexCurrBarChangesMax = max(nnInput(inputVectorDim, :));
    indexCurrBarChangesAbsMinMax = max(abs(indexCurrBarChangesMin), abs(indexCurrBarChangesMax));
    multiplier = indexCurrBarChangesAbsMinMax / 2;

    if (isUseCurrBarChanges)
        if (isDirectionCurrBarChange)
            nnInput(inputVectorDim, :) = sign(nnInput(inputVectorDim, :));
        else
            nnInput(inputVectorDim, :) = nnInput(inputVectorDim, :) / multiplier;
        end
    end
end
```

Later we thought that **using a fixed multiplier is awkward, and it doesn’t adapt to the different regimes, for example different volatility environments. So, we introduced the stdev as a multiplier.** **Two standard deviations away from the mean account for roughly 95 percent of the samples.** When we map 2x stdev to 1, it means that only about 5% of the currDayChange samples fall out of the range [-1..1]. As we use a rolling window of about 200 samples, that is only about 10 samples out of 200 in our case. And this solution is adaptive.

```matlab
indexCurrBarChangesStdDev = std(nnInput(inputVectorDim, :));
multiplier = indexCurrBarChangesStdDev * 2;
```
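A quick numeric check of the claim, as a Python sketch with synthetic Gaussian data standing in for a 200-day rolling window:

```python
import numpy as np

rng = np.random.default_rng(2)
changes = rng.normal(0.0, 0.013, 200)  # a 200-sample rolling window (synthetic)

multiplier = 2.0 * changes.std()       # the stdev-based multiplier
normalized = changes / multiplier      # two stdevs now map to 1.0

n_outside = int(np.sum(np.abs(normalized) > 1.0))
print(n_outside)  # in expectation about 5% of 200, i.e. around 10 samples
```

After this normalization, the standard deviation of the rescaled window is exactly 0.5 regardless of the volatility regime, which is what makes the scheme adaptive.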

The result:

the GRNN in the 1dimensional dayOfTheWeek case:

using the [-0.05..0.05] currDayChange range:

D_stat: 51.19%, projectedCAGR: -5.89%, TR: -67.41%

using the 2xstdev mapped to [-1..1] currDayChange range:

D_stat: 51.21%, projectedCAGR: 0.83%, TR: -26.62%

Yes, the TR is still negative. We know that the GRNN doesn’t really work for this task. But notice that the **CAGR has turned from -5% to +1%**. Overall this change is an improvement.

We think that **it doesn’t help too much in the FF ANN case (because of the default mapminmax), but as it doesn’t hurt either, in the future, we always normalize all the input dimensions to the [-1..+1] range.**


Here is the collection of studies on the weekend effect:

http://calendar-effects.behaviouralfinance.net/weekend-effect/

”

The weekend effect (also known as the Monday effect, the day-of-the-week effect or the Monday seasonal) refers to the tendency of stocks to exhibit relatively large returns on Fridays compared to those on Mondays. This is a particularly puzzling anomaly because, as Monday returns span three days, if anything, one would expect returns on a Monday to be higher than returns for other days of the week due to the longer period and the greater risk.

”

In our previous studies, we usually shorted the market on Monday, because of the bearish Monday effect. However, it is worth mentioning that this is not a must. In case the expected shorting profit is not greater than a money market fund’s daily %rate, it is not worth shorting the market. It is better to stay in cash and collect the interest payment. This is especially true because if we are out of the market (we are in cash) on the down Mondays, we earn the cash interest for 3 consecutive days (Sat, Sun, Mon). So, shorting the bearish Monday is only worth doing if the expected value is bigger than 3 times the daily interest on the treasuries. Therefore, in real life, we probably wouldn’t play shorting the market at Friday close. We will test this possibility in the future.
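This break-even reasoning can be written down directly. A Python sketch; the function name and the numbers are illustrative only, not a recommendation:

```python
def should_short_on_friday_close(expected_monday_return, daily_cash_rate):
    """Short over the weekend only if the expected Monday drop beats the
    interest earned by sitting in cash for 3 days (Sat, Sun, Mon)."""
    expected_short_profit = -expected_monday_return
    return expected_short_profit > 3 * daily_cash_rate

# An expected -0.10% Monday beats 3 * 0.02% = 0.06% of cash interest:
print(should_short_on_friday_close(-0.0010, 0.0002))  # True
# An expected -0.05% Monday does not:
print(should_short_on_friday_close(-0.0005, 0.0002))  # False
```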

So far in our studies, we used only the day of the week as input. However, there are studies that investigate how the previous day %return affects the next day %return on specific days of the week.

1.

One study here from QUANTIFIABLE EDGES shows that in the one year from March 2009 to March 2010, buying down Friday closes was profitable.

http://quantifiableedges.blogspot.com/2010/03/after-down-fridays-over-past-year.html

”

It appears the edge has only been on down Fridays.

It is important to understand that this is what I often refer to as an “environmental edge”. In other words, it is something that has worked in the recent past and seems to be a result of the current market environment. It is not an edge that has persisted over a long period of time nor do I expect it to continue to persist for a long period of time from now. That doesn’t mean it isn’t a useful observation, though. In such cases where I believe a setup contains an environmental edge I will look to use it to my advantage until it appears to be losing its effectiveness.

”

2.

Note that in the comment section of that article somebody contends that buying any Friday closes was profitable.

I have to agree with the commenter. The year 2009 was a very bullish year and the Bullish Monday effect was alive.

The commenter also notes that this effect worked only in that specific period, in that bull regime. I couldn’t agree more.

3.

In the list from http://calendar-effects.behaviouralfinance.net/weekend-effect/, there is one from Abraham Abraham,

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=5358

”

Abstract:

It is well known that stock returns, on average, are negative on Mondays. Yet, it is less well known that this finding is substantially the consequence of returns in prior trading sessions. When Friday’s return is negative, Monday’s return is negative nearly 80 percent of the time with a mean return of -0.61 percent. When Friday’s return is positive, the subsequent Monday’s mean return is positive, 0.11 percent. This relationship is stronger than for any other pair of trading days and is most acute in small- and medium-size companies. The trading behavior of individual investors appears to be at least one factor contributing to this pattern. Individual investors are more active sellers of stock on Mondays, particularly following bad news in the market.

”

Consider 2 inputs for our Russell 2000 index (RUT) investigation from 1997 till November 2010: the day of the week on day T and the %return on day T, with the output being the %return on day T+1.

Running the %return input from -2% to +2% by 0.1 increments, and averaging the next day return in the bins, we can plot in 3D this chart: (click the images for the full size version)

The separate slices are:

Monday:

Tuesday:

Wednesday:

Thursday:

Friday:

And here is the aggregation of the different slices together:

We know that Friday has a bearish output (bearish Monday). This bearishness can be seen in the whole Friday spectrum. Note that close to zero, there are many samples, but as we move away from zero, there are exponentially fewer samples (an almost Gaussian distribution), so the statistics there are less reliable. These plots contain the average %return, however, so the aggregated %gain is divided by the number of samples in the bucket. But be warned that the ANN doesn’t learn exactly these average %gains. In these plots, the samples in the middle are divided by a higher value (a high number of samples fall there), while samples at the edge are divided by a lower value. So, the samples at the edge are artificially bumped up. They are given more weight than they deserve. That is OK for these plots, but the ANN training algorithm treats all samples equally when calculating the Error. Nevertheless, it is sensible to plot these charts, because we can visualize these returns more clearly.

What we would like to point out from this discussion is that it is worth inspecting the samples close to the zero point. Take the Friday slice plot. The first 4 bars to the left of zero are negative bars, so in the grand average, it is very probable that considering all samples under zero, the grand average is negative. On the other hand, take the samples to the right of zero. You can find 3 positive bars. So, it is possible that the grand average above zero will be positive.

Note also that as the input approaches +2, the output (next day %gain) becomes very negative. This is some kind of mean reversion (MR) at work here. If Friday is very much up, Monday is extremely bearish.

It is clearer if we divide the second input not into 21 buckets (from -2 to 2 by 0.1 increments), but only into 2 bins. The division is made on the borderline of zero. This is the plot:
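The per-weekday 2-bin tabulation can be sketched like this in Python. With the real RUT series, the Friday row should reproduce the quoted asymmetry; here we only show the mechanics on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3200
weekday = rng.integers(1, 6, n)       # 1=Mon .. 5=Fri (synthetic calendar)
curr = rng.normal(0.0, 0.013, n)      # day T %return (synthetic)
nxt = rng.normal(0.0, 0.013, n)       # day T+1 %return (synthetic)

# Average next-day %return per (weekday, down-day / up-day) cell
for d, name in zip(range(1, 6), ["Mon", "Tue", "Wed", "Thu", "Fri"]):
    down = (weekday == d) & (curr < 0)
    up = (weekday == d) & (curr >= 0)
    print(f"{name}: after down day {nxt[down].mean():+.4f},"
          f" after up day {nxt[up].mean():+.4f}")
```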

It is clear instantly that the quote from the study is true

”

When Friday’s return is negative, Monday’s return is negative nearly 80 percent of the time with a mean return of -0.61 percent. When Friday’s return is positive, the subsequent Monday’s mean return is positive, 0.11 percent.

”

It is a surprise, because we use the RUT while the study mentioned used the SPX, and we would expect slightly different behaviour from different indices.

So Friday expresses a kind of follow through (FT), not mean reversion (MR), behaviour.

Similar conclusions can be deduced for other days, like Thursday. For example, if Thursday is down, the next Friday is usually very bullish, but if Thursday is up, the next Friday is slightly bearish. That is a kind of mean reversion (MR) behaviour.

Just for comparison, here is the %return when we aggregate everything along the day-of-the-week input:

Our conclusion in this post is that after studying the input output statistics, training the ANN on these 2 inputs seems to be promising. In the next posts, we will do exactly that.


http://94.125.180.164/NeuralSniffer/version1/RUT_DayOfTheWeekANN_version1.html

It is updated daily at 9 a.m. London time (GMT).

Contains up-to-date figures for the RUT index daily %return distribution for the last 50 days, last 100 days, last 200 days and since 1987.


We performed the directional prediction, that is the classification task (predicting only sign(next%return)), and the function approximation task (predicting the ‘next%return’ value) in the previous two posts. Generally, the function approximation worked better (note that in our previous classification test, we hadn’t removed the outlier days and we used 1 dimensional encoding, but in this test we remove outliers and use 5 dimensional encoding). However, the classification was not bad. (It has changed now.)

We used nEpoch = 5 for all FF (Feed Forward network) tests. In the GRNN there is no nEpoch parameter; we used the default spread = 1 there.

We compare the homogeneous 10 member ensembles to the heterogeneous 10 member ensembles:

The performance of the

E1. Homogeneous (GRNN(%return)) ensemble [0 0 1] equivalent to [0 0 10]

E2. Homogeneous (FF(%return)) ensemble: [10 0 0]

E3. Homogeneous (FF(sign(return))) ensemble [0 10 0]

E4. Heterogeneous 8-1-1 ensemble [8 1 1]

E5. Heterogeneous 6-2-2 ensemble [6 2 2]

E6. Heterogeneous 4-4-2 ensemble [4 4 2]

are compared.

For example, [8 1 1] means 8 FF(%return), 1 FF(sign(return)) and 1 GRNN(%return).

Note that the GRNN is not a random algorithm; therefore the 10 member ensemble is equivalent to the standalone version.

There are many approaches for aggregating the votes of the members.

Quote from the Kin Keung Forex prediction book

”

Typically, majority voting, ranking and weighted averaging are three popular decision fusion approaches.

Majority voting is the most widely used fusion strategy for classification problems due to its easy implementation. Ensemble members’ voting determines the final decision. Usually, it takes over half the ensemble to agree a result for it to be accepted as the final output of the ensemble regardless of the diversity and accuracy of each network’s generalization. Majority voting ignores the fact that some neural networks that lie in a minority sometimes do produce the correct results. At the stage of integration, it ignores the existence of diversity that is the motivation for ensembles (Yang and Browne, 2004). In addition, majority voting is only a class of integration strategy at the abstract level.

Ranking is where the members of an ensemble are called low level classifiers and they produce not only a single result but a list of choices ranked in terms of their likelihood. Then the high level classifier chooses from this set of classes using additional information that is not usually available to or well represented in a single low level classifier (Yang and Browne, 2004). However, ranking strategy is a class of fusion strategy at the rank level, as earlier mentioned.

Weighted averaging is where the final ensemble decision is calculated in terms of individual ensemble members’ performances and a weight attached to each member’s output. The gross weight is one and each ensemble member is entitled to a portion of this gross weight based on their performances or diversity (Yang and Browne, 2004).

Generally, there are two ensemble strategies: linear ensemble and nonlinear ensemble strategies.

A. Linear ensemble strategy

Typically, linear ensemble strategies include two approaches: the simple averaging (Tumer and Ghosh, 1995; Lincoln and Skrzypek, 1990) approach and the weighted averaging (Burges, 1998) approach. There are three types of weighted averaging: the simple mean squared error (MSE) approach (Benediktsson et al., 1997), stacked regression (modified MSE) approach (Breiman, 1996a) and variance-based weighted approach (Tresp and Taniguchi, 1995).

Simple averaging is one of the most frequently used ensemble approaches. After selecting the members of the ensemble, the final prediction can be obtained by averaging the sum of each forecaster’s prediction of ensemble members. Some experiments (Hansen and Salamon, 1990; Breiman, 1994) have shown that simple averaging is an effective approach to improve neural network performance. It is more useful when the local minima of ensemble members are different, i.e., when the local minima of ensemble networks are different. Different local minima mean that ensemble members are diverse. Thus averaging can reduce the ensemble variance. However, this approach treats each member equally, i.e., it does not stress ensemble members that can make more contribution to the final generalization. If the variances of ensemble networks are very different, we do not expect to obtain a better result using simple averaging (Ueda, 2000).

Weighted averaging is where the final ensemble prediction result is calculated based upon individual members’ performances with a weight attached to each individual member’s prediction. The gross weight is one and each member of an ensemble is entitled to a portion of this gross weight according to their performance or diversity. There are three methods used to calculate weights: the simple MSE approach (Benediktsson et al., 1997), stacked regression approach (Breiman, 1996a) and variance-based weighted approach (Tresp and Taniguchi, 1995).

B. Nonlinear ensemble strategy

The nonlinear ensemble method is a promising approach for determining the optimal weight of a neural ensemble predictor. The literature only mentions one nonlinear ensemble approach: the neural network-based nonlinear ensemble method (Huang et al., 1995; Yu et al., 2005c). This approach uses “meta” neural networks for ensemble purposes (Lai et al., 2006a). Experiment results obtained show that the neural network-based nonlinear ensemble approach consistently outperforms the other ensemble approach.

”

Note that we cannot average the votes here, because the FF(sign(return)) network predicts directions only and it wouldn’t be fair to combine this with actual %return forecasts.

Therefore in this study, we used the majority vote to aggregate the votes of the members, that is:

```matlab
resultForecast = sum(sign(forecasts));
```
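A minimal Python sketch of this vote counting (the blog's code is MATLAB; taking the sign of the vote sum gives the final direction, and the member values below are illustrative):

```python
import numpy as np

def majority_vote(forecasts):
    """Each member contributes only the sign of its forecast, so %return
    forecasters and sign-only forecasters vote on an equal footing."""
    return np.sign(np.sum(np.sign(forecasts)))

# Illustrative member outputs: 8 FF(%return), 1 FF(sign), 1 GRNN forecast
forecasts = [0.004, 0.001, -0.002, 0.003, 0.002, 0.001, 0.005, -0.001,
             1.0, -0.003]
print(majority_vote(forecasts))  # 7 bullish vs 3 bearish votes -> 1.0 (go long)
```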

1.

It is strange that the FF(%return) network returns such a good CAGR while having such a bad D_stat. Because of this, using the D_stat as a performance measure for comparing these algorithms is not advisable. **We** had better **rely on the CAGR and TR measurements** now.

2.

Based on the performance measurements, **just blindly aggregating the member forecasts in the [4 4 2] case brings no improvement.** In this case, the weights of the different types of algorithms are 4, 4, 2. That is very close to the 3.3, 3.3, 3.3 equal weight weighting scheme. However, with this heterogeneous ensemble approach, **using majority vote ensembling where the members get equal votes, we got a bad result, because:**

A.

**The standalone algorithms, namely the FF(sign(return)) and GRNN versions, don’t work.** They gave negative TR. It is not a surprise that combining them is not very good. Albeit the combination of them is better than the worst of the standalone algorithms, the combination [4 4 2] is not better than the best standalone, the FF(%return) ANN.

(Note: in our previous studies, the FF(sign(return)) and GRNN worked only for the 1 dimensional case without outlier elimination, and we hadn’t tested them for the 5 dimensional case.)

B.

**It shows only that equal weighting majority voting doesn’t work.** In the future, we may try other voting mechanisms, like averaging the forecasts, or some other confidence-based weighting. For example, giving more weight to the FF(%return) ANN, which performs very well as a standalone algorithm.

3.

For the CAGR and TR, **the best performance is obtained by the heterogeneous [8 1 1] network.** This shows us that a blind equal aggregation of the members with w=1 weight doesn’t work, but **when a good predictor has a higher weight (in this case FF(%return) has 80% of the votes), the overall prediction improves, even if the other 20% of members are generally losers.** What happens here is that when the decisions of the FF(%return) members are almost equally bullish and bearish (4 bullish votes, 4 bearish votes), it is good to aggregate a new player, a new strategist, into the picture. But **the main point** of this study is this: **when ensembling networks, never use the equal weight approach. The winner standalone strategies should be given higher weights.**

**The heterogeneous ensemble can be a better predictor than the homogeneous one, if the weights are selected according to the members’ underlying performance.**


Monday: [1, 0, 0, 0, 0]

Tuesday: [0, 1, 0, 0, 0]

Wednesday: [0, 0, 1, 0, 0]

Thursday: [0, 0, 0, 1, 0]

Friday: [0, 0, 0, 0, 1]
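The table above is the standard one-hot (1-of-5) encoding. A minimal Python sketch, with the function name ours (the blog's code is MATLAB):

```python
import numpy as np

def encode_day_of_week(day_index):
    """day_index: 0=Monday .. 4=Friday -> the 5 dimensional one-hot vector."""
    v = np.zeros(5)
    v[day_index] = 1.0
    return v

print(encode_day_of_week(0))  # Monday -> [1. 0. 0. 0. 0.]
print(encode_day_of_week(4))  # Friday -> [0. 0. 0. 0. 1.]
```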

In hindsight, I prefer the -1..+1 encoding instead of the 0..+1 encoding, but because we trust Matlab’s newff() to use the default ‘mapminmax’ input preprocessing to map the input to -1..+1, it doesn’t really matter how we select our input range. (The output range does matter, but that is another story.)

We run 2 different tests. One with a **standalone** ANN predictor:

The other with the ensemble method in which the ensemble contains 10 members.

The **ensemble** uses the Sum(sign()) instead of the Avg() of the members as in our previous tests.

forecast = sum(sign(standaloneForecasts));
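The voting rule above can be sketched in Python (the member forecast values are hypothetical; the original is one Matlab line):

```python
def ensemble_vote(standalone_forecasts):
    # Sum of the signs of the member forecasts: each member casts a +1
    # (bullish) or -1 (bearish) vote; the magnitude of the individual
    # forecast is ignored, unlike in an Avg() aggregation.
    def sign(x):
        return (x > 0) - (x < 0)
    return sum(sign(f) for f in standalone_forecasts)

# Hypothetical 10-member ensemble: 7 bullish, 3 bearish -> net forecast +4.
votes = [0.002, 0.01, -0.003, 0.005, 0.001, -0.02, 0.004, 0.003, -0.001, 0.006]
print(ensemble_vote(votes))  # 4
```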

Notes:

1. **The D_stat directional accuracy (51.5%) is not as good as in our best 1 dimensional encoding case (53%), but surprisingly the CAGR (Compound Annual Growth Rate)=12.6% and TR (Total Return)=135% are quite pleasant.** For the nEnsembleMembers = 10, nEpoch = 6 case, here is the Total Return chart.

Not bad. The volatility is not too high.

2.

**We consider this test to be successful. The 5 dimensional encoding is better than the average 1 dimensional case (but not better than the best 1 dimensional encoding). Therefore, we would like to use the 5 dimensional encoding in the future. It is based less on luck** and requires less parameter fine-tuning (the encodingTypeModulus = 0, 1, 2, 3, 4 parameter can be omitted in future tests).

3.

**What is the explanation of this better ‘overall performance’?** Let’s try to visualize the encoding in the 2 cases. It is difficult, because we cannot visualize a 5 dimensional function: F(x) = y, in which x is 5 dimensional and y is 1 dimensional. First, let’s inspect the 1 dimensional case. Suppose we start to approximate the function in the first image. I illustrated a solution with a red line.

Compared to this, try to approximate a function in which Friday is a huge up day. A likely solution is the red line:

**The huge Friday up days don’t modify the Monday forecast, because Monday is not the neighbour of Friday.** However, they increase the forecasted Thursday %gain, because Thursday is a neighbour of Friday in this encoding scheme.

Let’s see what happens **in 5 dimensions.** Actually, we plot only a 2 dimensional subspace of it:

The red rectangles represent the 1st forecast case: the small Friday version. The green line is the first forecast line. However, **when Friday went up,** in the blue rectangle case, **the approximated blue line is increased**, and it is higher not only at the Friday point, but **at the Monday point too.**

This happens because our approximation function has some constraints. It has 2 neurons, 2 weight parameters + the bias, that is 3 parameters to synthesize, and we are in a 5 dimensional space. So the function approximation is not a straight, linear line, but a curve (which is not illustrated well here).

**In the 5 dimensional space, Friday becomes the dimensional neighbour of Monday.**

Consider it in even more detail: **moving from 1 dimension to 5 dimensions completely redesigned the ‘neighbour’ relationship.** **While in the 1 dimensional space Wednesday was 2 steps away from Friday (an indirect neighbour), in the 5 dimensional space everyone is a direct neighbour of everyone. That is a very strange, new concept.** Note that Friday is now a neighbour even of Wednesday and Tuesday.
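This ‘everyone is a direct neighbour’ property can be checked directly: in the one-hot encoding, the Euclidean distance between any two distinct days is the same, √2. A quick Python check (illustration only):

```python
import itertools
import math

days = {
    "Mon": [1, 0, 0, 0, 0],
    "Tue": [0, 1, 0, 0, 0],
    "Wed": [0, 0, 1, 0, 0],
    "Thu": [0, 0, 0, 1, 0],
    "Fri": [0, 0, 0, 0, 1],
}

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Every pair of distinct days is exactly sqrt(2) apart: Friday is just as
# close to Monday as Thursday is. In the 1 dimensional encoding, by
# contrast, |5 - 1| = 4 separates Friday from Monday.
for a, b in itertools.combinations(days, 2):
    assert abs(dist(days[a], days[b]) - math.sqrt(2)) < 1e-12
print("all pairwise distances equal sqrt(2)")
```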

**Is it good or bad? We don’t know. It is different.** Probably it depends on the specific task.

For example, **in the 2010 October bull market, all 10 of the market’s up-moves happened to be on a Friday. In the 5 dimensional case, this elevates the forecast for all the days**: Monday…Friday, **giving an overall bullishness to the prediction, while in the 1 dimensional case it only elevates the forecasts of Thursday and Friday.**

In the opposite, bearish market, the same can happen. If all the selling happens to be on Mondays (bearish Monday), it will give an overall bearishness to all days in the 5 dimensional case, while it will decrease the forecasts of only Monday and Tuesday in the 1 dimensional case.

4.

The tests show that for the standalone case the number of epochs = 5 is quite good, and for the **ensemble case we will pick nEpoch = 5** as well. We will choose this configuration in the future.

5.

We found that **there is less randomness, less variance in the 5 dimensional case than in the 1 dimensional case.**

See this image from the previous post:

You can see it in the performance images. In the 1 dimensional case (nEpoch=4, nEnsemble=10, Modulus=0), the D_stat varied between 52.36% and 53.53% when we ran the same experiment multiple times, due to randomness. However, in the 5 dimensional case (nEpoch=4, nEnsemble=10), the D_stat is quite stable, from 51.24% to 51.54%.

The same can be observed for TR. In the 1 dimensional case, even with 10 ensemble members, nEpoch=4, Modulus=0, we could see TR = (123%, 436%, 211%, 250%) across the same tests. With the 5 dimensional case the TR is quite stable at TR = (134%, 138%, 137%, 149%).

**Overall, we decide that the 5 dimensional case is more stable. Even though it decreases our previously best CAGR% = 16.30%** (1 dimensional, Modulus = 0) **case, we would like to move to the less profitable (CAGR% = 12.68%) but 5 dimensional input in the future.**

It doesn’t mean we completely drop the 1 dimensional input. It can be very handy for visualizing data, testing new ideas and understanding them. So in future debugging and visualization scenarios we will probably use the 1 dimensional case, but in the production environment we apply the 5 dimensional input.

]]>

”

The authors report the results of an empirical study about the effect of input encoding on the performance of a neural network in the classification of numerical data. Two types of encoding schemes were studied, namely numerical encoding and bit pattern encoding. Fisher Iris data were used to evaluate the performance of various encoding approaches. It was found that encoding approaches affect a neural network’s ability to extract features from the raw data. Input encoding also affects the training errors, such as maximum error, root square error, the training times and cycles needed to attain these error thresholds. It was also noted that an encoding approach that uses more input nodes (more dimensions?) to represent a single parameter generally can result in relatively lower training errors for the same training cycles (but more epochs necessary to train?)

”

**In our previous studies, we mapped the days of the week to numbers. Our straightforward mapping was:
Monday: 1
Tuesday: 2
Wednesday: 3
Thursday: 4
Friday: 5**

The problem is that we have a cyclical input: after Friday, the next day is Monday. However, we have to map this cyclical input to a serial one. In real life the neighbour of Friday is Monday, but in our encoding Monday is not the neighbour of Friday. We have cut the cyclical chain somewhere in the circle. Cutting between Friday and Monday was an obvious choice, but maybe not the best. When we flatten the cycle to a 1 dimensional line, we have to put the discontinuity somewhere. For example, we can cut Tuesday from Monday, so the Tuesday-Monday neighbour relationship disappears.

Note that with this encoding, we don’t have a 0 as an input. Whether that is good or bad, we don’t know. But we assume it is insignificant, because the default newff() in Matlab uses the ‘mapminmax’ function to preprocess the input to -1..+1. We assume (because we debugged the code) that the input preprocessing works in Matlab. We know from previous studies that the output preprocessing doesn’t work in Matlab as it is supposed to, so we had better preprocess the output ourselves; but we omit preprocessing the input and let Matlab do it.

We tested various versions of encoding.

Let ‘i’ be a number from 0 to 4. We tested 5 versions with the following code:

`dateWeekDaysOrig = weekday(dates) - 1; % this maps the date to 1..5 (Mon..Fri)`

`dateWeekDays = mod(dateWeekDaysOrig - 1 + i, 5) + 1;`
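The same rotation can be sketched in Python (weekday codes 1..5 for Mon..Fri, as in the Matlab code):

```python
def rotate_weekday(day, i):
    # Rotate the 1..5 (Mon..Fri) weekday code by i steps, mirroring the
    # Matlab line: dateWeekDays = mod(dateWeekDaysOrig - 1 + i, 5) + 1;
    return (day - 1 + i) % 5 + 1

# i = 0 leaves the encoding unchanged; i = 1 shifts Monday..Thursday up
# by one and wraps Friday around to 1.
assert [rotate_weekday(d, 0) for d in range(1, 6)] == [1, 2, 3, 4, 5]
assert [rotate_weekday(d, 1) for d in range(1, 6)] == [2, 3, 4, 5, 1]
```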

For the **i = 0 case, the mod() function line does nothing. That is the case we used in our previous studies.**

For i = 1 case, the encoding is as follows:

Monday: 2

Tuesday: 3

Wednesday: 4

Thursday: 5

Friday: 1

As a sanity check on the backtest, we mirrored the input as well. This should have the same performance as the original version:

Monday: 5

Tuesday: 4

Wednesday: 3

Thursday: 2

Friday: 1

Our backtest produced the following measurements.

**This measurement shows that by far the best is the Modulus = 0 case, namely the encoding that we used in our previous studies.**

Why does it happen?

We approximate the daily %gain distribution by a nonlinear smooth function. This function tries to smooth out the big differences between the days. For example, if Friday is a huge down day, but both of its neighbours (Monday, Thursday) are huge up days, and **if we pick an encoding in which Friday is somewhere in the middle of the range (Friday = 2, 3 or 4), the huge down feature of Friday will be smoothed out by its huge up neighbours**.

The situation is completely different if we allow Friday to be at the edge (Friday = 1 or 5). In that case, the Friday down-ness is affected by only 1 up neighbour (not 2); that effectively lets Friday be represented as a negative day.

Let’s analyze our backtested results. Why these differences? For example, in the Modulus = 0 case (original), the ratio of upForecast/downForecast for Fridays is only 44% (only 44% of the Fridays are predicted as up days). However, this is the case when Friday is at the edge of the range (Friday = 5).

In the Modulus = 1 case, Friday = 1, but let us look at Modulus = 2. In that case Friday = 2, so Friday is in the middle, and the ratio of upForecast/downForecast for Fridays is 54%. And we know from experience that Fridays are usually bad. This is because, as Friday moved into the middle of the range, the neighbouring days elevate its approximated %gain.

As an example, see these two different encodings of exactly the same data and an estimated function approximation (illustration only) of each.

We plot the 10 years aggregate daily %gain for the case when **Friday is encoded as 5.**

And when **Friday is encoded as 2.**

We conclude that **the encoding mechanism matters very much, and previously we were very lucky** with our encoding mechanism. It was the most sensible encoding: we assumed that the largest discontinuities in the market happen at the weekend, so the **most sensible way is to put the discontinuity between Friday and Monday**, so Friday and Monday are separated. In the future, we will stick to our previous encoding mechanism. We learned our lesson today. **Emphasizing that the encoding is also a parameter of the algorithm**, we introduce two additional variables in the code:

`encodingType = 1dimension/5dimension;`

`encodingModulus = 0; % as default`

Two notes:

– a GA (Genetic Algorithm) method can synthesize the optimal parameters in the future, so it is important that the GA will be able to tweak the input representation.

– the **MLP is very sensitive to the input encoding**, but other algorithms, like a k-NN algorithm or the previously introduced **GRNN with 0.1 spread, should be insensitive** to it. That is a strength of the GRNN we shouldn’t forget in the future.

]]>

The good thing about the GRNN is that there are fewer parameters to tinker with.

– **GRNN is not random. There is no random initial weight initialization.** If we repeat the experiment a second time, we get the same result.

– **the only parameter to tinker with is the ‘spread’**, in contrast to the MLP, where the parameters are the number of hidden layers, the number of neurons per layer, the learning factor, the number of max. epochs, the validation/train/test sample proportions, etc.

– **GRNN is not sensitive to output scaling.** If we multiply the output by 100, we get the same result. (As we saw in our previous posts, this is not true for the MLP.)

More about it for example in this page

http://www.dtreg.com/pnn.htm

1. Changing the ‘spread’ parameter.

For our first test day, the GRNN has to learn this function:

The GRNN surface becomes this for the different spreads of 0.1/0.5/1.0/2.0. The spread = 1.0 is the default behaviour.
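A GRNN is, in spirit, Nadaraya-Watson kernel regression: the forecast is a Gaussian-kernel weighted average of the training targets. A minimal Python sketch, using the spread directly as the kernel bandwidth (Matlab's newgrnn scales the spread somewhat differently, so this is an approximation for illustration; the sample data are hypothetical):

```python
import math

def grnn_predict(x_train, y_train, x, spread=1.0):
    # Deterministic: no random weight initialization, and 'spread' is the
    # only parameter. Each training sample votes with a Gaussian weight
    # that decays with its distance from the query point x.
    weights = [math.exp(-((x - xi) ** 2) / (2.0 * spread ** 2)) for xi in x_train]
    return sum(w * yi for w, yi in zip(weights, y_train)) / sum(weights)

x_train = [1, 2, 3, 4, 5]               # weekday codes (Mon..Fri)
y_train = [0.2, 0.1, -0.1, 0.3, -0.4]   # hypothetical average %gains

# Small spread -> behaves like a nearest-neighbour (k-NN-like) lookup;
# large spread -> smooths toward the global average of the targets.
print(grnn_predict(x_train, y_train, 5, spread=0.1))   # close to -0.4
print(grnn_predict(x_train, y_train, 5, spread=10.0))  # close to the mean, 0.02
```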

2. Back tests

Series 1: test outlier threshold sensitivity:

no outlier limit: D_stat: 51.86%, projectedCAGR: -0.58%, TR: -34.89%

outliers = 4%: D_stat: 52.48%, projectedCAGR: 4.26%, TR: 7.12%

outliers = 3%: D_stat: 52.45%, projectedCAGR: 9.10%, TR: 72.29%

outliers = 2%: D_stat: 52.00%, projectedCAGR: 9.99%, TR: 87.47%

Some TR% charts:

spread = 1, outliers = 4% (TR: 7.12%)

spread = 1, outliers = 3%: (TR: 72.29%)

Series 2: test spread sensitivity

For the outlierFixlimit = 3% case, we tested the following spreads:

Spread = 2.0: D_stat: 51.71%, projectedCAGR: 1.30%, TR: -20.77%

Spread = 1.0: D_stat: 52.45%, projectedCAGR: 9.10%, TR: 72.29%

Spread = 0.5: D_stat: 50.72%, projectedCAGR: 1.15%, TR: -22.03%

Backtest results:

– GRNN is not sensitive to target normalization (multiplying the target gives the same performance)

– **spread = 1 is the best.** (from spreads = 0.5/1.0/2.0)

– sign(nnTargetNormalized) is not bad, but not excellent, so we ignore it. (D_stat: 52.30%, projectedCAGR: 3.77%, TR: 1.88%)

– outlierFixlimit = 3% results in better TR (70%) than the 4% limit, but outlierFixlimit = 4% results in (marginally) better D_stat. **We will choose outlierFixlimit = 3% in the future ensemble.**

– not excluding outliers gave bad results
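The outlier handling can be sketched as follows. The variable names outlierFixlimit and nnTargetWithoutOutlier suggest outliers are ‘fixed’ (clipped) at the threshold rather than dropped; that is our reading of the code, not a confirmed detail, so treat this as an assumption:

```python
def fix_outliers(returns, limit=0.03):
    # Clip daily %gains to +/- limit (e.g. 3%). Whether the original Matlab
    # code clips or drops these samples is an assumption here.
    return [max(-limit, min(limit, r)) for r in returns]

print(fix_outliers([0.01, -0.05, 0.08, -0.002]))  # [0.01, -0.03, 0.03, -0.002]
```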

3.

Conclusion:

**The GRNN is not better in our forecasting problem than the traditional MLP.** It may be superior in other problems.

The TR% and CAGR% measurements were not really good (they can be unlucky), but **we value the D_stat% the most, and that is more than 52%** directional accuracy. So, we think **it is a valuable asset in the ANN arsenal**. It is quite similar to the k-NN algorithm for small spreads like 0.1.

The GRNN approach and its parameter tinkering can be very unlucky (by chance), but because **it is deterministic**, if we run it 20 times, it can be unlucky all 20 times. So we cannot attack this luck, or lack of it, by running 20 different GRNN experiments. Therefore, because we cannot trust the GRNN approach, we cannot advise using it as a standalone algorithm. But because it is a different approach than the MLP, in the future we will add this kind of predictor **as a member of the ensemble**, with a small weight.

]]>

nnTarget = ‘nextDay%gain’*Multiplier;

with various Multipliers.

**In the other half of the experiments, we used different** scaling mechanisms we call **squashing** methods.

**Our theory is that it is best to squash the target range into -1..+1.** But there are different techniques to achieve this.

For example, we can center it around zero (zero is mapped to zero), or we can center it around the average (average is mapped to zero).

Or we can choose not to center at all when squashing to the -1..+1 range.

Note the code from the Matlab file that defines the Squashing functions from -1 to -7:

```
targetMin = min(nnTargetWithoutOutlier);
targetMax = max(nnTargetWithoutOutlier);
targetMaxAbs = max(abs(targetMin), abs(targetMax));
targetMean = mean(nnTargetWithoutOutlier);
targetMeanMaxAbs = max(abs(min(nnTargetWithoutOutlier-targetMean)), abs(max(nnTargetWithoutOutlier-targetMean)));
```


```
if (p_targetMultiplier == -1)
    nnTargetNormalized = ((nnTargetWithoutOutlier - targetMin)/(targetMax - targetMin) - 0.5) * 2 * 0.9; % to -0.9..+0.9
elseif (p_targetMultiplier == -2)
    nnTargetNormalized = ((nnTargetWithoutOutlier - targetMin)/(targetMax - targetMin) - 0.5) * 2; % to -1..+1
    nnTargetNormalized = tansig(nnTargetNormalized*4); % squashes the bottom 20% (those under -0.4) into the bottom 5% (those under -0.9)
elseif (p_targetMultiplier == -3) % centering around zero but not around the mean
    nnTargetNormalized = (nnTargetWithoutOutlier + targetMaxAbs)/targetMaxAbs - 1; % to -1..+1
elseif (p_targetMultiplier == -4) % centering around zero but not around the mean
    nnTargetNormalized = (nnTargetWithoutOutlier + targetMaxAbs)/targetMaxAbs - 1; % to -1..+1
    nnTargetNormalized = tansig(nnTargetNormalized*4); % squashes the bottom 20% (those under -0.4) into the bottom 5% (those under -0.9)
elseif (p_targetMultiplier == -5) % centering around the mean
    nnTargetNormalized = (nnTargetWithoutOutlier - targetMean + targetMeanMaxAbs)/targetMeanMaxAbs - 1; % to -1..+1
elseif (p_targetMultiplier == -6) % centering around the mean
    nnTargetNormalized = (nnTargetWithoutOutlier - targetMean + targetMeanMaxAbs)/targetMeanMaxAbs - 1; % to -1..+1
    nnTargetNormalized = tansig(nnTargetNormalized*4); % squashes as above
elseif (p_targetMultiplier == -7) % as Squashing-3, but scaled to -100..+100
    nnTargetNormalized = ((nnTargetWithoutOutlier + targetMaxAbs)/targetMaxAbs - 1)*100; % to -100..+100
end
```
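To make the arithmetic concrete, here is a Python sketch of two of these variants (the names `squash_1` and `squash_3` are ours; the original code is Matlab):

```python
def squash_1(x, lo, hi):
    # Squashing-1: linear map of [lo, hi] onto -0.9..+0.9 (zero is
    # generally NOT mapped to zero).
    return ((x - lo) / (hi - lo) - 0.5) * 2 * 0.9

def squash_3(x, max_abs):
    # Squashing-3: (x + maxAbs)/maxAbs - 1, which algebraically reduces to
    # x / max_abs -- an adaptive multiplier of 1/maxAbs that maps zero to zero.
    return (x + max_abs) / max_abs - 1

# Squashing-3 is just division by the largest absolute sample: with
# targetMaxAbs = 0.037 the effective multiplier is 1/0.037, roughly 27.
assert abs(squash_3(0.01, 0.037) - 0.01 / 0.037) < 1e-12
print(round(1 / 0.037))  # 27
```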

Conclusions:

1. As our outlier threshold is 4%, **we expected the multiplier = 25 to give the best result**, because that is the one that best maps the output range to -1..+1.

However, **it seems that the multiplier 100 is the best.** This can be **pure luck.** Note **Experiment 3** in the Multiplier 100 case. That one **achieved more than 500% TR%.** **That is probably only pure luck.** However, it contributes much to the average. So the fact that multiplier 100 is the winner may be pure randomness.

2. There is no question that **the original Multiplier = 1 case is not optimal.** This study revealed that we can gain more with either multiplier: 25 or 100.

3. From the squashing function experiments, we conclude that using **the tansig() as a second preprocessing step always worsens the result.**

Our **idea came** from here:

”

Target normalization

Why target normalization? Because building a model between the data elements and their associated target is made easier when the set of values to predict is rather compact. So **when the distribution of the target variable is skewed, that is, there are many lower values and a few higher values** (e.g. the distribution of income; income is non-negative, most people earn around the average, and few people make bigger money), it is **preferable to transform the variable to a normal one by computing its logarithm. Then the distribution becomes more even.**

”

The tansig squashing function would make the distribution more even. We have a few outliers at the edges, and the bulk of the samples are crowded near the mean, near zero. For example, tansig(nnTargetNormalized*4) squashes the bottom 20%, those under -0.4, into the bottom 5%, those under -0.9. **It seemed a good idea, but our measurements don’t confirm it to be one.** For example, the Squashing-2 function was so bad that we tried to debug it.
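Matlab’s tansig() is the hyperbolic tangent, so the quoted squash factor can be checked numerically (Python’s math.tanh stands in for tansig here):

```python
import math

# tansig(x) in Matlab is tanh(x): with the *4 gain, a normalized value of
# -0.4 maps to tanh(-1.6), about -0.92. So the bottom 20% of the -1..+1
# range (below -0.4) is pushed into the bottom 5% (below -0.9).
print(round(math.tanh(-0.4 * 4), 4))  # -0.9217
```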

”

Our note:

the first day forecast is negative, even though the input (= day 4) is the most bullish day of the week, with a hugely positive %return.

It is because the minValue is too small: -3.7 (the maxValue is +3.0). The ANN forecasts something in the middle (an average); this is then perturbed by a -3.7 subtraction. That is huge.

Imagine a situation where our range is -10..+2. If we center it around zero, our forecast will be near zero; then we subtract -10 from it when we de-normalize.

”

4. From the squashing function experiments, it turns out that re-centering the distribution around the mean (instead of around zero), **with rescaling after re-centering**, **is not a good idea.**

5.

**In the future, we use**

nnTargetNormalized = nnTargetWithoutOutlier*25;

or

nnTargetNormalized = nnTargetWithoutOutlier*100;

or the

**Squashing-3:**

nnTargetNormalized = (nnTargetWithoutOutlier + targetMaxAbs)/targetMaxAbs - 1; % to -1..+1

Interestingly, this Squashing-3 looks like a recentering function, but in fact it is not.

(nnTargetWithoutOutlier + targetMaxAbs)/targetMaxAbs - 1;

**equals
nnTargetWithoutOutlier/targetMaxAbs;**

For the first 200 days of training samples, targetMaxAbs = 0.037, so the multiplier was 1/0.037 = 27. So it is usually a little higher than 25.

**We prefer Squashing-3, because that is adaptive to the range. It is not fixed.** For example, if targetMaxAbs is 0.01 only (not 0.04), it can stretch the range more.

**In case targetMaxAbs = 0.02, the multiplier becomes 50.**

Note also that Squashing-3 has a very good D_stat, even if its TR is not the best. This can be due to randomness. Of the 3 performance measurements, I prefer D_stat. It tells the most about prediction power. A good TR% and CAGR% can be the result of only a few outlier days.

]]>

`nnTarget = sign(nextDay%gain); `

However, as we proved in the previous post for a mathematical case, the performance of the ANN depends very much on whether the target is scaled into a proper range or not. We experiment with it. Our target is therefore:

`nnTarget = sign(nextDay%gain)*Multiplier; `

where the multiplier is from 0.0001 to 10,000. See the image for details.
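A Python sketch of this target construction (the gain values are hypothetical; the multiplier sweep is the one described above):

```python
def sign_target(next_day_gains, multiplier):
    # Target = sign(nextDay%gain) * Multiplier: only the direction of the
    # next-day move survives, scaled into a range set by the multiplier.
    def sign(x):
        return (x > 0) - (x < 0)
    return [sign(g) * multiplier for g in next_day_gains]

gains = [0.012, -0.004, 0.0, 0.03]
for m in [0.0001, 1, 25, 10000]:   # a few points of the swept range
    print(m, sign_target(gains, m))
# e.g. multiplier 1 gives [1, -1, 0, 1]
```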

As usual, we emphasize that the ANN is non-deterministic. Therefore we perform 4 separate experiments per multiplier.

Parameters:

```
nNeurons = 2;
nEnsembleMembers = 10;
nEpoch = 4;
lookbackWindowSize = 200;
```

We don’t remove outliers, because the %gains are replaced by their sign only.

However, in theory, there would be a strong case for outlier removal. Our theory was that when the market swings more than 4% in a day, something serious is going on that cannot be explained by our model; our model is that the %gain depends only on the day of the week. In such an environment, we shouldn’t use the data for training.

In hindsight, we should have removed outliers. But we have already made the experiments, and they took 2 nights to run, so we will not repeat them.

Let’s see the performance measurements:

We are pleased with the result. This is what we expected. The prediction power of the ANN is best if the multiplier is +1, that is, when the range of the target is between -1 and +1.

Note that even if Matlab in theory automatically scales our target to -1..+1 with its mapminmax, we should scale it ourselves and not rely on the automatic Matlab mechanism, which is buggy (see our previous 2 posts).

Conclusion: **In our previous studies, we used the target as the %gain of the next day.** For example, **a +1% gain was represented as 0.01.** So, if we consider 4% outlier removal, our effective range was -0.04..+0.04. According to the experiments of this post, **this is not optimal.** We should apply a multiplier of about x25 to the %gain target to **scale it up into the -1..+1 range.** We will do this experiment in the next post.

]]>