1. Visualize your data

Our quest in supervised learning is to find a function f(x) that is likely to have generated your training set. The training set consists of inputs X with output labels Y attached to them. One thing you learn quickly is the importance of analysing your data, and there are some problems with that.

On the one hand, there is dimensionality: it is very common to have multidimensional data (10, 20+ dimensions), but we — Earth people — are very poor at visualizing anything that has more than 3 dimensions.

On the other hand, there is noise: if the data contains a lot of noise, it is difficult to see any meaningful structure in it.

Luckily, in our experiments we try to minimize dimensional complexity, mostly to mitigate the problem of overfitting.

We showed (2 posts ago) that 2 dimensional time series prediction was better for VXX than the 1 dimensional one.

Therefore, we continue with the 2 dimensional case.

Our x1 dimension (horizontal axis) is the %change today, x2 (vertical axis) is %change yesterday.

Having 3 years of historical data, let’s look at it:


This plot shows whether tomorrow's %change is positive (green + sign) or negative (red o sign).
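For concreteness, this is how the two input features and the color label of such a plot could be built: a hypothetical Python sketch on a synthetic price series standing in for the 3 years of VXX closes (the plotting call is only hinted at in a comment).

```python
import numpy as np

# Synthetic stand-in for ~3 years of VXX closes (made-up data, for illustration).
rng = np.random.default_rng(0)
close = 30 * np.exp(np.cumsum(rng.normal(0, 0.04, 800)))

pct = np.diff(close) / close[:-1] * 100   # daily %change series
x1 = pct[1:-1]                            # %change today (horizontal axis)
x2 = pct[:-2]                             # %change yesterday (vertical axis)
y = pct[2:]                               # tomorrow's %change (what we color by)

up, down = y > 0, y <= 0                  # green '+' vs red 'o' points
# e.g. plt.scatter(x1[up], x2[up], marker='+', c='g')
#      plt.scatter(x1[down], x2[down], marker='o', c='r')
```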

Do you see any meaningful structure?

Not easy, because of the large amount of noise (and unfortunately it is not white noise).

Some things can be concluded though:

– there seem to be more red dots overall (expected: more VIX down days).

– green dots (VXX up days) are more probable when either today or yesterday was strongly up (expected: volatility breeds more volatility).

– down VXX days are probable when the market is peaceful (small up and down moves in the last 2 days).

But overall, the plot looks so random that it is difficult to imagine how we can separate the two groups: the positive days from the negative ones.

Obviously there is no linear separator.


This plot is useful if we do classification into 2 groups (Up, Down), but what if we would like to do classification into 3 groups:

Bullish days, Bearish days, Cash days. Cash would mean that the %change was mild: between -1% and +1%.

Let’s make a plot. The black diamonds represent those Neutral days.
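The 3-group labeling can be written down directly. A small sketch, where the ±1% band is the definition above and the y values (tomorrow %changes) are made-up numbers:

```python
import numpy as np

# Hypothetical tomorrow %change values, just for illustration.
y = np.array([2.3, -0.4, 0.9, -5.1, 1.0, 7.8, -1.2])

# Bullish above +1%, Bearish below -1%, otherwise Cash (the black diamonds).
labels = np.where(y > 1.0, 'Bullish',
         np.where(y < -1.0, 'Bearish', 'Cash'))
```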


More or less the same can be said. Some extra conclusions can be made, such as:

– there are no black dots (Neutral days) if today's or yesterday's %change is extreme (so Neutral days usually happen in a less volatile regime).

– if the VXX %gain was higher than +20% today (2 cases), it was followed by another VXX increase.

– when the VXX %gain was negative today and negative yesterday, it is likely to be negative tomorrow. (The VXX has a daily follow-through, a momentum.)



2. Visualize your final fitted prediction model (f(x))

Let's suppose we do the Linear Regression learning described in the previous posts.

What does the decision surface look like?

It looks something like this:

We draw the decision boundary as a black dotted line. It represents the input values where f(x) is zero; it separates the up forecasts from the down forecasts. The plot is dated 2011-10-28.

The prediction can be made manually from the plot, if we know the %change of yesterday (vertical axis) and the %change of today (horizontal axis). For example, if both are 0%, the point falls into the yellowish (upper) area, so the prediction is a positive tomorrow %change.

Note that observing f(0,0) is a good way to evaluate whether the current model is upside biased or downside biased. Because it was trained on the last 93 trading days of samples, and since August 2011 we have been in a very volatile period, it is no shock that f(0,0) is positive, so the model mostly predicts positive values. (It is positively biased.)
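Checking the model's bias at the origin is trivial once the coefficients are known. A toy sketch with made-up coefficients: for a linear model, f(0,0) is simply the intercept beta0.

```python
# Hypothetical fitted coefficients (not the actual model's values).
beta0, beta1, beta2 = 0.42, 0.15, 0.08

def f(x1, x2):
    # Linear prediction model: f(x1, x2) = beta0 + beta1*x1 + beta2*x2
    return beta0 + beta1 * x1 + beta2 * x2

bias_at_origin = f(0.0, 0.0)   # equals beta0
model_bias = 'Upside' if bias_at_origin > 0 else 'Downside'
```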

As we anticipate negative VXX changes in the forthcoming December/Xmas season, it is not advisable to start trading the strategy right now.



See the 2 previous posts about the VXX estimation. We contended that there is only 1 parameter: the number of days we look back to get training samples.

To be frank however, there are more parameters:

– whether to select the 1D or 2D case; or

– the machine learning algorithm used (Normal Equation or Gradient Descent, etc.); or

– the kind of instrument selected: VXX.

However, let’s say that those parameters are not really parameters per se.

They were determined much earlier, by some other fundamental ideas we believe in, and therefore we don't optimize them. For example, we just accept that we want to estimate daily VXX (not RUT or AAPL). There is nothing really to fine-tune in that.

Therefore those parameters are not the focus of any sensitivity analysis.

After the prologue, let’s do some sensitivity analysis on the lookback days.

Note that we had to fix the startDay of the algorithms in these backtests. Because we use a maximum 200 days lookback, the first estimate can be calculated for day 201 (in the 1D case) or day 202 (in the 2D case).

In this test, to make the competition between the different lookbackDays fair, we started all of them from day 201 or 202.

In theory, the 50 days lookback version could be started from day 51. However, that would give an extra advantage to the shorter lookbackDays (they would have a larger period to play).

We want a fair comparison, so we cannot allow that.

Note that this is the reason why, for example, the previous post (the 1D case) showed that lookbackDays = 50 was the best, achieving a 10x multiplier.

That result cannot be reproduced here, for the aforementioned reason.


Sensitivity Analysis (487 days = less than 2 years, assuming 250 trading days per year):

1. 1D case:

Let’s plot the final portfolioValue as a function of lookbackdays.
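The sensitivity loop itself can be sketched like this: a minimal, hypothetical Python reconstruction (the post used Octave) on synthetic data, with a bare-bones rolling 1D OLS backtest rather than the exact strategy.

```python
import numpy as np

rng = np.random.default_rng(1)
pct = rng.normal(0, 2.0, 500)   # synthetic stand-in for daily VXX %changes

def backtest(lookback, start_day=201):
    """Rolling 1D OLS: regress next-day %change on today's, trade its sign."""
    value = 1.0
    for t in range(start_day, len(pct) - 1):
        x = pct[t - lookback:t - 1]        # today's %change, past window
        y = pct[t - lookback + 1:t]        # the following day's %change
        b1, b0 = np.polyfit(x, y, 1)       # OLS fit of y = b0 + b1*x
        pred = b0 + b1 * pct[t]
        value *= 1 + np.sign(pred) * pct[t + 1] / 100
    return value

# All runs start from the same day (201) so different lookbacks compete fairly.
final_values = {lb: backtest(lb) for lb in (20, 50, 100, 200)}
```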

In the chart, the X axis is lookbackDays – 1, so the chart is shifted by one, but that is OK.

(click on the image, if you want to see it proper original size)

Based on that, the optimal lookback is somewhere between 30 and 60. Is it sensitive to the parameter? Yes, as usual. For example, the best parameter value gives a 7x multiplier, the worst a 1x multiplier; so we can say it is quite sensitive to the parameter.

Note the range of 2-20 training samples: that is hardly enough samples; I wouldn't consider that area useful at all, even if it shows good performance.

So, the optimal value of the parameter is somewhere between 30 and 60. One strategy (if we want to avoid parameter fine-tuning) is to just play the middle: 45.

Do you see the danger here? Someone who optimized the parameter and hasn't done any sensitivity analysis thinks that it returns 7x per 2 years and starts to play the strategy. But in real life, he may be unfortunate and get only a 1-2x return in the future (or he may be lucky and get a 14x return). The point here is that the future return should be expected to be less than what a fine-tuned-parameter backtest shows.

One idea to make it more stable:

Do different parameter runs (from 30 to 60): average their prediction; this may partially eliminate the parameter fine-tuning bias.

So, let’s define our UltimateEstimator by aggregating the decision of 30..60 lookbackdays.
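The aggregation rule is just an average of the individual forecasts. A minimal sketch, where the per-lookback predictions below are made-up numbers rather than real model outputs:

```python
import numpy as np

# Hypothetical per-lookback forecasts for one day, one per lookback in 30..60.
predictions = {lb: 0.5 - 0.01 * lb for lb in range(30, 61)}

# UltimateEstimator: average the individual predictions, trade the sign.
ultimate_prediction = np.mean(list(predictions.values()))
signal = 'Up' if ultimate_prediction > 0 else 'Down'
```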

The portfolio value curve of the UltimateEstimator gave a final PortfolioValue of 3.90:

The UltimateEstimator is between the extremes: it is better than the worst (2x multiplier), but worse than the best (7x multiplier).

However, in real life it is better to use this kind of estimator; it decreases the lucky/unlucky factor of depending on a concrete parameter selection (the parameter fine-tuning).

Also, in general the aggregated Ultimate profit curve is smoother (less likely to contain drawdowns); albeit the -50% DD is still present here, even with that DD it is smoother than the individual strategies.

2. 2D case:

Let’s plot the final portfolioValue as a function of lookbackdays:

Based on that, the optimal lookback is somewhere between 70 and 115.

Is it sensitive to the parameter? Yes, as usual.

The best parameter value gives a 7.5x multiplier, the worst a 2x multiplier.

Even with the unluckiest pick of the worst parameter, the profit was 2x (so it is not a loss). That is good.

The only losses are in the ranges of 2-10 and 35-40 days. There are not enough training samples there.

Someone, who wants to avoid parameter fine tuning bias, may choose the middle of the range: 93.

Another idea to make it more stable: the same UltimateEstimator. Do different parameter runs (from 75 to 110) and average their predictions; this may partially eliminate the parameter fine-tuning bias.

Aggregating the decisions of the 75..110 lookbackDays, the resulting UltimateEstimator gave a PortfolioValue of 4.24.

That is between the extremes: it is better than the worst (2x multiplier), but worse than the best (7.5x multiplier).

3. Conclusion:

Note that with the Ultimate(75-110) version, we eliminated the fixLookbackDays parameter, but we introduced 2 new parameters (instead of 1): 75 and 110. :) So we again have some parameter bias; albeit note that we did want to optimize the fixLookbackDays parameter, while we haven't 'really' optimized the range parameters 75 and 110.

The important point is that although we introduced 2 new parameters, the final result is not really sensitive to changing them. Changing 75 to 76 hardly changes anything, while in the fixLookbackDays case, changing that parameter from 93 to 94 had a more significant effect on the final outcome.


This is the key message of this post: we cannot eliminate parameters, but what we can do is ensure that if we have parameters, the final outcome is not significantly sensitive to the parameter values used.

Use 1D or 2D?

Comparing the 200 days long 1D vs. 2D sensitivity chart (not the Ultimate Portfolio Value chart), we prefer the 2D inputs case.

The maximum achieved is similar to the 1D case (the max 7x multiplier was achieved on about 3 occasions in the 2D case, but only once in the 1D case).

The minimum is better in the 2D case. In the 1D case, if we pick the wrong parameter, we can end up with a profit of only 1x.

However, in the 2D case, even if we pick the wrong parameter, we still get a profit of 2x.

Comparing the range-based Ultimate Portfolio Value charts, the 2D case is better too, for example because of the smaller DD (see the big DD we had in the last 3 weeks in the 1D case). The 2D case equity curve looks smoother too.

As a continuation of the previous post, let's study Linear Regression in which we have not only 1, but 2 variables.

Let's assume we want to forecast the next day %change of VXX as the output variable, based on the today %change of the VXX and the yesterday %change.

The linear equation would look like this:

Y = beta0 + beta1*X1  + beta2*X2


X1 = yesterday %change,

X2 = today %change,

Y = next day %change.


The unknowns are beta0, beta1, beta2. We want to determine (learn) them.

Let's suppose we learn them by looking back in history by D days, where D can be 20, 50, 100, or 200 days.
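A sketch of learning the three betas from the last D days by least squares, on synthetic data standing in for the VXX %changes, using numpy's solver rather than the Octave code used in the post:

```python
import numpy as np

rng = np.random.default_rng(2)
pct = rng.normal(0, 2.0, 300)   # synthetic stand-in for VXX daily %changes
D = 100                         # lookback days

x1 = pct[-D - 1:-2]             # yesterday %change
x2 = pct[-D:-1]                 # today %change
y = pct[-D + 1:]                # next day %change

# Design matrix with an intercept column; beta = [beta0, beta1, beta2].
X = np.column_stack([np.ones(len(y)), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Forecast tomorrow from yesterday's and today's %change.
next_day_forecast = beta @ np.array([1.0, pct[-2], pct[-1]])
```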

The corresponding equity curve charts:



– not much. They are all similar. That is good, because they are consistent.

– the max DD is from 3.5 to 1.5: -57% (scary).

– in the 1D case lookback50 was the best; in the 2D case lookback20 is the best (probably just randomness).

– they are similar to the 1D charts. So it seems that introducing another variable (the yesterday %change) doesn't give more useful information for the prediction. It gives more information, but that information is not useful for extra profit. This is typical in machine learning: if we introduce a completely random extra variable (as a new dimension; a non-dependent variable), it can even destroy the prediction power of the simpler case.

– based on these charts, we would stick with the simpler 1D Linear Regression rather than the 2D version. It may have a little better profit potential.

An advantage of attending a university course is that it broadens the knowledge someone has. But even more important than that, it adds new usable tools to the repertoire that we keep in our toolbox.

The Stanford University Machine Learning course mentioned in the previous blog post is not only theoretical, but very practical indeed. I would say it is even more practical than theoretical, which is bad news for theoretical mathematicians, but good news for applied scientists and programmers. The course forces students to write homework programs every week. The suggested language is Octave, which is a free, open-source version of Matlab. One of the topics last week was Multivariate Linear Regression and two approaches to its solution: the Normal Equation and Gradient Descent.

In the context of this blog, we have pursued a Neural Network based solution to the problem, but for this post, let's just solve the matrix equations.

In this post, let’s assume we want to forecast the next day %change of VXX as an output variable, based on the today %change of the VXX.

The linear equation would look like this.

Y = beta0 + beta1*X ,


X = today %change,

Y = next day %change.

The linear regression finds the coefficients of the line that best fits the data, like here:

I usually say that from the sample points we regress back the line (we determine it, we guess it) that is very likely to have generated those sample points.

The unknowns are beta0 and beta1. We want to determine (learn) them.

Let's suppose we learn them by looking back in history by D days, where D can be 20, 50, 100, or 200 days.

beta0, beta1 = ?

How to solve it?

The solution is the OLS estimator, where OLS stands for Ordinary Least Squares.

In a nutshell, you have to evaluate this equation; using Octave/Matlab matrix operations, it is pretty straightforward.

For the geeks, see the details here: http://en.wikipedia.org/wiki/Linear_regression

I would like to stop here a little bit. Just looking at the equation: Beta = (X’X)^-1 * X’ y.

Why is that the equation? The proof is pretty straightforward.

Consider the original equation:

X*Beta = y.

Try to determine Beta.

We cannot multiply both sides by X^-1. Why? Because X is not a square matrix, and a non-square matrix has no inverse.

So, multiply both sides by X’ first (X transpose) to have

(X’X )*Beta= X’y

Now (X’X) is a square matrix, so we can have an inverse. Let’s multiply both sides by this inverse.

(X’X)^-1 *(X’X) *Beta= (X’X)^-1 *X’y

that is equivalent to:

Beta= (X’X)^-1 *X’y
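The derivation can be verified numerically: the normal-equation formula should agree with a library least-squares solver on the same data. A Python sketch on made-up linear data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 50)
y = 0.3 + 0.7 * x + rng.normal(0, 0.1, 50)   # made-up data: y = 0.3 + 0.7x + noise

X = np.column_stack([np.ones(50), x])         # add an intercept column

# Normal equation: Beta = (X'X)^-1 X'y
beta_normal_eq = np.linalg.inv(X.T @ X) @ X.T @ y

# Same fit via a least-squares solver, for comparison.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```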



The advantage of the OLS method compared to the Neural Network or Gradient Descent is that it is:

– deterministic. All the Neural Network solutions are randomized and therefore require a lot of random runs for backtesting. In contrast, OLS requires only 1 backtest.

– easy to compute (it takes half a second).

– free of sample normalization: OLS doesn't require normalization of the samples.

– a method with only 1 parameter: the lookback days. That is in contrast to the Neural Network based solution, which has other parameters too: lookbackDays, outlier threshold, numberOfRandomRuns, weighting of the decision of the neural network, normalization parameters (SD or min-max normalization? range or mean normalization?).

– Having only 1 parameter significantly reduces the parameter fine-tuning bias that distorts the results of many backtests.

– The disadvantage of OLS is that it can capture only linear relations between the inputs and the output, in contrast to a Neural Network, which can describe any continuous function.

In our concrete example, we took the VXX close prices from its inception, which was around the beginning of 2009.

We run the algorithm with lookback days = 20, 50, 100, 200.

We also plot the SMA70 of the strategy (as a means to use some 'playing the equity curve' technique).

The return curves of the strategy look like this:


What can we realistically say? The charts are similar.

– For the 200 days lookback, we can see it went from 1 to 3 in about 2 years. That is roughly 70% CAGR. Not bad.

– However, the maxDD was -50% (summer 2010), which is pretty high.

– The best performer was the 50 lookback days version (probably that is what should be played in real life). It multiplied the initial deposit by 10 during 2.5 years. That is about 150% CAGR, but we consider this performance an outlier. Also note how volatile it was in August 2011 (albeit volatile in the favoured direction).

– Someone could start the strategy when the profit curve is above the SMA70, as it is now (as a means of money management).

– Someone could start the strategy when the profit curve is higher than the previous highest high (maybe that is safer: less whipsaw).
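The 'playing the equity curve' filter mentioned above can be sketched as follows (synthetic equity series; a 70-day simple moving average compared against the curve itself):

```python
import numpy as np

# Synthetic equity curve standing in for the strategy's portfolio value.
rng = np.random.default_rng(4)
equity = np.cumprod(1 + rng.normal(0.002, 0.02, 300))

# SMA70 of the equity curve ('valid' mode: defined from day 70 onward).
window = 70
sma70 = np.convolve(equity, np.ones(window) / window, mode='valid')

# Trade only while the equity curve is above its own SMA70.
active = equity[window - 1:] > sma70
```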

On the other hand, it is worth mentioning that these are only theoretical results. Real life can be harsher, sometimes because of the parameter fine-tuning bias, sometimes because real-life order execution is not perfect (ask-bid spread, commissions, short sale orders not executed because there were not enough shares available to borrow, etc.).

In future posts, we will examine the 2D input case, and we will also do some sensitivity analysis on the 'lookbackDays' variable.

I would like to draw your attention to a unique Stanford University initiative. This season, for the first time ever, you can participate in a unique research project that intends to change the future of education.

Stanford University has announced that it will make 2 courses available online worldwide:

-Introduction to Artificial Intelligence

-Machine Learning

An exceptional thing about these courses (compared to other online courses like the MIT OpenCourseWare) is that you are not simply viewing offline videos later, whenever you have free time; you do homework, assignments, tests, and exams as you would if you were really a Stanford University student. You even get a certificate of completion and a certificate of your results, comparing them to the rest of the 'world'. 🙂

The writer of this blog is very pleased with this announcement because of:

– the firm belief that the 'teacher' as a job will be mostly outdated in the next century. I reckon that in 30 years we will need only 10% of the teachers we have now.

– I welcome the integration of universities/courses. This is the most efficient way to bring the best tutors to the widest audience. I would rather see only the best 500 universities in the world survive than have 5000 (poor) universities scattered all around. Having 5000 universities is a very inefficient/costly way of distributing knowledge.

– I welcome the idea that knowledge is public: available to anyone, from the skyscrapers of New York to the slums of India. No means testing, no university fees. Everyone is equal, and it is possible for everyone (with enough diligence) to achieve a university degree.

Topics include:

supervised learning (generative/discriminative learning, parametric/non-parametric learning, neural networks, support vector machines); unsupervised learning (clustering, dimensionality reduction, kernel methods); learning theory (bias/variance tradeoffs; VC theory; large margins); reinforcement learning and adaptive control.


Probability and linear algebra are requirements, of course.

About 140K students applied for the Artificial Intelligence course and about 60K for the more advanced Machine Learning course.

Note that there are 2 kinds of virtual students. One kind follows only the videos (the spectators), but does no homework or tests; they receive no certification.

Currently, it is not published what percentage of the students are in the spectator club. And the truth is that you can change your status later during the term: if you find that you don't have enough time, you can switch to being a spectator anytime.

The homepages:



Note the time requirement though:

Stanford advises spending 10 hours per week on one course. That means 20 hours per week for the two courses. Those who don't have enough spare time can consider taking only the simpler Artificial Intelligence course, albeit taking into account that if the Machine Learning course doesn't start next year, you have missed your chance.

We encourage everybody who has some time to take part in these excellent initiatives, become a student of Stanford University, and be (a little) proud to participate now in something that is the future of university-level education.

The blog has been neglected in the last 1-2 months. Besides the summer beach holiday, the other reason was that we took the time and prepared the ANN strategy to be played live.
This event may be regarded as a very important milestone in the life of this blog. The sole purpose of the research (recorded in this blog) was to develop an algorithm based on Machine Learning (preferably of the Artificial Neural Network kind) that can be played live on the stock exchange.
We are happy to announce that we have reached that milestone.

Actually, 2 versions are played now.

The Aggressive version.

This one is the ANN(T-0, T-1). There are no ensemble groups; it is a single ANN with 2 inputs.
Its inputs are the today and yesterday price changes of the RUT index.
We don't use the day-of-the-week input here; it only drags down the performance.

It is a risk taker: it never goes to cash.
You can find reports about its performance numbers somewhere in the previous blog articles. The unleveraged version did about 35% gCAGR with a 40% drawdown in the past. We know nothing about the future.

Conservative version.
This one actually has 4 ensemble groups, among them:
-ANN(T-0) // the same as the first group; mostly for stability
-ANN(day of the week for T-0)

All 4 groups have to agree. They have to be in consensus.
There are 3 possible scenarios:
– All 4 groups vote +1 for next day: consensus is Up
– All 4 groups vote -1 for next day: consensus is Down
– otherwise: consensus is cash

So, if this strategy is not convinced, not confident enough about the next day's direction, it goes to cash. That is the conservative approach.
This conservative version had about 18% CAGR and a 30% drawdown in the backtests.

Past backtests are based on playing the unleveraged RUT index. However, it is not possible to play the RUT index directly in real life: either we play the futures or ETFs. We picked the double ETFs (ultra and ultra-short) and we play those.

That is how these strategies performed in the last month:

Interestingly, the Aggressive version is the laggard, albeit we expected it to have a better performance than the Conservative one.
It has to be noted that we were lucky with the timing: had we started the portfolio 2 weeks earlier, the profit wouldn't be as good as it is now.

Obviously, the period is too short to be happy about it or draw serious conclusions. So, let’s wait and follow them.

Real live trading will be extended with another trick. We plan to use some kind of money management in case the strategy turns sour, for example the 'playing the equity curve' technique.
It hasn't been developed yet. It should improve future drawdowns.

Instead of backtesting strategies in this post, let’s turn our attention to a branch of mathematical science called statistics.

1. The Book


I can wholeheartedly suggest a book called
Cartoon Guide to Statistics .

There is another similar piece of work,
The Manga Guide to Statistics
I took a quick look, but its story is built around the romantic relationship of a young Japanese girl.
That is definitely not how I would like to see one of the most difficult and serious parts of mathematical science presented.
Never mind; manga fans may be interested in it.

The Cartoon Guide to Statistics is a very well organized book. It touches almost all parts of statistics, but of course it cannot go very deeply into the topics. It is funny and amusing; it is a joy to read.
I contend that an average secondary school student should have no problem with the difficulty, albeit some reviewers (interestingly, mostly UK reviewers) complained that the book is too complex and they couldn't follow it. Trust me; it is a very easy book.
Strongly recommended for all math students. (Hopefully in the first year at the university)

However, reading the book that so nicely summarizes what statistical science is left me with an uneasy feeling.
I hope I don't offend anybody, but to me (a very personal opinion), the whole framework of statistical tools looks like a mathematical toy only. In real life it doesn't work, it cannot be used, it cannot be trusted. We can play with it, as we play with toys, but what for?
It is a nice math framework, but real life doesn't play by the rules that are defined in statistics.

Instead of talking vaguely about why I despise it, I will try to give some concrete examples.

One of the tools that I think is a joke is called re-sampling:
a technique that treats the sample as if it were the population. It has other names like randomization, jackknife, bootstrapping. To me, it looks like a funny, but surely non-working tool in real life. Yes, it is true that you can prove mathematically that it works, but in real life, would you use it?
Let's suppose you watched how a stock was traded for a week. One week is clearly not enough to draw statistical (or any other) conclusions that you would trust enough to risk your own money.
Now, clever mathematicians invent a tool called resampling (or bootstrapping, whatever).
Based on that 1-week observation, which has only 5 samples, they generate another 500 samples.
Now you have 500 samples. So you can make reliable statistical conclusions!… or not.
It can be proven mathematically that your generated 500 samples are unbiased estimates of the other, non-observed samples. However, those 5 initial samples are based only on that 1 observed week. Maybe that week was the Xmas week. Even if you generate 500 samples from it, would you risk your money on non-Xmas weeks based on the 500 artificially generated samples?
However, as the unbiased nature can be proven mathematically, you have a false sense of confidence in your method.
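The resampling recipe being criticized here, written out as code (the five daily returns are made-up numbers):

```python
import random

random.seed(0)
# The 5 daily %changes of the single observed week (hypothetical values).
observed = [0.4, -1.2, 0.3, 2.1, -0.5]

# Bootstrap: draw 500 new observations with replacement from those 5.
bootstrap_samples = [random.choice(observed) for _ in range(500)]
# Every resampled value is one of the original 5 -- no new information was created.
```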

Another suspicion: a typical thing in statistics is that they start the chapter with 'let's suppose that the results of each trial are independent'.
Where in life can you find something that you measure many times and the measurements are independent? Yes, maybe rolling dice (maybe even that is not true), but what about the stock market? Can you assume that daily stock returns are independent of each other?
You can assume it; at your own peril! Because it is not true.
Then, if it is clear that statistics can work only in artificial, nonexistent scenarios, why should we use it at all?

Another simplification: let's say we compile statistics about the repair cost of crashed cars.
We take a sample of 10 crashed cars, for example. Can we assume that the repair cost samples are independent, normally distributed samples of a random variable in real life?
Absolutely not. Why would they be independent? Assume there was a heat wave in the country last week. Half of the car crashes are related to this event. Cars without air conditioning (more tired drivers) are over-represented. Are the repair costs of air-conditioned and non-air-conditioned cars different? Yes, non-air-conditioned cars are usually cheaper, less costly to mend. So these 10 sample cars may not be independent.
Is the other assumption, that the repair cost is normally distributed, true? Absolutely not. If the engine of the car is damaged, the repair cost is much higher.
Therefore the distribution probably has at least 2 peaks: one for the cases when the engine is not damaged, another for the engine-damaged cars.
So, it is nice to use statistics for real life, but be aware that you are making many assumptions that are simply not true, and the mathematical tools were not designed to be used with such samples.
At the very least, don't expect a correct answer from your statistician.

The book starts with a sentence:
"Statistics quantifies uncertainty. To make categorical statements, with complete assurance about their level of uncertainty."
Complete assurance…

That is a joke.
Maybe in the case of rolling dice, because a 'perfect', non-tampered die behaves nicely according to a simple, known distribution.
But what about real life? Like the stock market: stock prices are very far from behaving nicely.
Someone can say: OK, don't use Gaussian statistics, use the power-law distribution.
But do you honestly believe that a stock price behaves according to the power law? No.
My firm belief is that there is no mathematical formula that can describe that distribution.
The stock price distribution lives in its own world. It doesn't obey the laws of mathematics.

These are only 4 reasons why I have a feeling that statistics is a childish toy only. It is a tool that we use to trick ourselves into thinking that we can understand and describe the world.
With all its delicate details and mathematical legerdemain, statistics is just a clever game for kids:
it has its rules, you can use it to amuse yourself, you can think that you are clever because you use it, but it is far from being usable in real-world situations.

Then why should we bother with it at all?
Maybe the answer lies in a quote from Einstein:
"One thing I have learned in a long life: that all our science ('math'),
measured against reality, is primitive and childlike
– and yet it is the most precious thing we have."

I could not agree more.

Or, equivalently, the quote from Einstein: "God does not play dice."
My interpretation of this quote is that the Universe (not God; Einstein was not religious) works in a way that cannot be described by the simple probabilities of rolling dice.
Some people share this kind of interpretation with me; somebody on a web forum interpreted it as "probability/statistics is wholly inadequate to explain/model real-world quantum effects".
You have the right to disagree with me on this interpretation. The most popular interpretation is that "the Universe is not random, but deterministic", which is a viable rendition too.

2. Bessel’s correction in SD

It is universal that people who don't understand statistics try to use it (me among them). 🙂
A typical misunderstanding about the standard deviation (SD) is whether to use N or N-1 in the denominator.
The rule of thumb is that we divide by N in the population (or model) standard deviation and by N-1 in the sample SD formula.
When some (less mathematically educated) people see that N-1 is used in an equation in an article or a book, they even suggest with great confidence that it is wrong and the author made a mistake (because if there were only 1 sample, we would divide by zero).
However, they are wrong.

Define 2 statistics:

A. 'standard deviation of the sample' (SDoS)
This one uses N in the denominator. However, this estimator, when applied to a small or moderately sized sample, tends to be too low: it is a biased estimator.

B. 'sample standard deviation' (SSD)
This one uses N-1. This is the most commonly used, adjusted version.
This correction (the use of N-1 instead of N) is known as Bessel's correction. The reason for the correction is that SSD^2 is an unbiased estimator of the variance of the underlying population. (Note: even SSD is not an unbiased estimator of the population SD; only SSD^2 is an unbiased estimator of the population variance.)
Bessel's correction corrects the bias in the estimation of the population variance, and some (but not all) of the bias in the estimation of the population standard deviation.

So, the term standard deviation of the sample (SDoS) is used for the uncorrected estimator (using N), while the term sample standard deviation (SSD) is used for the corrected estimator (using N-1). The denominator N-1 is the number of degrees of freedom in the vector of residuals.

That is, when estimating the population variance and standard deviation from a sample whose population mean is unknown, the uncorrected sample variance is a biased estimator of the population variance: it systematically underestimates it. Multiplying the uncorrected sample variance by n/(n-1) (equivalently, using 1/(n-1) instead of 1/n) corrects this and gives an unbiased estimator of the population variance.

A subtle point is that, while the sample variance (using Bessel’s correction) is an unbiased estimate of the population variance, its square root, the sample standard deviation, is a biased estimate of the population standard deviation; because the square root is a concave function, the bias is downward, by Jensen’s inequality. There is no general formula for an unbiased estimator of the population standard deviation.

One can understand Bessel’s correction intuitively through the degrees of freedom in the residuals vector:

(x_1 – X_avg, x_2 – X_avg, …, x_n – X_avg)

where X_avg is the sample mean. While there are n independent samples, there are only n – 1 independent residuals, as they sum to 0.

In intuitive terms: we are seeking the sum of squared distances from the population mean, but we end up calculating the sum of squared differences from the sample mean, which is (in effect) defined as the position closest to all the data points, i.e. the point that minimizes that sum of squared distances.
This estimate will therefore always underestimate the population variance, precisely because it is the result of a minimization. This is another way to see that the SDoS understates the population variance.
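This underestimation is easy to check with a small Monte Carlo experiment (a sketch; the sample size, trial count and N(0, 1) population are arbitrary choices of mine):

```python
import random

random.seed(42)

TRIALS = 200_000
N = 5          # small sample size, where the bias is pronounced
SIGMA2 = 1.0   # true population variance (standard normal)

sum_biased = 0.0    # variance estimates dividing by N   (SDoS^2)
sum_unbiased = 0.0  # variance estimates dividing by N-1 (SSD^2)

for _ in range(TRIALS):
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    mean = sum(sample) / N
    ss = sum((x - mean) ** 2 for x in sample)  # sum of squared residuals
    sum_biased += ss / N
    sum_unbiased += ss / (N - 1)

print(sum_biased / TRIALS)    # ~0.8 = (N-1)/N * SIGMA2: biased low
print(sum_unbiased / TRIALS)  # ~1.0 = SIGMA2: unbiased
```

The N-denominator average converges to (N-1)/N times the true variance, while the Bessel-corrected average converges to the true variance.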

The necessity of using N-1 in the denominator is illustrated by the proof that SSD^2 is an unbiased estimate of the population variance.
If we put N instead of N-1, the estimator is not unbiased.
The problem, as you can see, is that the samples are also used to generate the sample mean (as an estimator for the population mean), and that subtracts one Sigma^2 from the expected sum of squares.
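A standard sketch of that proof (a textbook derivation, assuming i.i.d. samples X_1, …, X_n with mean mu and variance sigma^2):

```latex
E\left[\sum_{i=1}^{n}(X_i-\bar X)^2\right]
  = E\left[\sum_{i=1}^{n}X_i^2 - n\bar X^2\right]
  = n(\sigma^2+\mu^2) - n\left(\frac{\sigma^2}{n}+\mu^2\right)
  = (n-1)\,\sigma^2
```

So estimating the mean from the same data removes exactly one sigma^2 from the expected sum of squares (n sigma^2 down to (n-1) sigma^2); dividing by n-1 restores expectation sigma^2, while dividing by n leaves the estimator biased low.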

– Ok, so we see that if we want an unbiased estimator for the variance, we can have one (using N-1 in the denominator). However, this estimator is not unbiased for the SD.
So whether we use N or N-1 in the SD formula, neither will be unbiased (because the square root is concave).
But it is not too difficult to see that using N-1 is more accurate, so we should use it not only for the variance, but for the SD too.

What usually bothers people about using N-1 is that if we have only 1 sample, then it is not possible to estimate the SD of the sample or the SD of the population. Because: division by zero.
And then they think the problem is in the formula, and we should use N instead of N-1.
However, it is not true. The formula is correct.
The problem is in their thinking. If we have only 1 sample, it is really not possible to estimate the population mean and the population SD at the same time.
It is crazy to expect that it is possible at all.
To estimate the SD, we first have to estimate the population mean.
So it is really true: if we have only one sample, we can estimate one thing only: the mean.
With that estimation we used up our single degree of freedom (all our data), and there is no extra information left for estimating the variance.

Let’s suppose you have only 1 sample, and you try to use the N version to calculate the variance.
The result: Variance = (X1-Mean)^2. Because Mean = X1, it will be zero.
Do you really accept that this constant zero is a good estimate for the population variance, independent of the sample?
Absolutely not. It is much better to say that we cannot calculate the variance/SD than to report a variance (zero) that we know for sure is wrong.

This should be treated in our mind as a division-by-zero case. Division of any number by zero is not defined. Similarly,
the SD of any sample having only 1 observation is not defined. Get used to it.
Algorithms and programs should return NaN (Not a Number) instead of 0 in that case.
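A minimal sketch of that convention in Python (the function name is my own; note that the standard library’s statistics.stdev instead raises an error for fewer than two data points):

```python
import math

def sample_sd(xs):
    """Sample standard deviation (N-1 denominator); NaN when undefined."""
    n = len(xs)
    if n < 2:
        return float("nan")  # SD of 0 or 1 observations is not defined
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

print(sample_sd([3.14]))           # nan, not a misleading 0
print(sample_sd([2.0, 4.0, 6.0]))  # 2.0
```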

3. Confidence in our backtested CAGR

Let’s use statistics for gaining some confidence in our Neural Network prediction system.
Our Neural Network algorithm is not deterministic, because of the random initialization of the NN weights. Therefore the backtest result is a random variable: its values fluctuate around the population mean, the true expected value of the backtest. The backtests run over the last 23 years, using the ANN (T-0, T-1) version.
We ran 13 backtest experiments and found the following annual %profits (gCAGR):
35.08%, 33.63%, 33.29%, 37.00%, 35.02%, 35.68%, 36.14%, 33.17%, 33.94%, 34.50%, 34.04%, 35.49%, 34.75%.

The arithmetic average is 34.75%. That is the sample mean. However, it is not the population mean: the true expected value.
Is this 34.75% a good number? Should we be happy about it? Should we trust enough to play it in real life?

There are 2 ways to answer the question of whether 34.75% gCAGR is good enough (both approaches are viable):
A. with confidence intervals: e.g. with 95% confidence, we can say that the true gCAGR is between X, Y
B. with hypothesis testing: Assume gCAGR=0% (=H_0); what is the chance of having 34.75% as the sample mean? Can we disprove H_0?

Note that intuitively, looking at these 13 numbers, we feel that the strategy is stable; a Sensitivity Analysis would show it robust.
But suppose another backtest gave these numbers:
0%, -42%, 0%, +42%, 173.75%
The mean of these tests is also 34.75%. Would you trust this algorithm with your own money?

The point of the following two kinds of analysis is to gain enough confidence in the backtest results.

We try to estimate the NN true CAGR with some degree of confidence.

The 34.75% is only the sample mean. Let’s also calculate the sample standard deviation: SSD = 1.14%.
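These two statistics can be reproduced from the 13 runs (a sketch using Python’s statistics module; statistics.stdev uses the N-1 denominator):

```python
import statistics

# the 13 backtested annual %profits (gCAGR) quoted above
gcagr = [35.08, 33.63, 33.29, 37.00, 35.02, 35.68, 36.14,
         33.17, 33.94, 34.50, 34.04, 35.49, 34.75]

mean = statistics.mean(gcagr)   # sample mean
ssd = statistics.stdev(gcagr)   # sample standard deviation (N-1)

print(round(mean, 2))  # 34.75
print(round(ssd, 2))   # 1.14
```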

This is from the Excel table that summarizes it:

In human language form:
-“We can say with 68% confidence that the true population mean CAGR is between 33.60% and 35.89%”

-“We can say with 95% confidence that the true population mean CAGR is between 32.46% and 37.04%”

-“We can say with 99% confidence that the true population mean CAGR is between 31.32% and 38.18%”
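The quoted bounds correspond to mean ± 1, 2 and 3 SSD (a sketch; note that a textbook confidence interval for the population mean would instead use the standard error SSD/sqrt(13) ≈ 0.32%, giving much narrower intervals):

```python
import statistics

gcagr = [35.08, 33.63, 33.29, 37.00, 35.02, 35.68, 36.14,
         33.17, 33.94, 34.50, 34.04, 35.49, 34.75]

mean = statistics.mean(gcagr)
ssd = statistics.stdev(gcagr)

# mean +/- k*SSD reproduces the three quoted ranges
for k, conf in ((1, 68), (2, 95), (3, 99)):
    lo, hi = mean - k * ssd, mean + k * ssd
    print(f"{conf}%: {lo:.2f}%..{hi:.2f}%")
```

The loop prints 33.60%..35.89%, 32.46%..37.04% and 31.32%..38.18%, matching the three statements above.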

That looks good, because even assuming the worst case of 31.32% (with 99% confidence), we still have a positive CAGR; we don’t lose money.

However, even then, there is a 1% chance (once in 100 times) that the true CAGR is not in that range, so we may not be as profitable as we expect.

And for the sake of completeness, a very useful statement:
-“We can say with 100% confidence that the true population mean CAGR is between -infinity and + infinity”

So, we can statistically prove with 95% confidence that the strategy is good, but: who cares? It is only playing with numbers. Real life doesn’t bother obeying our proof.
In real life you should expect much worse performance with much worse confidence.
I guess that if you can prove with mathematical tools that it works 95% of the time, in real life it works only 80% of the time.
That is because the synthetic mathematical assumptions (like independent, Gaussian random variables) are so far from real life.
(Partly, another reason is backtest bias, but backtest bias worsens the CAGR performance, not the confidence.)

So, one can ask: why bother at all with proving that the strategy works? We can prove it, but knowing how unusable the proof is, why should we care?
I attribute a quote to George Soros: “If it works, do it.” Don’t bother too much trying to prove it mathematically.
So I guess it is better to spend our time on experiments and simulations than on trying to theoretically prove why a strategy works.

The conclusion of this way of thinking:
we can say with 99% confidence that the true population mean CAGR is between 31.32% and 38.18%. Since even the lower bound is positive, we
are sure (99% sure) that our strategy has some alpha (profit edge), so we are confident to start this strategy in real life.

Let’s go another way. Instead of confidence intervals, use hypothesis testing:
Let’s form a Null Hypothesis: H_0: assume gCAGR = 0%. (or gCAGR <= 0%)
This means that our strategy is not better than random.
The Alternative Hypothesis: H_A: gCAGR > 0%. That our strategy has a genuine prediction power.

The question we try to answer is
Assuming H_0 is true, what is the chance of having 34.75% as the sample mean? Could it occur by chance?
If the chance is too low, we can disprove the Null Hypothesis H_0, and prove the alternative hypothesis.

Pr(X_avg > 34.75% | gCAGR= 0%) = ?

Let’s calculate the Z value of the statistic.
Assuming the population mean gCAGR = 0%, the sample mean would have the same expected value.
Z_value = (sample mean – 0%) / (SSD / sqrt(13)) = (34.75% – 0%) / (1.14% / 3.6) = 109.7

Have you ever seen a Z score like that in your life? 🙂
Pr(Z_value > 109.7 | gCAGR= 0%) = 0.
Actually, I couldn’t find any software package that could calculate this number; they usually just report 0.
Beyond a Z score of 4 or 5, the probability is so minuscule that it is practically zero.
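The computation can be sketched as follows (the upper tail of the standard normal via math.erfc; for a Z around 110 the probability underflows straight to 0.0, which is the “= 0” those packages report):

```python
import math
import statistics

gcagr = [35.08, 33.63, 33.29, 37.00, 35.02, 35.68, 36.14,
         33.17, 33.94, 34.50, 34.04, 35.49, 34.75]

mean = statistics.mean(gcagr)
ssd = statistics.stdev(gcagr)
se = ssd / math.sqrt(len(gcagr))       # standard error of the sample mean

z = (mean - 0.0) / se                  # H_0: true gCAGR = 0%
p = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided tail Pr(Z > z)

print(z)  # about 109.6; the post's 109.7 uses the rounded SSD=1.14 and sqrt(13)=3.6
print(p)  # underflows to 0.0: "virtually zero"
```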

We wanted a 1% significance level, but this probability is much smaller than 1% (it is virtually 0).
So there is virtually 0% chance of observing these 13 backtest results if the true gCAGR were 0%.
Therefore, we reject the Null Hypothesis, and we are happy that our strategy is genuinely profitable. (gCAGR > 0%).

However, even if this is true, it only states that the true population mean (gCAGR) > 0%; it doesn’t prove anything about the future. Stock markets may close down. In that case: no profit. Armageddon: no profit. We contend something about the past only. And even if the past continues more or less in the same way in the future, we know nothing about what will happen next year.
Maybe the strategy will be profitable only if we play it long enough (> 20 years); maybe that 34.75% profit is contributed mostly by a few unique years (like 2008).
As those years may not be repeated in the future, we hardly know anything about the future potential of the strategy. We can only acquire information and derive conclusions about its past potential.

This is partly why money management techniques have to be used with any real life played strategy. Because proven successful strategies in the past may stop working in the future. Techniques like ‘playing the equity curve’ will signal alerts to terminate a strategy when it is likely that it stopped working.

We reviewed 2 different ways of answering the question: Is this 34.75% a good enough number?
A. with confidence intervals, and
B. with hypothesis testing.
We prefer the confidence interval way of thinking, because it expresses a concrete range with lower and upper bounds, while hypothesis testing gives only a binary answer: reject the null hypothesis or not.