Cartoon Guide to Statistics

30Jul11

Instead of backtesting strategies in this post, let’s turn our attention to a branch of mathematical science called statistics.

1. The Book


I can wholeheartedly recommend a book called
The Cartoon Guide to Statistics.

There is another similar piece of work, The Manga Guide to Statistics.
I took a quick look, but its story is built around the romantic relationship of a young Japanese girl.
It is definitely not how I would like to see one of the most difficult and serious parts of mathematical science presented.
Never mind; manga fans may be interested in it.

The Cartoon Guide to Statistics is a very well organized book. It touches almost all parts of statistics, though of course it cannot go very deeply into the topics. It is funny, amusing, and a joy to read.
I contend that an average secondary-school student should have no problem with the difficulty, although some reviewers (interestingly, UK reviewers mostly) complained that the book is too complex and they couldn't follow it. Trust me; it is a very easy book.
Strongly recommended for all math students (hopefully in their first year at university).

However, reading this book, which very nicely summarizes what statistical science is, left me with an uneasy feeling.
I hope I don't offend anybody, but to me (a very personal opinion) the whole framework of statistical tools looks like a mathematical toy. In real life it doesn't work, it cannot be used, it cannot be trusted. We can play with it, as we play with toys, but what for?
It is a nice mathematical framework, but real life doesn't play by the rules that statistics defines.

Instead of talking vaguely about why I despise it, let me give some concrete examples.

1.1
One of the tools that I think is a joke is called re-sampling:
a technique that treats the sample as if it were the population. It goes by other names like randomization, jackknife, and bootstrapping. To me, it looks like an amusing but surely non-working tool in real life. Yes, it is true that you can prove mathematically that it works, but in real life, would you use it?
Let's suppose you watched how a stock was traded for a week. One week is clearly not enough to draw statistical (or any other) conclusions that you would trust enough to risk your own money.
Now, clever mathematicians invent a tool called resampling (or bootstrapping, whatever).
Based on that one week of observation, which has only 5 samples, they generate another 500 samples.
Now you have 500 samples. So you can make reliable statistical conclusions!… or not.
It can be proven mathematically that your 500 generated samples are unbiased estimates for the other, non-observed samples. However, those 5 initial samples are based only on that one observed week. Maybe that week was the Christmas week. Even if you generate 500 samples from it, would you risk your money on non-Christmas weeks based on the 500 artificially generated samples?
No.
However, as the unbiased nature can be proven mathematically, you have a false sense of confidence in your method.
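To make the criticism concrete, here is a minimal sketch of the bootstrap idea; the five daily returns are hypothetical placeholders, not real data:

```python
# Bootstrap resampling sketch: resample 5 observed daily returns 500 times.
import numpy as np

rng = np.random.default_rng(42)
observed_week = np.array([0.012, -0.004, 0.007, 0.001, -0.009])  # 5 hypothetical daily returns

# Draw 500 bootstrap resamples (sampling with replacement) and collect their means.
boot_means = np.array([
    rng.choice(observed_week, size=observed_week.size, replace=True).mean()
    for _ in range(500)
])

print("observed mean:            %.4f" % observed_week.mean())
print("bootstrap mean of means:  %.4f" % boot_means.mean())
print("bootstrap SD of the mean: %.4f" % boot_means.std(ddof=1))
# However many resamples we draw, everything is still built from the same
# 5 observations of that single (possibly atypical) week.
```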

1.2
Another suspicion: a typical thing in statistics is that a chapter starts with 'let's suppose that the results of each trial are independent'.
What?
Where in life can you find something that you measure many times and the measurements are independent? Yes, maybe rolling dice (and maybe even that is not true), but what about the stock market? Can you assume that daily stock returns are independent of each other?
You can assume it; at your own peril! Because it is not true.
Then, if it is clear that statistics can work only in artificial, nonexistent scenarios, why should we use it at all?

1.3
Another simplification: let's say we compile statistics about the repair cost of crashed cars.
We take a sample of, say, 10 crashed cars. Can we assume that the repair cost samples are independent, normally distributed samples of a random variable in real life?
Absolutely not. Why would they be independent? Assume there was a heat wave last week in the country. Half of the car crashes are related to this event. Cars without air conditioning (more tired drivers) are over-represented. Are the repair costs of the air-conditioned and non-air-conditioned cars different? Yes, non-air-conditioned cars are usually cheaper and less costly to mend. So these 10 sample cars may not be independent.
Is the other assumption, that the repair cost is normally distributed, true? Absolutely not. If the engine of the car is damaged, the repair cost is much higher.
Therefore the distribution probably has at least 2 peaks: one for the cases when the engine is not damaged; another for the cars with a damaged engine.
So, it is nice to use statistics in real life, but be aware that you are making many assumptions that are simply not true, and the mathematical tools were not designed to be used with such samples.
At least, don't expect a correct answer from your statistician.
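A tiny simulation sketch, with entirely hypothetical cost figures, illustrates the two-peak argument:

```python
# Two-peak repair-cost sketch: a mixture of "engine damaged" and "engine intact"
# costs is far from a single Gaussian bell. All figures are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
engine_damaged = rng.random(n) < 0.3                 # assume ~30% of crashes damage the engine
cost = np.where(engine_damaged,
                rng.normal(6000, 800, n),            # hypothetical expensive repairs
                rng.normal(1500, 400, n))            # hypothetical cheap repairs

print("mean: %.0f, SD: %.0f" % (cost.mean(), cost.std(ddof=1)))
# A histogram of `cost` shows two separate peaks; summarizing it with a single
# mean and SD (as the normality assumption does) hides that structure.
```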

1.4
The book starts with a sentence:
“Statistics quantifies uncertainty. To make categorical statements, with complete assurance about their level of uncertainty.”
Complete assurance…

That is a joke.
Maybe in the case of rolling dice, because the average of many rolls of a 'perfect', untampered die behaves nicely according to the Gaussian distribution.
But what about real life? Like the stock market. Stock prices are very far from behaving nicely.
Someone may say: OK, don't use Gaussian statistics, use a power-law distribution.
But do you honestly believe that a stock price behaves according to a power law? No.
My firm belief is that there is no mathematical formula that can describe that distribution.
The stock price distribution lives in its own world. It doesn't obey the laws of mathematics.

These are only four reasons why I have the feeling that statistics is merely a childish toy. It is a tool that we use to trick ourselves into believing that we can understand and describe the world.
With all its delicate details and mathematical legerdemain, statistics is just a clever game for kids:
it has its rules, you can use it to amuse yourself, you can think that you are clever because you use it, but it is far from being usable in real-world situations.

Then why should we bother with it at all?
Maybe the answer lies in a quote from Einstein:
“One thing I have learned in a long life: that all our science (‘math’),
measured against reality is primitive and childlike
– and yet it is the most precious thing we have.”

I could not agree more.

Or, equivalently, the quote from Einstein: “God does not play dice.”
My interpretation of this quote is that the Universe (not God; Einstein was not religious) works in a way that cannot be described by the simple probabilities of rolling dice (i.e. the Gaussian distribution).
Some people share this kind of interpretation with me; somebody on a web forum interpreted it as “probability / statistics is wholly inadequate to explain/model real world quantum effects”.
You have the right to disagree with me on this interpretation. The most popular interpretation is that “the Universe is not random, but deterministic”, which is a viable rendition too.

2. Bessel’s correction in SD

It is universal that people who don't understand statistics try to use it (me among them). 🙂
A typical misunderstanding concerns the standard deviation (SD): whether to use N or N-1 in the denominator.
The rule of thumb is that we divide by N in the population (or model) standard deviation and we divide by N-1 in the sample SD formula.
When some less mathematically educated people see N-1 used in an equation in an article or a book, they even suggest with great confidence that it is wrong and that the author made a mistake. (Because if there is only 1 sample, we would have to divide by zero.)
However, they are wrong.

Define 2 statistics:

A. ‘standard deviation of the sample’ (SDoS)
This one uses N in the denominator. However, when applied to a small or moderately sized sample, this estimator tends to be too low: it is a biased estimator.

B. 'sample standard deviation' (SSD)
This one uses N-1. This is the most commonly used, adjusted version.
This correction (the use of N-1 instead of N) is known as Bessel's correction. The reason for the correction is that SSD^2 is an unbiased estimator of the variance of the underlying population. (Note that even SSD is not an unbiased estimator of the population SD; only SSD^2 is an unbiased estimator of the population variance.)
Bessel's correction removes the bias in the estimation of the population variance, and some (but not all) of the bias in the estimation of the population standard deviation.

So, the term standard deviation of the sample (SDoS) is used for the uncorrected estimator (using N) while the term sample standard deviation (SSD) is used for the corrected estimator (using N – 1). The denominator N – 1 is the number of degrees of freedom in the vector of residuals.
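Written out explicitly (with \bar{X} denoting the sample mean), the two estimators are:

```latex
\mathrm{SDoS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(X_i-\bar{X}\bigr)^2},
\qquad
\mathrm{SSD} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\bigl(X_i-\bar{X}\bigr)^2}.
```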


That is, when estimating the population variance and standard deviation from a sample while the population mean is unknown, the uncorrected sample variance (using N) is a biased estimator of the population variance and systematically underestimates it. Multiplying it by N/(N-1) (equivalently, using 1/(N-1) instead of 1/N) corrects for this and gives an unbiased estimator of the population variance.

A subtle point is that, while the sample variance (using Bessel’s correction) is an unbiased estimate of the population variance, its square root, the sample standard deviation, is a biased estimate of the population standard deviation; because the square root is a concave function, the bias is downward, by Jensen’s inequality. There is no general formula for an unbiased estimator of the population standard deviation.

One can understand Bessel's correction intuitively as the degrees of freedom in the residuals vector:

(X1 - X_avg, X2 - X_avg, …, Xn - X_avg)

where X_avg is the sample mean. While there are n independent samples, there are only n - 1 independent residuals, as they sum to 0.

In intuitive terms, we are seeking the sum of squared distances from the population mean, but end up calculating the sum of squared differences from the sample mean, which is (in effect) defined as the position closest to all the data points, i.e. the point that minimizes that sum of squared distances.
This estimate will therefore always underestimate the population variance, because it is the result of a minimization. It is another way to see that SDoS understates the population variance.

The necessity of using N-1 in the denominator is best seen in the proof that SSD^2 is an unbiased estimate of the population variance.

If we put N instead of N-1, the estimator is not unbiased.
The problem, as you can see, is that the samples are also used to compute the sample mean (as an estimator of the population mean), and that subtracts one sigma^2 from the expected sum of squares.
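A sketch of that standard derivation, assuming N independent samples with population mean mu and variance sigma^2:

```latex
\mathbb{E}\!\left[\sum_{i=1}^{N}(X_i-\bar{X})^2\right]
= \mathbb{E}\!\left[\sum_{i=1}^{N}(X_i-\mu)^2\right] - N\,\mathbb{E}\!\left[(\bar{X}-\mu)^2\right]
= N\sigma^2 - N\cdot\frac{\sigma^2}{N}
= (N-1)\,\sigma^2 ,
```

so dividing the sum of squares by N-1 (rather than N) makes its expected value exactly sigma^2; this is the one sigma^2 lost to estimating the mean from the same data.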

OK, so we see that if we want an unbiased estimate of the variance, we can have it (using N-1 in the denominator). However, this estimator is not unbiased for the SD.
So whether we use N or N-1 in the SD formula, neither will be unbiased (because the square root is concave).
But it is not too difficult to see that using N-1 is less biased, so we should use it not only for the variance but for the SD too.
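A minimal simulation sketch (not from the post) that checks these bias claims numerically:

```python
# Draw many small samples from a population with known variance and compare
# the average of the N-denominator and (N-1)-denominator estimates to the truth.
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0                          # true population variance (sigma = 2)
n, trials = 5, 200_000                # small samples, many repetitions

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
var_n  = samples.var(axis=1, ddof=0)  # SDoS^2: divide by N   (biased low)
var_n1 = samples.var(axis=1, ddof=1)  # SSD^2:  divide by N-1 (unbiased)

print("true variance:              ", sigma2)
print("mean of the N   estimator:   %.3f" % var_n.mean())           # about (n-1)/n * sigma2 = 3.2
print("mean of the N-1 estimator:   %.3f" % var_n1.mean())          # about 4.0
print("mean of SSD (sqrt of N-1):   %.3f" % np.sqrt(var_n1).mean()) # still below sigma = 2
```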

What usually bothers people about using N-1 is that if we have only 1 sample, it is not possible to estimate the SD of the sample or the SD of the population. Because: division by zero.
And then they think the problem is in the formula, and that we should use N instead of N-1.
However, that is not true. The formula is correct.
The problem is in their thinking. If we have only 1 sample, it really is not possible to estimate the population mean and the population SD at the same time.
It is crazy to expect that it is possible at all.
To estimate the SD, we have to estimate the population mean first.
So it is really true: if we have only one sample, we can estimate one thing only: the mean.
With that estimation we used up our 1 degree of freedom (all our data), and there is no information left for estimating the variance.

Let's suppose you have only 1 sample and try to use the N version to calculate the variance.
What will the result be? Variance = (X1 - Mean)^2. Because Mean = X1, it will be zero.
Do you really accept that this constant zero is a good estimate of the population variance, independently of the sample?
Absolutely not. It is much better to say that we cannot calculate the variance/SD than to report a variance (zero) that we know for sure is wrong.

This should be treated in our mind like the division-by-zero case. Division of any number by zero is not defined. Similarly,
the SD of a sample having only 1 observation is not defined. Get used to it.
Algorithms and programs should return NaN (Not a Number) instead of 0 in that case.
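A minimal sketch of this behaviour (the helper function is hypothetical, not from any particular library):

```python
# Sample standard deviation with Bessel's correction; undefined for one observation.
import math

def sample_sd(xs):
    """Sample SD with the N-1 denominator; NaN when it is not defined."""
    n = len(xs)
    if n < 2:
        return float("nan")           # like division by zero: not defined, not 0
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

print(sample_sd([3.14]))              # nan
print(sample_sd([1.0, 2.0, 3.0]))     # 1.0
```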

3. Confidence in our backtested CAGR

Let's use statistics to gain some confidence in our Neural Network prediction system.
Our Neural Network algorithm is not deterministic, because of the random initialization of the NN weights. The backtest result is therefore a random variable. Its random results fluctuate around the population mean, the true expected value of the backtest. The backtests run over the last 23 years, using the ANN (T-0, T-1) version.
We ran 13 backtest experiments and found the following annual % profits (gCAGR):
35.08%, 33.63%, 33.29%, 37.00%, 35.02%, 35.68%, 36.14%, 33.17%, 33.94%, 34.50%, 34.04%, 35.49%, 34.75%.

The arithmetic average is 34.75%. That is the sample mean. However, it is not the population mean: the true expected value.
Is this 34.75% a good number? Should we be happy about it? Should we trust it enough to play it in real life?

There are 2 ways to answer the question whether 34.75% gCAGR is good enough or not. (Both approaches are viable.)
A. With confidence intervals: e.g. with 95% confidence, we can say that the true gCAGR is between X and Y.
B. With hypothesis testing: assume gCAGR = 0% (= H_0); what is the chance of observing 34.75% as the sample mean? Can we disprove H_0?

Note that intuitively, looking at these 13 numbers, we feel that the strategy is stable. Sensitivity analysis would prove it robust.
But let's suppose another backtest gave these numbers:
0%, -42%, 0%, +42%, 173.75%
The mean of these tests is also 34.75%. Would you trust this algorithm with your own money?

The point of the following two kinds of analysis is to gain enough confidence in the backtest results.

A.
We try to estimate the true CAGR of the NN with some degree of confidence.

The 34.75% is only the sample mean. Let's calculate the sample standard deviation (SSD): SSD = 1.14%.

The calculation was done in an Excel table (not reproduced here).
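A minimal code sketch that recomputes the same numbers; it produces the quoted bands as the sample mean plus/minus 1, 2 and 3 sample standard deviations (note that a textbook confidence interval for the mean itself would divide the SSD by sqrt(n)):

```python
# Recompute the sample mean, sample SD and the 1/2/3-SSD bands quoted below.
import numpy as np

gcagr = np.array([35.08, 33.63, 33.29, 37.00, 35.02, 35.68, 36.14,
                  33.17, 33.94, 34.50, 34.04, 35.49, 34.75])   # backtested gCAGR in %

mean = gcagr.mean()                  # sample mean ~ 34.75%
ssd = gcagr.std(ddof=1)              # sample SD   ~ 1.14%

for k, label in [(1, "68%"), (2, "95%"), (3, "99%")]:
    print("%s band: %.2f%% .. %.2f%%" % (label, mean - k * ssd, mean + k * ssd))
```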

In human language form:
-“We can say with 68% confidence that the true population mean CAGR is between 33.60% and 35.89%”

-“We can say with 95% confidence that the true population mean CAGR is between 32.46% and 37.04%”

-“We can say with 99% confidence that the true population mean CAGR is between 31.32% and 38.18%”

That looks good, because even assuming the worst case of 31.32% (with 99% confidence), we have a positive CAGR; we don't lose money.

However, even with that, there is a 1% chance (once in 100 times) that the CAGR is not in that range, so we may not be as profitable as we expect.

And for the sake of completeness, a very useful statement:
-“We can say with 100% confidence that the true population mean CAGR is between -infinity and + infinity”
🙂

So, we can statistically prove with 95% confidence that the strategy is good, but who cares? It is only playing with numbers. Real life doesn't bother obeying our proof.
In real life you should expect much worse performance with much worse confidence.
I guess that if you can prove with mathematical tools that it works 95% of the time, in real life it works only 80% of the time.
That is because the synthetic mathematical rules and assumptions (like independent, Gaussian random variables) are so far from real life.
(And partly, another reason is backtest bias, but backtest bias makes the CAGR performance worse, not the confidence worse.)

So one can ask the question: why bother at all with the task of proving that the strategy works? We can prove it, but as we know how unusable the proof is, why should we care?
I attribute a quote to George Soros: "If it works, do it." Don't bother too much trying to prove it mathematically.
So I guess it is better to spend our time on experiments and simulations than on trying to reason about and prove theoretically why a strategy works.

The conclusion of this way of thinking is that
we can say with 99% confidence that the true population mean CAGR is between 31.32% and 38.18%. Since even the lower bound is a positive number, we
are sure (99% sure) that our strategy has some alpha (profit edge), so we are confident enough to start this strategy in real life.

B.
Let's go another way. Instead of confidence intervals, use hypothesis testing.
Let's form a Null Hypothesis, H_0: assume gCAGR = 0% (or gCAGR <= 0%).
This means that our strategy is no better than random.
The Alternative Hypothesis, H_A: gCAGR > 0%. That is, our strategy has genuine predictive power.

The question we try to answer is:
assuming H_0 is true, what is the chance of observing 34.75% as the sample mean? Could it occur by chance?
If the chance is too low, we can reject the Null Hypothesis H_0 and accept the alternative hypothesis.
Formally:

Pr(X_avg > 34.75% | gCAGR= 0%) = ?

Let's calculate the Z value of the statistic.
Assuming gCAGR = 0% for the population mean, the sample mean would have the same expected value.
Z_value = (sample mean - hypothesized mean) / (SSD / sqrt(13)) = (34.75% - 0%) / (1.14% / 3.6) = 109.7

Have you ever seen a Z score like that in your life? 🙂
Pr(Z_value > 109.7 | gCAGR = 0%) = 0.
Actually, I couldn't find any software package that could calculate this number. They usually say it is 0.
After a Z score of 4 or 5, the chance is so minuscule that it is practically zero.
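A minimal sketch of the same calculation in code (assuming numpy and scipy are available; strictly speaking a t-distribution with 12 degrees of freedom fits n = 13, but it makes no difference at this magnitude):

```python
# One-sided Z test of H_0: gCAGR = 0% against the 13 backtested gCAGR values.
import numpy as np
from scipy.stats import norm

gcagr = np.array([35.08, 33.63, 33.29, 37.00, 35.02, 35.68, 36.14,
                  33.17, 33.94, 34.50, 34.04, 35.49, 34.75])   # in %

z = (gcagr.mean() - 0.0) / (gcagr.std(ddof=1) / np.sqrt(len(gcagr)))
p = norm.sf(z)                        # Pr(Z > z | H_0), the one-sided p-value

print("Z = %.1f, p-value = %g" % (z, p))   # Z ~ 110, p underflows to 0.0
```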

We wanted a 1% significance level, but this probability is much smaller than 1% (it is virtually 0).
So there is virtually zero chance that, with a true gCAGR of 0%, we would have observed these 13 backtest results.
Therefore we reject the Null Hypothesis, and we are happy that our strategy is genuinely profitable (gCAGR > 0%).

However, even if this is true, it only states that the true population mean (gCAGR) > 0%; it doesn't prove anything about the future. Stock markets may close down. In that case: no profit. Armageddon: no profit. We contend something about the past only. Also, even if the past continues more or less in the same way in the future, we know nothing about what will happen next year.
Maybe the strategy will be profitable if we play it long enough (> 20 years); maybe that 34.75% profit is contributed mostly by only a few unique years (like 2008).
As these years may not be repeated in the future, we hardly know anything about the future potential of the strategy. We can only acquire information and derive conclusions about its past potential.

This is partly why money-management techniques have to be used with any strategy played in real life. Strategies proven successful in the past may stop working in the future. Techniques like 'playing the equity curve' will signal alerts to terminate a strategy when it has likely stopped working.

We reviewed two different ways of answering the question: is this 34.75% a good enough number?
A. with confidence intervals, and
B. with hypothesis testing.
We prefer the confidence-interval way of thinking, because it expresses a concrete range with lower and upper bounds, while hypothesis testing gives only a binary answer: possible or not.


