When someone quotes the rate of return on a specific investment, or cites average returns on an index to support a theory about achieving similar results with their own investment, do you know exactly what they are talking about? Furthermore, what is the probability that you will in fact end up with what everyone else is quoting as “average”? The old joking definition of statistics is: *generally everyone; specifically no one.* And this seems pretty fitting for a novice skill level in probability and statistics. However, once we get a little more technical and employ a few more mathematical techniques, we start to see what purpose these numbers really serve. We'll explore some useful applications of these figures in this post.

The mean, or average, is a calculated value that shows us the general tendency of a set of data (I'm being intentionally lazy here; if you have a PhD in Math, I'm fully aware that this isn't a complete and exhaustive explanation, but it works for our practical purposes). We derive this number by adding up all the values in a set of data and dividing that sum by the total number of observations (the data points) we have. For example, the mean of the data set 5, 6, 7, 8, 9 is 7 (5+6+7+8+9=35 => 35/5=7). This is referred to in higher math settings as the **Arithmetic Mean.**
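If you'd rather let a computer do the adding, here is a minimal Python sketch of the same calculation. The variable names are just illustrative; the data set is the one from the example above.

```python
from statistics import mean

# The example data set from the paragraph above
data = [5, 6, 7, 8, 9]

# Arithmetic mean: sum of the values divided by the number of observations
total = sum(data)
arithmetic_mean = total / len(data)

print(total)            # 35
print(arithmetic_mean)  # 7.0

# The standard library's mean() does the same thing
print(mean(data) == arithmetic_mean)  # True
```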

On its own, the mean is generally a pretty useless number, though a lot of people insist on obnoxiously quoting it like it carries significant purpose. We've heard that the average rate of return for the S&P 500 index is 10%. The problem (and perhaps danger) of this sort of information, is that the applicability of this statistic is extremely limited, and often used to infer wildly incorrect things.

Does a 10% average rate of return on the S&P 500 index mean you can expect a 10% rate of return if you were invested in a fund that mimics the performance of the index?

Like most everything in life, the mean of a set of data is context for a larger idea, and when combined with other pieces of information it becomes much more useful to you. The single most important piece of information to have alongside a data set's mean is its standard deviation. (At this point I'd like to note that most of the time people work with sample means and sample standard deviations, because knowing the mean or standard deviation of the entire population is impossible or prohibitively costly. I'm not going to worry about making that distinction throughout this post.) The standard deviation tells us how tightly the data clusters around the mean (e.g. is it tightly distributed or is there a lot of variation?). The standard deviation is critically important in putting together a probability distribution, which is very helpful in determining the probability of observing a certain value in a set of data. Instead of diving into an overly complicated discussion about the construction of a probability function and the difference between a cumulative distribution function and a probability density function, let's just understand the following:

In a normally distributed set of data (one that follows the commonly recognized Bell Curve), about 95% of all data points lie within two standard deviations of the mean (left and right) and about 99.7% of all data points lie within three standard deviations of the mean (both left and right).

So, when someone states that the average rate of return on the S&P 500 index is 10%, a helpful follow-up question is: what's the standard deviation? The answer happens to be 15%. Three standard deviations is 45 percentage points, so this would mean you'd have a 99.7% chance of having a return fall between -35% and 55%.
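To see where those ranges come from, here is a small Python sketch. It assumes, as this post does, that annual returns are normally distributed with a mean of 10% and a standard deviation of 15%; the function name `band` is just something I made up for the illustration.

```python
from statistics import NormalDist

# Assumption (from the post): annual returns ~ Normal(mean 10%, SD 15%)
mean_return = 10.0
std_dev = 15.0
returns = NormalDist(mu=mean_return, sigma=std_dev)

def band(k):
    """Return (low, high, probability) for the range within k SDs of the mean."""
    low = mean_return - k * std_dev
    high = mean_return + k * std_dev
    probability = returns.cdf(high) - returns.cdf(low)
    return low, high, probability

for k in (2, 3):
    low, high, probability = band(k)
    print(f"within {k} SD: {low:.0f}% to {high:.0f}% (probability {probability:.1%})")
```

Two standard deviations covers -20% to 40% at a bit over 95%, and three covers -35% to 55% at roughly 99.7%.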

Now let's equip you with a question that will be infinitely more helpful in staking out a decision given the statistics that are handed to you. What's the probability of getting a certain value (exactly)? Calculating this requires evaluating the probability density function at the 10% rate of return based on what you know about the data (oh sure thing, just give me a second and I'll fire up MATLAB…). OK, easier said than done, and in truth the density function is pretty boring and not particularly useful for our purposes; the cumulative distribution function is way more helpful. The cumulative function is useful because it tells you the probability of getting “at most” a certain value (subtract that from 1 and you have the probability of getting that value “or more”). Now, here's the really good news: you don't need to perform a thorough calculation of this function to get an acceptable answer (the real answer is an integral of the density function; if you don't know what that means, don't worry about it, let's just leave it at it has something to do with Calculus). Instead, we can use a much simpler process that allows us to make predictions about what might happen given the parameters.
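For the curious, the “at most” and “or more” probabilities don't actually require MATLAB. Here is a hedged sketch using Python's standard library, again assuming a normal distribution with a 10% mean and 15% standard deviation; the helper name `prob_at_least` is my own invention, not a library function.

```python
from statistics import NormalDist

# Assumption (from the post): annual returns ~ Normal(mean 10%, SD 15%)
returns = NormalDist(mu=10.0, sigma=15.0)

def prob_at_least(threshold):
    """Probability of a return of `threshold` percent or more: 1 - CDF."""
    return 1 - returns.cdf(threshold)

# The cumulative function gives "at most"; 1 minus it gives "or more"
p_at_most_10 = returns.cdf(10.0)
p_at_least_10 = prob_at_least(10.0)

# Probability of a losing year (a return of 0% or less)
p_loss = returns.cdf(0.0)

print(f"P(return <= 10%) = {p_at_most_10:.3f}")
print(f"P(return >= 10%) = {p_at_least_10:.3f}")
print(f"P(return <= 0%)  = {p_loss:.3f}")
```

Under these assumptions the first two come out to one half each, and a losing year has roughly a one-in-four chance.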

If your head just exploded during the last paragraph, don't worry, the technical crap is done, I got it out of my system.

Monte-Carlo Simulations sound impressive and complicated, but they aren't. The toughest part is randomizing the data, and any good number-crunching software can do it (I used MS Excel for the example I'm about to unleash on you). A Monte-Carlo Simulation takes the mean and standard deviation of a set of data, considers the distribution function of the data (e.g. Normal, Poisson, Bernoulli, etc.) and runs through many scenarios of randomized data given those parameters. Results are then tabulated and the probability of a “success” is calculated (e.g. the probability of getting at least a 10% rate of return). Monte-Carlo Simulations are often used in personal finance to calculate the probability of success for a given activity. The most widely used scenario is the probability of withdrawing a certain percentage from one's savings, given an assumed rate of return, without running out of money during a certain time period. Our example here won't go that far. We're concerned instead with the probability of getting a certain rate of return on the S&P given the distribution that we understand it to have.

Taking a Monte-Carlo simulation of 1000 randomized observations given the above parameters for the S&P 500 over a 30 year period, we find that there's approximately a 50% probability of getting at least a 10% average rate of return across all 30 years (nothing unusual here, the math works as it's supposed to). This elementary observation might shock people, though: the truth behind that highly touted 10% average is that there's technically only a 50% chance of seeing at least a 10% return on the S&P 500. However, we'd be truly remiss if we closed the book at this point, for when it comes to the money in your account, the Arithmetic Mean means almost nothing, even though it's commonly quoted in a way that confuses it with actual rates of return.
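I ran mine in Excel, but the same idea can be sketched in a few lines of Python. Treat this as an illustration under stated assumptions rather than a copy of my spreadsheet: it assumes normally distributed annual returns (mean 10%, standard deviation 15%), 1000 trials of 30 years each, and a fixed random seed so the run is repeatable. The function name and the seed are made up for the example.

```python
import random

def simulate_average_returns(trials=1000, years=30, mu=10.0, sigma=15.0, seed=42):
    """Draw `years` random annual returns per trial and return each trial's arithmetic mean."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    averages = []
    for _ in range(trials):
        annual_returns = [rng.gauss(mu, sigma) for _ in range(years)]
        averages.append(sum(annual_returns) / years)
    return averages

averages = simulate_average_returns()

# "Success" here means the 30-year arithmetic average is at least 10%
successes = sum(1 for a in averages if a >= 10.0)
p_success = successes / len(averages)

print(f"P(average return >= 10%) is about {p_success:.1%}")
```

The exact figure depends on the random draws, but it should land close to 50%.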

Question: the S&P drops 25% the first year you begin an investment and then rises 50% the following year, what is your rate of return (on your money) if you are in a fund that exactly matches the S&P in return?

If we take the arithmetic mean we'll get the following answer:

-25+50=25 => 25/2=12.5

And in terms of the arithmetic mean, this is correct, but it's not the actual year over year rate of return (the one we quote when we talk about inflation, and most other investment performance). For that, we need a different mean, the **Geometric Mean**.

For the Geometric Mean we need a slightly more cumbersome formula:

((y/x)^(1/n))-1

Where:

y=value at end of our time period

x=value at beginning of our time period

n=number of time periods

If we run the scenario above (down 25%, then up 50%) through this calculation, we get a rate of return of about 6%, which is the year over year rate of return, commonly referred to as the **Compound Annual Growth Rate (CAGR)**.
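Here's that calculation worked through in Python, as a quick sketch. The $100,000 starting balance is an arbitrary choice for illustration (any starting amount gives the same percentage), and the function name `cagr` is my own.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate: ((y/x)^(1/n)) - 1."""
    return (end_value / start_value) ** (1 / periods) - 1

# Illustrative starting balance (an assumption, any amount works)
start = 100_000.0

# Year 1: the index drops 25%. Year 2: it rises 50%.
after_year_1 = start * (1 - 0.25)
after_year_2 = after_year_1 * (1 + 0.50)

arithmetic_mean = (-25 + 50) / 2          # the "average" return, in percent
growth = cagr(start, after_year_2, 2)     # the return your money actually earned

print(f"ending balance:  ${after_year_2:,.0f}")
print(f"arithmetic mean: {arithmetic_mean}%")
print(f"CAGR:            {growth:.2%}")
```

The account ends at $112,500, so the arithmetic mean says 12.5% while the money itself only grew at a little over 6% a year.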

And for those who enjoy a little multimedia explanation, here you go:

https://www.youtube.com/watch?v=dFf6ibuAl5w

In order to accurately answer the central question we asked at the beginning of this post (the probability of realizing an actual year over year return, or CAGR, of 10%), we simply go back to the values in our Monte-Carlo Simulation and plug them into a second pass of our simulation. It's not really a new simulation, because we aren't randomizing anything here. We're simply taking the values from the original simulation's 1000 observations and seeing what those randomized rates of return do to an initial investment of $100,000 over a 30 year period.

When we now look at the rate of return from the perspective of CAGR, we see that the probability of realizing at least a 10% rate of return is a bit lower than the probability of a 10% return as the Arithmetic Mean: it drops to 40.9%. The probability of squeaking out just one percentage point more (at least 11%) falls by more than a third to 26.2%, and the probability of at least 12% (a number Dave Ramsey has said is totally possible) drops down to 15.3%. Additionally, we don't get into >90% probability territory until we drop down to a CAGR of at least 5% (93.9%), and 4% gets us close to a sure thing at 98.6% probability.
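For anyone who wants to try this at home, here is a rough Python sketch of the CAGR version of the simulation. It is not my Excel workbook, so don't expect it to reproduce the figures above to the decimal: it assumes normally distributed annual returns (mean 10%, standard deviation 15%), a $100,000 starting balance, 1000 trials of 30 years, a loss capped at 100% in any one year, and an arbitrary fixed seed. The function names are hypothetical.

```python
import random

def simulate_cagrs(trials=1000, years=30, mu=0.10, sigma=0.15,
                   start=100_000.0, seed=42):
    """Compound random annual returns for each trial and return each trial's CAGR."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    cagrs = []
    for _ in range(trials):
        balance = start
        for _ in range(years):
            # Assumption: you cannot lose more than 100% in a single year
            annual_return = max(rng.gauss(mu, sigma), -1.0)
            balance *= 1 + annual_return
        cagrs.append((balance / start) ** (1 / years) - 1)
    return cagrs

def prob_at_least(cagrs, threshold):
    """Share of trials whose CAGR is at least `threshold`."""
    return sum(1 for c in cagrs if c >= threshold) / len(cagrs)

cagrs = simulate_cagrs()
for threshold in (0.04, 0.05, 0.10, 0.11, 0.12):
    print(f"P(CAGR >= {threshold:.0%}) is about {prob_at_least(cagrs, threshold):.1%}")
```

The pattern is the thing to look for: the probability of a 10% CAGR comes in below 50%, and it falls off quickly as the target return climbs.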

A little longer this time around than preferred, but well worth the extra minute or two of reading, I think. Oh, and one more fun-fact freebie for the probability world (one for the road). If we happen to run a simulation on withdrawals and end up with a 95% probability of success (not running out of money) given a new portfolio and a specified withdrawal amount, what's the probability of achieving both the 10% rate of return (CAGR) and the success of not running out of money? If we treat the two outcomes as independent, it's approximately 39%. All you do to get this is multiply the probability of getting 10% (40.9%) by the probability of not running out of money (95%). So for those of you who think increasing your risk is an answer to not having enough money at retirement, you might want to re-think your available options.

Brandon launched the Insurance Pro Blog in July of 2011 as a project to de-mystify the life insurance industry. Brandon was born in Northern New England, and he currently calls VT home. He attended Syracuse University and graduated with a triple major in Economics, Public Administration, and Political Science.