I still believe, however, that diving into the details of measure theory is overkill for introducing probability theory. I've been convinced by E.T. Jaynes that probability theory is an essential component of any education; as such, it should be presented in multiple ways at multiple times, and certainly not thrown at you as "just an application of measure theory" the way it sometimes is in math courses. I think this is improving, but there is still work to do.
What typically still happens is that people get their "taste" of probability theory (or worse, their taste of "statistics") and then move on without ever really assimilating the lessons into their lives. The trouble is that everyone must deal with uncertainty. Our brains are hard-wired to deal with it, often in ways that can be counterproductive. At its core, probability theory is just a consistent and logical way to deal with uncertainty using real numbers. In fact, it can be argued that it is the only consistent way to deal with uncertainty.
I've done a bit of experimentation over the years and dealt with a lot of data (MRI, ultrasound, and electrical impedance data). In probability theory, I found a framework for understanding what the data really tells me, which led me to spend several years studying inverse problems. A lot of problems can be framed as inverse problems: basically, inverse problem theory applies to any situation where you have data and you want to understand what that data tells you. To apply probability theory to an inverse problem, you need a model that determines how the thing you want to know gives rise to the data you have; then you essentially invert that model. Bayes' theorem provides a beautiful framework for this inversion.
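In the usual notation (my shorthand here, not anything tied to a particular library), if m stands for what you want to know and d stands for the data, Bayes' theorem reads

p(m | d) ∝ p(d | m) p(m),

where p(d | m) is the forward model evaluated as a likelihood, p(m) is the prior, and p(m | d) is the posterior that constitutes the solution.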
The result of this Bayesian approach to inverse problems, though, is not just a number. It is explicitly a probability density function (or probability mass function). In other words, the result of a proper solution to an inverse problem is a random variable or, equivalently, a probability distribution. Seeing the result of any inverse problem as a random variable changes the way you think about drawing conclusions from data.
Think about the standard problem of fitting a line to data. You plug and chug using a calculator or a spreadsheet (or a function call in NumPy), and you get two numbers: the slope and the intercept. If you properly understand inverse problems as requiring the production of a random variable, then you will not be satisfied with just these numbers. You will want to know: how certain am I about these numbers? How much should I trust them? What if I am going to make a decision on the basis of these numbers? (Speaking of making decisions, someday I would like to write about how probability theory is also under-utilized in standard business financial projections and decisions.)
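For concreteness, here is a minimal sketch of that plug-and-chug fit with made-up data (the numbers are purely hypothetical); numpy.polyfit hands back exactly two bare numbers:

import numpy as np

# Hypothetical data: y is roughly 2*x + 1 with a bit of noise added
x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 10.9])

# A degree-1 (linear) least-squares fit: coefficients come back highest order first
slope, intercept = np.polyfit(x, y, 1)

# Two point estimates, with no statement of how much to trust them
print(slope, intercept)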
Some statisticians, when faced with this regression problem, will report the "goodness" of fit and feel satisfied, but as someone who sees the power and logical simplicity of Bayesian inverse theory, I'm not satisfied by such an answer. What I want is the joint probability distribution for the slope and intercept based on the data. A lot of common regression techniques do not provide this. I'm not going to go into the historical reasons why; you can use Google to explore the details if you are interested. A lot of it comes down to the myth of objectivity and the desire to eliminate the need for a prior, a need which Bayesian inverse theory makes explicit.
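To make the idea concrete, here is a minimal sketch (my own illustration, not anything from scipy.stats) that evaluates that joint posterior on a grid, assuming Gaussian noise with a known standard deviation and a flat prior on the slope and intercept; the data are again made up:

import numpy as np

# Hypothetical data: y is roughly 2*x + 1 with added noise
x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 10.9])
sigma = 0.5                      # assumed (known) noise standard deviation

# Grid of candidate slopes and intercepts
slopes = np.linspace(1.5, 2.5, 201)
intercepts = np.linspace(0.0, 2.0, 201)
S, I = np.meshgrid(slopes, intercepts)

# Gaussian likelihood with a flat prior:
#   log p(slope, intercept | data) = const - sum_i (y_i - slope*x_i - intercept)^2 / (2*sigma^2)
resid = y - S[..., None] * x - I[..., None]
log_post = -0.5 * (resid ** 2).sum(axis=-1) / sigma ** 2

# Normalize on the grid so the density integrates to (approximately) one
post = np.exp(log_post - log_post.max())
post /= post.sum() * (slopes[1] - slopes[0]) * (intercepts[1] - intercepts[0])

# post is now a discretized joint probability density for (slope, intercept);
# marginal densities for each parameter follow by summing over the other axis.

Everything you might want to report about the fit (point estimates, credible intervals, the correlation between the two parameters) can be read off this joint density.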
As a once very active contributor to SciPy (now an occasional contributor who is still very interested in its progress), a few years ago I put a little utility into the scipy.stats package for estimating the mean, variance, and standard deviation from data; it expresses my worldview a little bit. I recently updated this utility and created a function called mvsdist. This function finally returns random variable objects (as any good inverse problem solution should!) for the mean, variance, and standard deviation derived from a vector of data. The assumptions are: 1) the data were all sampled from a random variable with the same mean and variance, 2) the standard deviation and variance are "scale" parameters, and 3) non-informative (improper) priors are used.
The details of the derivation are recorded in this paper. Any critiques of it are welcome, as I never took the time to try to get formal review for it (I'm not sure where I would have submitted it, and I'm pretty sure there is a paper out there that already expresses all of this anyway).
It is pretty simple to get started playing with mvsdist (assuming you have SciPy 0.9 installed). This function is meant to be called any time you have a bunch of data and you want to "compute the mean" or "find the standard deviation." You collect the data into a list or NumPy array of numbers and pass this into the mvsdist function:
>>> from scipy.stats import mvsdist
>>> data = [9, 12, 10, 8, 6, 11, 7]
>>> mean, var, std = mvsdist(data)
This returns three distribution objects, which I have intentionally named mean, var, and std because they represent the estimates of the mean, variance, and standard deviation of the data. Because they are estimates, they are not just numbers but (frozen) probability distribution objects. These objects have methods that let you evaluate the probability density function (.pdf(x)), evaluate the cumulative distribution function (.cdf(x)), generate random samples drawn from the distribution (.rvs(size=N)), determine an interval that contains a given fraction of random draws from the distribution (.interval(alpha)), and calculate simple summary statistics (.stats(), .mean(), .std(), .var()).
In this case, consider the following example:
>>> mean.interval(0.90)
(7.4133999449331132, 10.586600055066887)
>>> mean.mean()
9.0
>>> mean.std()
0.99999999999999989
>>> std.interval(0.90)
(1.4912098929401241, 4.137798046658852)
>>> std.mean()
2.4869681414837035
>>> std.std()
0.90276766847572409
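Because these are ordinary frozen distribution objects, you can also draw Monte Carlo samples from them or query the cdf directly; a small illustration, continuing the session above (the sampled values will of course vary from run to run):

>>> draws = std.rvs(size=10000)           # random samples of the standard-deviation estimate
>>> frac_below_2 = (draws < 2.0).mean()   # fraction of samples below 2.0 ...
>>> prob_below_2 = std.cdf(2.0)           # ... which should roughly match the cdf at 2.0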
Notice that once we have the probability distribution we can report many things about the estimate, answering not only "what is the estimate?" but also any question we might have regarding the uncertainty in that estimate. Often we may want to visualize the probability density function, as is shown below for the standard deviation estimate and the mean estimate.
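A minimal sketch of how such plots can be produced (assuming matplotlib is available) is simply to evaluate the .pdf method of the returned objects on a grid of points:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import mvsdist

data = [9, 12, 10, 8, 6, 11, 7]
mean, var, std = mvsdist(data)

# Grids of plausible values, chosen by eye from the intervals above
x_std = np.linspace(0.5, 7, 400)
x_mean = np.linspace(5, 13, 400)

plt.subplot(1, 2, 1)
plt.plot(x_std, std.pdf(x_std))
plt.title('Standard deviation estimate')
plt.subplot(1, 2, 2)
plt.plot(x_mean, mean.pdf(x_mean))
plt.title('Mean estimate')
plt.show()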
It is not always easy to solve an inverse problem by providing the full probability distribution object (especially in multiple dimensions). But when it is possible, it really does provide a more thorough understanding of the problem. I'm very interested in seeing SciPy grow more of these kinds of estimator approaches where possible.