This is true for any sample you draw from the population, except when the sample mean happens to equal the population mean. The difference is small now, but using the sample mean still results in a smaller sum than using the population mean. In short, the bias comes from using the sample mean instead of the population mean. The sample mean always sits in the middle of the observed data, so the squared deviations around it are as small as they can possibly be, which shrinks the variance and produces an underestimate.
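To make this concrete, here is a minimal sketch that draws one small sample and compares the sum of squared deviations around the sample mean with the sum around the population mean. The population (normal, mean 100, standard deviation 15), the sample size of 5, and the helper name `sum_sq` are my own assumptions for illustration, not values from the simulations above.

```python
import random

def sum_sq(data, center):
    """Sum of squared deviations of the data around a given center."""
    return sum((x - center) ** 2 for x in data)

rng = random.Random(0)            # fixed seed so the run is repeatable
POP_MEAN, POP_SD = 100.0, 15.0    # assumed population parameters

sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(5)]
sample_mean = sum(sample) / len(sample)

ss_sample_mean = sum_sq(sample, sample_mean)  # deviations around the sample mean
ss_pop_mean = sum_sq(sample, POP_MEAN)        # deviations around the population mean

# The two sums differ by exactly n * (sample mean - population mean)^2,
# which can never be negative.
gap = len(sample) * (sample_mean - POP_MEAN) ** 2

print("around sample mean:    ", ss_sample_mean)
print("around population mean:", ss_pop_mean)
print("difference:            ", ss_pop_mean - ss_sample_mean, "vs", gap)
```

The difference between the two sums is n times the squared distance between the sample mean and the population mean, so the sum around the sample mean can only be equal or smaller.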
Now that we know that the bias is caused by using the sample mean, we can figure out how to solve the problem. Looking at the previous graphs, we see that if the sample mean is far from the population mean, the sample variance is smaller and the bias is large.
If the sample mean is close to the population mean, the sample variance is larger and the bias is small. So, the more the sample mean moves around the population mean, the greater the bias. In other words, besides the variance of the data points around the sample mean, there is also the variance of the sample mean around the population mean. We need both variances in order to accurately estimate the population variance. For that we need to know how to calculate the variance of the sample mean around the population mean.
The variance of the sample mean is the population variance divided by the sample size: $\sigma^2_{\bar{x}} = \sigma^2 / n$. This makes sense because the greater the variance in the population, the more the mean can jump around, but the more data you sample, the closer you get to the population mean. Now that we can calculate both the variance of the sample and the variance of the sample mean, we can check whether adding them together results in the population variance.
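A quick way to convince yourself of how the variance of the sample mean behaves is to simulate it. The sketch below assumes a normal population with standard deviation 2 (variance 4) and 20,000 repetitions per sample size; both numbers, and the function name `variance_of_sample_mean`, are arbitrary choices of mine.

```python
import random
import statistics

rng = random.Random(1)          # fixed seed so the run is repeatable
POP_MEAN, POP_SD = 0.0, 2.0     # assumed population; its variance is 4.0
N_REPS = 20000                  # assumed number of repeated samples

def variance_of_sample_mean(n):
    """Empirical variance of the mean of n draws around the population mean."""
    means = [
        statistics.fmean(rng.gauss(POP_MEAN, POP_SD) for _ in range(n))
        for _ in range(N_REPS)
    ]
    return sum((m - POP_MEAN) ** 2 for m in means) / N_REPS

for n in (2, 5, 10):
    # Compare the simulated value with population variance / n.
    print(n, variance_of_sample_mean(n), POP_SD ** 2 / n)
```

The simulated values should land close to the population variance divided by the sample size, shrinking as the samples get larger.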
Below I show a graph in which I again sampled from our population with varying sample sizes. I did this repeatedly for each sample size, averaged the variance of the sample and the variance of the sample mean, and stacked the two on top of each other. Indeed, we see that the variance of the sample and the variance of the sample mean together form the population variance. Now that we know that the variance of the population consists of these two parts, we can figure out the correction factor we need to apply to make the biased variance measure unbiased.
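The same check can be sketched without a graph. The code below is not the simulation behind the figure; it assumes a normal population with variance 9 and 10,000 repetitions per sample size, and the function name `decompose` is my own.

```python
import random

rng = random.Random(2)          # fixed seed so the run is repeatable
POP_MEAN, POP_SD = 0.0, 3.0     # assumed population; its variance is 9.0
N_REPS = 10000                  # assumed number of repeated samples

def decompose(n):
    """Return (average biased sample variance, variance of the sample mean)."""
    within_total = 0.0
    between_total = 0.0
    for _ in range(N_REPS):
        sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]
        mean = sum(sample) / n
        # Variance of the data around the sample mean (divides by n).
        within_total += sum((x - mean) ** 2 for x in sample) / n
        # Squared distance of the sample mean from the population mean.
        between_total += (mean - POP_MEAN) ** 2
    return within_total / N_REPS, between_total / N_REPS

for n in (2, 5, 20):
    within, between = decompose(n)
    print(n, round(within, 2), round(between, 2), round(within + between, 2))
```

For every sample size the last column should sit near the population variance of 9, while the share contributed by the sample mean shrinks as n grows.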
Previously, we found an interesting pattern in the simulated samples, which is also visible in the previous figure. But where does this correction factor come from? Well, because the sample variance misses the variance of the sample mean, we can expect the biased sample variance to equal, on average, the variance of the population minus the variance of the sample mean. In other words, with $\sigma^2$ the population variance, $n$ the sample size, and $s_n^2$ the biased sample variance:

$$E[s_n^2] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2$$

So, an unbiased measure of our sample variance is the biased sample variance times the correction factor:

$$s^2 = \frac{n}{n-1}\,s_n^2$$
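As a rough check of the correction factor, the sketch below averages the biased variance over many small samples and then multiplies by n / (n - 1). The standard normal population, the sample size of 4, and the 50,000 repetitions are assumptions I picked to make the bias easy to see.

```python
import random

rng = random.Random(3)          # fixed seed so the run is repeatable
POP_MEAN, POP_SD = 0.0, 1.0     # assumed population; its variance is 1.0
N = 4                           # assumed sample size
N_REPS = 50000                  # assumed number of repeated samples

def biased_variance(sample):
    """Variance around the sample mean, dividing by n."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)

avg_biased = sum(
    biased_variance([rng.gauss(POP_MEAN, POP_SD) for _ in range(N)])
    for _ in range(N_REPS)
) / N_REPS

# Apply the correction factor n / (n - 1) to the averaged biased variance.
corrected = avg_biased * N / (N - 1)

print("average biased variance:", avg_biased)
print("after correction:       ", corrected)
```

With n = 4 the average biased variance should come out near three quarters of the population variance, and the corrected value near the population variance itself.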
The correction factor corrects for this underestimation, producing an unbiased estimate of the population variance. Here I cheat a little because in order to calculate the variance of the sample mean, I need to use the unbiased variance formula.

One way is the biased sample variance, which is not an unbiased estimator of the population variance. And that's usually denoted by s with a subscript n.
And what is the biased estimator, and how do we calculate it? Well, we would calculate it very similarly to how we calculated the variance right over here. But we would do it for our sample, not our population. So for every data point in our sample, and we have n of them, we take that data point. And from it, we subtract our sample mean.
We subtract our sample mean, square it, and then divide by the number of data points that we have. But we already talked about it in the last video. What is our best unbiased estimate of the population variance? This is usually what we're trying to get at. We're trying to find an unbiased estimate of the population variance. Well, in the last video, we talked about what to do if we want an unbiased estimate, and here, in this video, I want to give you a sense of the intuition why.
We would take the sum. So we're going to go through every data point in our sample. We're going to take that data point, subtract from it the sample mean, square that.
But instead of dividing by n, we divide by n minus 1. We're dividing by a smaller number. And when you divide by a smaller number, you're going to get a larger value. So this is going to be larger. This is going to be smaller.
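To see the two calculations side by side, here is a small sketch on a made-up sample of eight numbers. The data and the function names `biased_variance` and `unbiased_variance` are just for illustration; the only library calls are `statistics.pvariance` and `statistics.variance` from Python's standard library, which divide by n and by n minus 1 respectively.

```python
import math
import statistics

def biased_variance(sample):
    """Sum of squared deviations from the sample mean, divided by n."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)

def unbiased_variance(sample):
    """Same sum of squared deviations, divided by n - 1 instead of n."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data, mean is 5

s_n_squared = biased_variance(sample)    # 32 / 8
s_squared = unbiased_variance(sample)    # 32 / 7

print("divide by n:    ", s_n_squared)
print("divide by n - 1:", s_squared)

# Cross-check against the standard library.
print(math.isclose(s_n_squared, statistics.pvariance(sample)))
print(math.isclose(s_squared, statistics.variance(sample)))
```

The sum of squared deviations is 32 in both cases, so dividing by 7 instead of 8 is the only thing that makes the second number larger.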
And this one, we refer to as the unbiased estimate. And this one, we refer to as the biased estimate. If people just write this, they're talking about the sample variance. It's a good idea to clarify which one they're talking about. But if you had to guess and people give you no further information, they're probably talking about the unbiased estimate of the variance. So you'd probably divide by n minus 1. But let's think about why this estimate would be biased and why we might want to have an estimate that is larger.
And then maybe in the future, we could have a computer program or something that really makes us feel better, that dividing by n minus 1 gives us a better estimate of the true population variance. So let's imagine all the data in a population. And I'm just going to plot them on a number line. So this is my number line.
And let me plot all the data points in my population. So this is some data. This is some data. Here's some data. And here is some data here. And I can just do as many points as I want.
When you divide by a smaller number you get a larger number. Let's think about what a larger vs. smaller sample variance means. If the sample variance is larger, then there is a greater chance that it captures the true population variance. Because we are trying to reveal information about a population by calculating the variance from a sample, we probably do not want to underestimate the variance. There was a good post here on CV that will give you some good insight.
Hope this helps!