In statistics, the bias of an estimator (or bias function) is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. In statistics, "bias" is an objective property of an estimator. Bias is a distinct concept from consistency: consistent estimators converge in probability to the true value of the parameter, but may be biased or unbiased (see bias versus consistency for more).
All else being equal, an unbiased estimator is preferable to a biased estimator, although in practice, biased estimators (with generally small bias) are frequently used. When a biased estimator is used, bounds of the bias are calculated. A biased estimator may be used for various reasons: because an unbiased estimator does not exist without further assumptions about a population; because an estimator is difficult to compute (as in unbiased estimation of standard deviation); because a biased estimator may be unbiased with respect to different measures of central tendency; because a biased estimator gives a lower value of some loss function (particularly mean squared error) compared with unbiased estimators (notably in shrinkage estimators); or because in some cases being unbiased is too strong a condition, and the only unbiased estimators are not useful.
Bias can also be measured with respect to the median, rather than the mean (expected value), in which case one distinguishes median-unbiased from the usual mean-unbiasedness property. Mean-unbiasedness is not preserved under non-linear transformations, though median-unbiasedness is (see § Effect of transformations). For example, the uncorrected sample variance is a biased estimator for the population variance, and the square root of the unbiased sample variance, the corrected sample standard deviation, is a biased estimator of the population standard deviation. These are all illustrated below.
An unbiased estimator for a parameter need not always exist. For example, there is no unbiased estimator for the reciprocal of the parameter of a binomial random variable.
Definition
Suppose we have a statistical model, parameterized by a real number θ, giving rise to a probability distribution for observed data, $P_\theta(x) = P(x\mid\theta)$, and a statistic $\hat\theta$ which serves as an estimator of θ based on any observed data $x$. That is, we assume that our data follows some unknown distribution $P(x\mid\theta)$ (where θ is a fixed, unknown constant that is part of this distribution), and then we construct some estimator $\hat\theta$ that maps observed data to values that we hope are close to θ. The bias of $\hat\theta$ relative to $\theta$ is defined as

$$\operatorname{Bias}(\hat\theta,\theta) = \operatorname{Bias}_\theta[\,\hat\theta\,] = \operatorname{E}_{x\mid\theta}[\,\hat\theta\,] - \theta = \operatorname{E}_{x\mid\theta}[\,\hat\theta - \theta\,],$$
where $\operatorname{E}_{x\mid\theta}$ denotes expected value over the distribution $P(x\mid\theta)$ (i.e., averaging over all possible observations $x$). The second equation follows since θ is measurable with respect to the conditional distribution $P(x\mid\theta)$.
An estimator is said to be unbiased if its bias is equal to zero for all values of parameter θ, or equivalently, if the expected value of the estimator matches the true value of the parameter. Unbiasedness is not guaranteed to carry over under transformations. For example, if $\hat\theta$ is an unbiased estimator for parameter θ, it is not guaranteed that $g(\hat\theta)$ is an unbiased estimator for $g(\theta)$.
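For instance, if $\hat\theta$ is a mean-unbiased estimator of θ with positive variance, then $\hat\theta^2$ is a positively biased estimator of $\theta^2$ (see also § Effect of transformations below), since

$$\operatorname{E}[\hat\theta^2] = \operatorname{Var}(\hat\theta) + \left(\operatorname{E}[\hat\theta]\right)^2 = \operatorname{Var}(\hat\theta) + \theta^2 > \theta^2.$$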
In a simulation experiment concerning the properties of an estimator, the bias of the estimator may be assessed using the mean signed difference.
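For instance, a minimal simulation sketch along these lines (in Python, with an illustrative sample size and a deliberately biased estimator, the uncorrected sample variance) estimates the bias as the mean signed difference between the simulated estimates and the true parameter value:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2_true = 4.0          # true population variance; illustrative value
n, n_reps = 10, 100_000    # sample size and number of simulated data sets

# Uncorrected sample variance (divides by n), which the article shows is biased.
estimates = np.array([
    np.var(rng.normal(0.0, np.sqrt(sigma2_true), size=n))   # np.var uses ddof=0
    for _ in range(n_reps)
])

# The mean signed difference approximates the bias E[estimator] - parameter.
mean_signed_difference = np.mean(estimates - sigma2_true)
print(mean_signed_difference)   # close to -sigma2_true / n = -0.4
```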
Examples
Sample variance
The sample variance of a random variable demonstrates two aspects of estimator bias: firstly, the naive estimator is biased, which can be corrected by a scale factor; second, the unbiased estimator is not optimal in terms of mean squared error (MSE), which can be minimized by using a different scale factor, resulting in a biased estimator with lower MSE than the unbiased estimator. Concretely, the naive estimator sums the squared deviations and divides by n, which is biased. Dividing instead by n − 1 yields an unbiased estimator. Conversely, MSE can be minimized by dividing by a different number (depending on distribution), but this results in a biased estimator. This number is always larger than n − 1, so this is known as a shrinkage estimator, as it "shrinks" the unbiased estimator towards zero; for the normal distribution the optimal value is n + 1.
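The following sketch (Python, with illustrative choices of σ2, n and the number of replications) illustrates these claims by simulation, comparing the divisors n − 1, n and n + 1 for normally distributed data:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, n, n_reps = 4.0, 10, 200_000   # illustrative true variance, sample size, replications

samples = rng.normal(0.0, np.sqrt(sigma2), size=(n_reps, n))
ss = np.sum((samples - samples.mean(axis=1, keepdims=True)) ** 2, axis=1)

for divisor in (n - 1, n, n + 1):
    est = ss / divisor
    bias = est.mean() - sigma2
    mse = np.mean((est - sigma2) ** 2)
    print(f"divide by {divisor:2d}: bias ~ {bias:+.3f}, MSE ~ {mse:.3f}")
# Dividing by n - 1 is (approximately) unbiased; dividing by n + 1 gives the smallest MSE.
```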
Suppose X1, ..., Xn are independent and identically distributed (i.i.d.) random variables with expectation μ and variance σ2. If the sample mean and uncorrected sample variance are defined as

$$\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i, \qquad S^2 = \frac{1}{n}\sum_{i=1}^n \left(X_i - \overline{X}\right)^2,$$
then S2 is a biased estimator of σ2, because

$$\begin{aligned}
\operatorname{E}[S^2]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^n\left(X_i-\overline{X}\right)^2\right]
 = \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^n\left((X_i-\mu)-(\overline{X}-\mu)\right)^2\right]\\[4pt]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^n\left((X_i-\mu)^2-2(\overline{X}-\mu)(X_i-\mu)+(\overline{X}-\mu)^2\right)\right]\\[4pt]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^n(X_i-\mu)^2-\frac{2}{n}(\overline{X}-\mu)\sum_{i=1}^n(X_i-\mu)+\frac{1}{n}(\overline{X}-\mu)^2\cdot n\right]\\[4pt]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^n(X_i-\mu)^2-\frac{2}{n}(\overline{X}-\mu)\sum_{i=1}^n(X_i-\mu)+(\overline{X}-\mu)^2\right].
\end{aligned}$$
To continue, we note that by subtracting $\mu$ from both sides of $\overline{X}=\frac{1}{n}\sum_{i=1}^n X_i$, we get

$$\overline{X}-\mu=\frac{1}{n}\sum_{i=1}^n X_i-\mu=\frac{1}{n}\sum_{i=1}^n X_i-\frac{1}{n}\sum_{i=1}^n\mu=\frac{1}{n}\sum_{i=1}^n\left(X_i-\mu\right).$$
Meaning, (by cross-multiplication) $n\cdot(\overline{X}-\mu)=\sum_{i=1}^n\left(X_i-\mu\right)$. Then, the previous becomes:

$$\begin{aligned}
\operatorname{E}[S^2]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^n(X_i-\mu)^2-\frac{2}{n}(\overline{X}-\mu)\cdot n\cdot(\overline{X}-\mu)+(\overline{X}-\mu)^2\right]\\[4pt]
&= \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^n(X_i-\mu)^2-2(\overline{X}-\mu)^2+(\overline{X}-\mu)^2\right]
 = \operatorname{E}\left[\frac{1}{n}\sum_{i=1}^n(X_i-\mu)^2\right]-\operatorname{E}\left[(\overline{X}-\mu)^2\right]\\[4pt]
&= \sigma^2-\operatorname{E}\left[(\overline{X}-\mu)^2\right]=\left(1-\frac{1}{n}\right)\sigma^2<\sigma^2.
\end{aligned}$$
This can be seen by noting the following formula, which follows from the Bienaymé formula, for the term in the inequality for the expectation of the uncorrected sample variance above: $\operatorname{E}\left[(\overline{X}-\mu)^2\right]=\frac{1}{n}\sigma^2$.
In other words, the expected value of the uncorrected sample variance does not equal the population variance σ2, unless multiplied by a normalization factor. The sample mean, on the other hand, is an unbiased estimator of the population mean μ.
Note that the usual definition of sample variance is $S^2=\frac{1}{n-1}\sum_{i=1}^n\left(X_i-\overline{X}\right)^2$, and this is an unbiased estimator of the population variance.
Algebraically speaking, S2 is unbiased because:

$$\begin{aligned}
\operatorname{E}[S^2]
&= \operatorname{E}\left[\frac{1}{n-1}\sum_{i=1}^n\left(X_i-\overline{X}\right)^2\right]
 = \frac{n}{n-1}\operatorname{E}\left[\frac{1}{n}\sum_{i=1}^n\left(X_i-\overline{X}\right)^2\right]\\[4pt]
&= \frac{n}{n-1}\left(1-\frac{1}{n}\right)\sigma^2=\sigma^2,
\end{aligned}$$
where the transition to the second line uses the result derived above for the biased estimator. Thus $\operatorname{E}[S^2]=\sigma^2$, and therefore $S^2=\frac{1}{n-1}\sum_{i=1}^n\left(X_i-\overline{X}\right)^2$ is an unbiased estimator of the population variance, σ2. The ratio between the biased (uncorrected) and unbiased estimates of the variance is known as Bessel's correction.
The reason that an uncorrected sample variance, S2, is biased stems from the fact that the sample mean is an ordinary least squares (OLS) estimator for μ: $\overline{X}$ is the number that makes the sum $\sum_{i=1}^n\left(X_i-\overline{X}\right)^2$ as small as possible. That is, when any other number is plugged into this sum, the sum can only increase. In particular, the choice $\mu\neq\overline{X}$ gives

$$\frac{1}{n}\sum_{i=1}^n\left(X_i-\overline{X}\right)^2<\frac{1}{n}\sum_{i=1}^n\left(X_i-\mu\right)^2,$$
and then

$$\operatorname{E}[S^2]=\operatorname{E}\left[\frac{1}{n}\sum_{i=1}^n\left(X_i-\overline{X}\right)^2\right]<\operatorname{E}\left[\frac{1}{n}\sum_{i=1}^n\left(X_i-\mu\right)^2\right]=\sigma^2.$$
The above discussion can be understood in geometric terms: the vector $\vec{C}=(X_1-\mu,\ldots,X_n-\mu)$ can be decomposed into the "mean part" and "variance part" by projecting to the direction of $\vec{u}=(1,\ldots,1)$ and to that direction's orthogonal complement hyperplane. One gets $\vec{A}=(\overline{X}-\mu,\ldots,\overline{X}-\mu)$ for the part along $\vec{u}$ and $\vec{B}=(X_1-\overline{X},\ldots,X_n-\overline{X})$ for the complementary part. Since this is an orthogonal decomposition, the Pythagorean theorem says $|\vec{C}|^2=|\vec{A}|^2+|\vec{B}|^2$, and taking expectations we get $n\sigma^2=n\operatorname{E}\left[(\overline{X}-\mu)^2\right]+n\operatorname{E}[S^2]$, as above (but times $n$). If the distribution of $\vec{C}$ is rotationally symmetric, as in the case when the $X_i$ are sampled from a Gaussian, then on average the dimension along $\vec{u}$ contributes to $|\vec{C}|^2$ equally as the $n-1$ directions perpendicular to $\vec{u}$, so that $\operatorname{E}\left[(\overline{X}-\mu)^2\right]=\frac{\sigma^2}{n}$ and $\operatorname{E}[S^2]=\frac{(n-1)\sigma^2}{n}$. This is in fact true in general, as explained above.
Estimating a Poisson probability
A far more extreme case of a biased estimator being better than any unbiased estimator arises from the Poisson distribution. Suppose that X has a Poisson distribution with expectation λ. Suppose it is desired to estimate

$$\operatorname{P}(X=0)^2=e^{-2\lambda}$$
with a sample of size 1. (For example, when incoming calls at a telephone switchboard are modeled as a Poisson process, and λ is the average number of calls per minute, then e−2λ is the probability that no calls arrive in the next two minutes.)
Since the expectation of an unbiased estimator δ(X) is equal to the estimand, i.e.

$$\operatorname{E}[\delta(X)]=\sum_{x=0}^{\infty}\delta(x)\,\frac{\lambda^{x}e^{-\lambda}}{x!}=e^{-2\lambda},$$
the only function of the data constituting an unbiased estimator is

$$\delta(x)=(-1)^{x}.$$
To see this, note that when decomposing e−λ from the above expression for expectation, the sum that is left is a Taylor series expansion of e−λ as well, yielding e−λe−λ = e−2λ (see Characterizations of the exponential function).
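Written out, the unbiasedness requirement above says

$$e^{-\lambda}\sum_{x=0}^{\infty}\delta(x)\,\frac{\lambda^{x}}{x!}=e^{-2\lambda}
\quad\Longleftrightarrow\quad
\sum_{x=0}^{\infty}\delta(x)\,\frac{\lambda^{x}}{x!}=e^{-\lambda}=\sum_{x=0}^{\infty}\frac{(-\lambda)^{x}}{x!},$$

and matching the coefficients of the two power series in λ forces δ(x) = (−1)x for every x.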
If the observed value of X is 100, then the estimate is 1, although the true value of the quantity being estimated is very likely to be near 0, which is the opposite extreme. And, if X is observed to be 101, then the estimate is even more absurd: It is −1, although the quantity being estimated must be positive.
The (biased) maximum likelihood estimator

$$e^{-2X}$$
is far better than this unbiased estimator. Not only is its value always positive, but it is also more accurate in the sense that its mean squared error

$$e^{-4\lambda}-2e^{\lambda(1/e^{2}-3)}+e^{\lambda(1/e^{4}-1)}$$
is smaller; compare the unbiased estimator's MSE of

$$1-e^{-4\lambda}.$$
The MSEs are functions of the true value λ. The bias of the maximum-likelihood estimator is:

$$e^{-2\lambda}\left(e^{\lambda(1/e^{2}+1)}-1\right).$$
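A quick Monte Carlo check of these comparisons (a Python sketch; the value of λ and the number of replications are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n_reps = 2.0, 1_000_000        # illustrative rate; the target is e^(-2*lam)
target = np.exp(-2 * lam)

x = rng.poisson(lam, size=n_reps)   # many replications of a sample of size 1
unbiased = (-1.0) ** x              # delta(X) = (-1)^X, the only unbiased estimator
mle = np.exp(-2.0 * x)              # the biased maximum-likelihood estimator e^(-2X)

print("bias: unbiased", unbiased.mean() - target, "| MLE", mle.mean() - target)
print("MSE : unbiased", np.mean((unbiased - target) ** 2),
      "| MLE", np.mean((mle - target) ** 2))
# The unbiased estimator's MSE is near 1 - e^(-4*lam); the MLE's MSE is far smaller.
```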
Maximum of a discrete uniform distribution
The bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random, giving a value X. If n is unknown, then the maximum-likelihood estimator of n is X, even though the expectation of X given n is only (n + 1)/2; we can be certain only that n is at least X and is probably more. In this case, the natural unbiased estimator is 2X − 1.
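A short simulation sketch (Python; the true n and the number of draws are illustrative) confirms both the downward bias of the maximum-likelihood estimator X and the unbiasedness of 2X − 1:

```python
import numpy as np

rng = np.random.default_rng(3)
n_true, n_reps = 50, 200_000   # illustrative number of tickets and replications

x = rng.integers(1, n_true + 1, size=n_reps)   # one ticket drawn uniformly from 1..n

print("E[X]      ~", x.mean(), " (theory: (n + 1)/2 =", (n_true + 1) / 2, ")")
print("E[2X - 1] ~", (2 * x - 1).mean(), " (theory: n =", n_true, ")")
# The MLE X underestimates n by about (n - 1)/2 on average; 2X - 1 is unbiased.
```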
Median-unbiased estimators
The theory of median-unbiased estimators was revived by George W. Brown in 1947:
An estimate of a one-dimensional parameter θ will be said to be median-unbiased, if, for fixed θ, the median of the distribution of the estimate is at the value θ; i.e., the estimate underestimates just as often as it overestimates. This requirement seems for most purposes to accomplish as much as the mean-unbiased requirement and has the additional property that it is invariant under one-to-one transformation.
Further properties of median-unbiased estimators have been noted by Lehmann, Birnbaum, van der Vaart and Pfanzagl. In particular, median-unbiased estimators exist in cases where mean-unbiased and maximum-likelihood estimators do not exist. They are invariant under one-to-one transformations.
There are methods of constructing median-unbiased estimators for probability distributions that have monotone likelihood functions, such as one-parameter exponential families, to ensure that they are optimal (in a sense analogous to the minimum-variance property considered for mean-unbiased estimators). One such procedure is an analogue of the Rao–Blackwell procedure for mean-unbiased estimators: the procedure holds for a smaller class of probability distributions than does the Rao–Blackwell procedure for mean-unbiased estimation, but for a larger class of loss functions.
Bias with respect to other loss functions
Any minimum-variance mean-unbiased estimator minimizes the risk (expected loss) with respect to the squared-error loss function (among mean-unbiased estimators), as observed by Gauss. A minimum-average absolute deviation median-unbiased estimator minimizes the risk with respect to the absolute loss function (among median-unbiased estimators), as observed by Laplace. Other loss functions are used in statistics, particularly in robust statistics.
Effect of transformations
For univariate parameters, median-unbiased estimators remain median-unbiased under transformations that preserve order (or reverse order). Note that, when a transformation is applied to a mean-unbiased estimator, the result need not be a mean-unbiased estimator of its corresponding population statistic. By Jensen's inequality, a convex function as transformation will introduce positive bias, while a concave function will introduce negative bias, and a function of mixed convexity may introduce bias in either direction, depending on the specific function and distribution. That is, for a non-linear function f and a mean-unbiased estimator U of a parameter p, the composite estimator f(U) need not be a mean-unbiased estimator of f(p). For example, the square root of the unbiased estimator of the population variance is not a mean-unbiased estimator of the population standard deviation: the square root of the unbiased sample variance, the corrected sample standard deviation, is biased. The bias depends both on the sampling distribution of the estimator and on the transform, and can be quite involved to calculate – see unbiased estimation of standard deviation for a discussion in this case.
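The following Python sketch (with an illustrative σ and a deliberately small sample size, where the effect is most visible) shows this concretely: the corrected sample variance is unbiased for σ2, yet its square root systematically underestimates σ:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, n, n_reps = 2.0, 5, 200_000   # illustrative true sd and a small sample size

samples = rng.normal(0.0, sigma, size=(n_reps, n))
s2 = samples.var(axis=1, ddof=1)     # Bessel-corrected (unbiased) variance estimate

print("E[s^2] ~", s2.mean(), " (unbiased for sigma^2 =", sigma ** 2, ")")
print("E[s]   ~", np.sqrt(s2).mean(), "(biased low for sigma =", sigma, ")")
# The square root is concave, so by Jensen's inequality E[s] < sigma.
```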
Bias, variance and mean squared error
While bias quantifies the average difference to be expected between an estimator and an underlying parameter, an estimator based on a finite sample can additionally be expected to differ from the parameter due to the randomness in the sample. An estimator that minimises the bias will not necessarily minimise the mean square error. One measure which is used to try to reflect both types of difference is the mean square error,

$$\operatorname{MSE}(\hat\theta)=\operatorname{E}\left[\left(\hat\theta-\theta\right)^2\right].$$
This can be shown to be equal to the square of the bias, plus the variance:

$$\operatorname{MSE}(\hat\theta)=\left(\operatorname{E}[\hat\theta]-\theta\right)^2+\operatorname{E}\left[\left(\hat\theta-\operatorname{E}[\hat\theta]\right)^2\right]=\operatorname{Bias}(\hat\theta,\theta)^2+\operatorname{Var}(\hat\theta).$$
When the parameter is a vector, an analogous decomposition applies:

$$\operatorname{MSE}(\hat\theta)=\operatorname{trace}\left(\operatorname{Cov}(\hat\theta)\right)+\left\lVert\operatorname{Bias}(\hat\theta,\theta)\right\rVert^2,$$
where $\operatorname{trace}(\operatorname{Cov}(\hat\theta))$ is the trace (diagonal sum) of the covariance matrix of the estimator and $\left\lVert\operatorname{Bias}(\hat\theta,\theta)\right\rVert^2$ is the squared vector norm of the bias.
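This decomposition is easy to verify numerically; the sketch below (Python, with an illustrative shrunken sample mean as the deliberately biased estimator) checks that the simulated MSE matches the sum of the squared bias and the variance:

```python
import numpy as np

rng = np.random.default_rng(5)
mu, n, n_reps = 3.0, 20, 100_000   # illustrative true mean and sample size

# A deliberately biased estimator of mu: the sample mean shrunk towards zero.
estimates = 0.9 * rng.normal(mu, 1.0, size=(n_reps, n)).mean(axis=1)

bias = estimates.mean() - mu
variance = estimates.var()
mse = np.mean((estimates - mu) ** 2)

print("bias^2 + variance ~", bias ** 2 + variance)
print("MSE               ~", mse)   # the two agree up to Monte Carlo error
```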
Example: Estimation of population variance
For example, suppose an estimator of the form

$$T^2=c\sum_{i=1}^n\left(X_i-\overline{X}\right)^2=c\,nS^2$$
is sought for the population variance as above, but this time to minimise the MSE:

$$\begin{aligned}\operatorname{MSE}&=\operatorname{E}\left[\left(T^2-\sigma^2\right)^2\right]\\&=\left(\operatorname{E}\left[T^2-\sigma^2\right]\right)^2+\operatorname{Var}(T^2).\end{aligned}$$
If the variables X1 ... Xn follow a normal distribution, then nS2/σ2 has a chi-squared distribution with n − 1 degrees of freedom, giving:

$$\operatorname{E}[nS^2]=(n-1)\sigma^2\quad\text{and}\quad\operatorname{Var}(nS^2)=2(n-1)\sigma^4,$$
and so

$$\operatorname{MSE}=\left(c(n-1)-1\right)^2\sigma^4+2c^2(n-1)\sigma^4.$$
With a little algebra it can be confirmed that it is c = 1/(n + 1) which minimises this combined loss function, rather than c = 1/(n − 1) which minimises just the square of the bias.
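Explicitly, dropping the common factor σ4 and differentiating with respect to c gives

$$\frac{d}{dc}\left[\left(c(n-1)-1\right)^2+2c^2(n-1)\right]=2(n-1)\left(c(n-1)-1\right)+4c(n-1)=0,$$

which simplifies to c(n − 1) − 1 + 2c = 0, i.e. c = 1/(n + 1); setting only the bias term to zero instead gives c = 1/(n − 1).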
More generally it is only in restricted classes of problems that there will be an estimator that minimises the MSE independently of the parameter values.
However it is very common that there may be perceived to be a bias–variance tradeoff, such that a small increase in bias can be traded for a larger decrease in variance, resulting in a more desirable estimator overall.
Bayesian view
Most Bayesians are rather unconcerned about unbiasedness (at least in the formal sampling-theory sense above) of their estimates. For example, Gelman and coauthors (1995) write: "From a Bayesian perspective, the principle of unbiasedness is reasonable in the limit of large samples, but otherwise it is potentially misleading."
Fundamentally, the difference between the Bayesian approach and the sampling-theory approach above is that in the sampling-theory approach the parameter is taken as fixed, and then probability distributions of a statistic are considered, based on the predicted sampling distribution of the data. For a Bayesian, however, it is the data which are known, and fixed, and it is the unknown parameter for which an attempt is made to construct a probability distribution, using Bayes' theorem:

$$p(\theta\mid D,I)\propto p(\theta\mid I)\,p(D\mid\theta,I).$$
Here the second term, the likelihood of the data given the unknown parameter value θ, depends just on the data obtained and the modelling of the data generation process. However a Bayesian calculation also includes the first term, the prior probability for θ, which takes account of everything the analyst may know or suspect about θ before the data comes in. This information plays no part in the sampling-theory approach; indeed any attempt to include it would be considered "bias" away from what was pointed to purely by the data. To the extent that Bayesian calculations include prior information, it is therefore essentially inevitable that their results will not be "unbiased" in sampling theory terms.
But the results of a Bayesian approach can differ from the sampling theory approach even if the Bayesian tries to adopt an "uninformative" prior.
For example, consider again the estimation of an unknown population variance σ2 of a Normal distribution with unknown mean, where it is desired to optimise c in the expected loss function

$$\operatorname{ExpectedLoss}=\operatorname{E}\left[\left(c\,nS^2-\sigma^2\right)^2\right]=\operatorname{E}\left[\sigma^4\left(c\,n\tfrac{S^2}{\sigma^2}-1\right)^2\right].$$
A standard choice of uninformative prior for this problem is the Jeffreys prior, $p(\sigma^2)\propto 1/\sigma^2$, which is equivalent to adopting a rescaling-invariant flat prior for ln(σ2).
One consequence of adopting this prior is that S2/σ2 remains a pivotal quantity, i.e. the probability distribution of S2/σ2 depends only on S2/σ2, independent of the value of S2 or σ2:

$$p\left(\tfrac{S^2}{\sigma^2}\mid S^2\right)=p\left(\tfrac{S^2}{\sigma^2}\mid\sigma^2\right)=g\left(\tfrac{S^2}{\sigma^2}\right).$$
However, while

$$\operatorname{E}_{p(S^2\mid\sigma^2)}\left[\sigma^4\left(c\,n\tfrac{S^2}{\sigma^2}-1\right)^2\right]=\sigma^4\,\operatorname{E}_{p(S^2\mid\sigma^2)}\left[\left(c\,n\tfrac{S^2}{\sigma^2}-1\right)^2\right],$$
in contrast

$$\operatorname{E}_{p(\sigma^2\mid S^2)}\left[\sigma^4\left(c\,n\tfrac{S^2}{\sigma^2}-1\right)^2\right]\neq\sigma^4\,\operatorname{E}_{p(\sigma^2\mid S^2)}\left[\left(c\,n\tfrac{S^2}{\sigma^2}-1\right)^2\right]$$
— when the expectation is taken over the probability distribution of σ2 given S2, as it is in the Bayesian case, rather than S2 given σ2, one can no longer take σ4 as a constant and factor it out. The consequence of this is that, compared to the sampling-theory calculation, the Bayesian calculation puts more weight on larger values of σ2, properly taking into account (as the sampling-theory calculation cannot) that under this squared-loss function the consequence of underestimating large values of σ2 is more costly in squared-loss terms than that of overestimating small values of σ2.
The worked-out Bayesian calculation gives a scaled inverse chi-squared distribution with n − 1 degrees of freedom for the posterior probability distribution of σ2. The expected loss is minimised when cnS2 = ⟨σ2⟩, the posterior mean of σ2; this occurs when c = 1/(n − 3).
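A sketch of where c = 1/(n − 3) comes from, assuming (as the scaled inverse chi-squared posterior quoted above implies) that σ2 given S2 has n − 1 degrees of freedom and scale nS2/(n − 1): the posterior mean is then

$$\langle\sigma^2\rangle=\frac{n-1}{(n-1)-2}\cdot\frac{nS^2}{n-1}=\frac{nS^2}{n-3},$$

and since the expected loss E[(cnS2 − σ2)2] is a quadratic in cnS2, it is minimised by setting cnS2 equal to this posterior mean, which gives c = 1/(n − 3).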
Even with an uninformative prior, therefore, a Bayesian calculation may not give the same expected-loss minimising result as the corresponding sampling-theory calculation.
See also
- Consistent estimator
- Efficient estimator
- Estimation theory
- Expected loss
- Expected value
- Loss function
- Minimum-variance unbiased estimator
- Omitted-variable bias
- Optimism bias
- Ratio estimator
- Statistical decision theory
Notes
- "For the binomial distribution, why does no unbiased estimator exist for $1/p$?". Mathematics Stack Exchange. Retrieved 2023-12-27.
- Kozdron, Michael (March 2016). "Evaluating the Goodness of an Estimator: Bias, Mean-Square Error, Relative Efficiency (Chapter 3)" (PDF). stat.math.uregina.ca. Retrieved 2020-09-11.
- Taylor, Courtney (January 13, 2019). "Unbiased and Biased Estimators". ThoughtCo. Retrieved 2020-09-12.
- Dekking, Michel, ed. (2005). A modern introduction to probability and statistics: understanding why and how. Springer texts in statistics. London [Heidelberg]: Springer. ISBN 978-1-85233-896-1.
- Richard Arnold Johnson; Dean W. Wichern (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall. ISBN 978-0-13-187715-3. Retrieved 10 August 2012.
- Romano, J. P.; Siegel, A. F. (1986). Counterexamples in Probability and Statistics. Monterey, California, USA: Wadsworth & Brooks / Cole. p. 168.
- Hardy, M. (1 March 2003). "An Illuminating Counterexample". American Mathematical Monthly. 110 (3): 234–238. arXiv:math/0206006. doi:10.2307/3647938. ISSN 0002-9890. JSTOR 3647938.
- Brown (1947), page 583
- Lehmann 1951; Birnbaum 1961; Van der Vaart 1961; Pfanzagl 1994
- Pfanzagl, Johann (1979). "On optimal median unbiased estimators in the presence of nuisance parameters". The Annals of Statistics. 7 (1): 187–193. doi:10.1214/aos/1176344563.
- Brown, L. D.; Cohen, Arthur; Strawderman, W. E. (1976). "A Complete Class Theorem for Strict Monotone Likelihood Ratio With Applications". Ann. Statist. 4 (4): 712–722. doi:10.1214/aos/1176343543.
- Dodge, Yadolah, ed. (1987). Statistical Data Analysis Based on the L1-Norm and Related Methods. Papers from the First International Conference held at Neuchâtel, August 31–September 4, 1987. Amsterdam: North-Holland. ISBN 0-444-70273-3.
- Jaynes, E. T. (2007). Probability Theory : The Logic of Science. Cambridge: Cambridge Univ. Press. p. 172. ISBN 978-0-521-59271-0.
- Klebanov, Lev B.; Rachev, Svetlozar T.; Fabozzi, Frank J. (2009). "Loss Functions and the Theory of Unbiased Estimation". Robust and Non-Robust Models in Statistics. New York: Nova Scientific. ISBN 978-1-60741-768-2.
- Taboga, Marco (2010). "Lectures on probability theory and mathematical statistics".
- DeGroot, Morris H. (1986). Probability and Statistics (2nd ed.). Addison-Wesley. pp. 414–5. ISBN 0-201-11366-X. But compare it with, for example, the discussion in Casella; Berger (2001). Statistical Inference (2nd ed.). Duxbury. p. 332. ISBN 0-534-24312-6.
- Gelman, A.; et al. (1995). Bayesian Data Analysis. Chapman and Hall. p. 108. ISBN 0-412-03991-5.
References
- Brown, George W. "On Small-Sample Estimation." The Annals of Mathematical Statistics, vol. 18, no. 4 (Dec., 1947), pp. 582–585. JSTOR 2236236.
- Lehmann, E. L. (December 1951). "A General Concept of Unbiasedness". The Annals of Mathematical Statistics. 22 (4): 587–592. doi:10.1214/aoms/1177729549. JSTOR 2236928.
- Birnbaum, Allan (March 1961). "A Unified Theory of Estimation, I". The Annals of Mathematical Statistics. 32 (1): 112–135. doi:10.1214/aoms/1177705145.
- Van der Vaart, H. R. (June 1961). "Some Extensions of the Idea of Bias". The Annals of Mathematical Statistics. 32 (2): 436–447. doi:10.1214/aoms/1177705051.
- Pfanzagl, Johann (1994). Parametric Statistical Theory. Walter de Gruyter.
- Stuart, Alan; Ord, Keith; Arnold, Steven [F.] (2010). Classical Inference and the Linear Model. Kendall's Advanced Theory of Statistics. Vol. 2A. Wiley. ISBN 978-0-4706-8924-0.
- Voinov, Vassily [G.]; Nikulin, Mikhail [S.] (1993). Unbiased estimators and their applications. Vol. 1: Univariate case. Dordrecht: Kluwer Academic Publishers. ISBN 0-7923-2382-3.
- Voinov, Vassily [G.]; Nikulin, Mikhail [S.] (1996). Unbiased estimators and their applications. Vol. 2: Multivariate case. Dordrecht: Kluwer Academic Publishers. ISBN 0-7923-3939-8.
- Klebanov, Lev [B.]; Rachev, Svetlozar [T.]; Fabozzi, Frank [J.] (2009). Robust and Non-Robust Models in Statistics. New York: Nova Scientific Publishers. ISBN 978-1-60741-768-2.
External links
- "Unbiased estimator", Encyclopedia of Mathematics, EMS Press, 2001 [1994]