A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process. When referring specifically to probabilities, the corresponding term is probabilistic model. All statistical hypothesis tests and all statistical estimators are derived via statistical models. More generally, statistical models are part of the foundation of statistical inference. A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is "a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen).
Introduction
Informally, a statistical model can be thought of as a statistical assumption (or set of statistical assumptions) with a certain property: that the assumption allows us to calculate the probability of any event. As an example, consider a pair of ordinary six-sided dice. We will study two different statistical assumptions about the dice.
The first statistical assumption is this: for each of the dice, the probability of each face (1, 2, 3, 4, 5, and 6) coming up is 1/6. From that assumption, we can calculate the probability of both dice coming up 5: 1/6 × 1/6 = 1/36. More generally, we can calculate the probability of any event: e.g. (1 and 2) or (3 and 3) or (5 and 6). The alternative statistical assumption is this: for each of the dice, the probability of the face 5 coming up is 1/8 (because the dice are weighted). From that assumption, we can calculate the probability of both dice coming up 5: 1/8 × 1/8 = 1/64. We cannot, however, calculate the probability of any other nontrivial event, as the probabilities of the other faces are unknown.
The first statistical assumption constitutes a statistical model because, with the assumption alone, we can calculate the probability of any event. The alternative statistical assumption does not constitute a statistical model because, with the assumption alone, we cannot calculate the probability of every event. In the example above, with the first assumption, calculating the probability of an event is easy. With some other examples, though, the calculation can be difficult, or even impractical (e.g. it might require millions of years of computation). For an assumption to constitute a statistical model, such difficulty is acceptable: the calculation does not need to be practicable, just theoretically possible.
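The dice example can be made concrete with a short sketch. It assumes, as the products 1/6 × 1/6 above implicitly do, that the two dice are independent; the helper name `event_probability` is hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# First assumption: every face of each die has probability 1/6.
# Independence of the two dice is an added assumption of this sketch.
FACES = range(1, 7)
face_prob = {face: Fraction(1, 6) for face in FACES}

def event_probability(event):
    """Sum the probabilities of all ordered outcomes (a, b) for which event(a, b) is true."""
    return sum(
        face_prob[a] * face_prob[b]
        for a, b in product(FACES, repeat=2)
        if event(a, b)
    )

both_five = event_probability(lambda a, b: a == 5 and b == 5)
sum_is_seven = event_probability(lambda a, b: a + b == 7)
print(both_five)     # 1/36
print(sum_is_seven)  # 1/6
```

Under the alternative assumption only the probability of face 5 is known, so the table `face_prob` could not be filled in and most events could not be evaluated. That is the sense in which the alternative assumption fails to be a statistical model.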
Formal definition
In mathematical terms, a statistical model is a pair $(S, \mathcal{P})$, where $S$ is the set of possible observations, i.e. the sample space, and $\mathcal{P}$ is a set of probability distributions on $S$. The set $\mathcal{P}$ represents all of the models that are considered possible. This set is typically parameterized: $\mathcal{P} = \{F_\theta : \theta \in \Theta\}$. The set $\Theta$ defines the parameters of the model. If a parameterization is such that distinct parameter values give rise to distinct distributions, i.e. $F_{\theta_1} = F_{\theta_2} \Rightarrow \theta_1 = \theta_2$ (in other words, the mapping is injective), it is said to be identifiable.
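As a small worked illustration of identifiability, consider Gaussian distributions on the real line indexed by a pair $(\mu, \sigma)$. The family is chosen here only for illustration. If $\sigma$ is allowed to be any nonzero real number and the distribution depends on it only through $\sigma^2$, two distinct parameter values give the same distribution:

$$\Theta = \mathbb{R} \times (\mathbb{R} \setminus \{0\}), \qquad F_{(\mu,\sigma)} = \mathcal{N}(\mu, \sigma^2), \qquad F_{(\mu,\sigma)} = F_{(\mu,-\sigma)} \ \text{ although } \ (\mu,\sigma) \neq (\mu,-\sigma).$$

This parameterization is therefore not identifiable. Restricting the parameter set to $\mathbb{R} \times (0, \infty)$ removes the ambiguity: a Gaussian distribution determines its mean and its positive standard deviation, so the mapping from parameter to distribution becomes injective.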
In some cases, the model can be more complex.
- In Bayesian statistics, the model is extended by adding a probability distribution over the parameter space $\Theta$.
- A statistical model can sometimes distinguish two sets of probability distributions. The first set $\mathcal{Q} = \{F_\theta : \theta \in \Theta\}$ is the set of models considered for inference. The second set $\mathcal{P} = \{F_\lambda : \lambda \in \Lambda\}$ is the set of models that could have generated the data, which is much larger than $\mathcal{Q}$. Such statistical models are key in checking that a given procedure is robust, i.e. that it does not produce catastrophic errors when its assumptions about the data are incorrect.
An example
Suppose that we have a population of children, with the ages of the children distributed uniformly in the population. The height of a child will be stochastically related to the age: e.g. when we know that a child is of age 7, this influences the chance of the child being 1.5 meters tall. We could formalize that relationship in a linear regression model, like this: $\text{height}_i = b_0 + b_1 \text{age}_i + \varepsilon_i$, where $b_0$ is the intercept, $b_1$ is a parameter that age is multiplied by to obtain a prediction of height, $\varepsilon_i$ is the error term, and $i$ identifies the child. This implies that height is predicted by age, with some error.
An admissible model must be consistent with all the data points. Thus, a straight line ($\text{height}_i = b_0 + b_1 \text{age}_i$) cannot be admissible for a model of the data, unless it exactly fits all the data points, i.e. all the data points lie perfectly on the line. The error term, $\varepsilon_i$, must be included in the equation, so that the model is consistent with all the data points. To do statistical inference, we would first need to assume some probability distributions for the $\varepsilon_i$. For instance, we might assume that the $\varepsilon_i$ distributions are i.i.d. Gaussian, with zero mean. In this instance, the model would have 3 parameters: $b_0$, $b_1$, and the variance of the Gaussian distribution. We can formally specify the model in the form $(S, \mathcal{P})$ as follows. The sample space, $S$, of our model comprises the set of all possible pairs (age, height). Each possible value of $\theta = (b_0, b_1, \sigma^2)$ determines a distribution on $S$; denote that distribution by $F_\theta$. If $\Theta$ is the set of all possible values of $\theta$, then $\mathcal{P} = \{F_\theta : \theta \in \Theta\}$. (The parameterization is identifiable, and this is easy to check.)
In this example, the model is determined by (1) specifying $S$ and (2) making some assumptions relevant to $\mathcal{P}$. There are two assumptions: that height can be approximated by a linear function of age; that errors in the approximation are distributed as i.i.d. Gaussian. The assumptions are sufficient to specify $\mathcal{P}$, as they are required to do.
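To make the role of the three parameters tangible, the following sketch draws (age, height) pairs from one member of the family and then recovers the parameters by maximum likelihood. The numerical values of the parameters, the age range of 2 to 16 years, the sample size, and the function names `simulate` and `fit` are all assumptions introduced for illustration; none of them come from the example above.

```python
import random

random.seed(0)

# Hypothetical parameter values theta = (b0, b1, sigma^2), chosen for illustration only.
B0, B1, SIGMA2 = 0.75, 0.06, 0.05 ** 2

def simulate(n):
    """Draw n (age, height) pairs from the distribution indexed by theta."""
    data = []
    for _ in range(n):
        age = random.uniform(2.0, 16.0)          # ages uniform (the range is assumed)
        eps = random.gauss(0.0, SIGMA2 ** 0.5)   # i.i.d. zero-mean Gaussian error
        data.append((age, B0 + B1 * age + eps))
    return data

def fit(data):
    """Maximum-likelihood estimates of (b0, b1, sigma^2) for the Gaussian linear model."""
    n = len(data)
    mean_age = sum(a for a, _ in data) / n
    mean_height = sum(h for _, h in data) / n
    sxy = sum((a - mean_age) * (h - mean_height) for a, h in data)
    sxx = sum((a - mean_age) ** 2 for a, _ in data)
    b1 = sxy / sxx
    b0 = mean_height - b1 * mean_age
    sigma2 = sum((h - b0 - b1 * a) ** 2 for a, h in data) / n  # MLE divides by n
    return b0, b1, sigma2

b0_hat, b1_hat, sigma2_hat = fit(simulate(500))
print(b0_hat, b1_hat, sigma2_hat)
```

With Gaussian errors the least-squares slope and intercept coincide with the maximum-likelihood estimates, which is why the closed-form formulas suffice here.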
General remarks
A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model specified via mathematical equations, some of the variables do not have specific values, but instead have probability distributions; i.e. some of the variables are stochastic. In the above example with children's heights, ε is a stochastic variable; without that stochastic variable, the model would be deterministic. Statistical models are often used even when the data-generating process being modeled is deterministic. For instance, coin tossing is, in principle, a deterministic process; yet it is commonly modeled as stochastic (via a Bernoulli process). Choosing an appropriate statistical model to represent a given data-generating process is sometimes extremely difficult, and may require knowledge of both the process and relevant statistical analyses. Relatedly, the statistician Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".
There are three purposes for a statistical model, according to Konishi & Kitagawa:
- Predictions
- Extraction of information
- Description of stochastic structures
Those three purposes are essentially the same as the three purposes indicated by Friendly & Meyer: prediction, estimation, description.
Dimension of a model
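Suppose that we have a statistical model $(S, \mathcal{P})$ with $\mathcal{P} = \{F_\theta : \theta \in \Theta\}$. In notation, we write that $\Theta \subseteq \mathbb{R}^k$ where k is a positive integer ($\mathbb{R}$ denotes the real numbers; other sets can be used, in principle). Here, k is called the dimension of the model. The model is said to be parametric if $\Theta$ has finite dimension. As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that
$$\mathcal{P} = \left\{ F_{\mu,\sigma}(x) \equiv \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) : \mu \in \mathbb{R},\ \sigma > 0 \right\}.$$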
In this example, the dimension, k, equals 2. As another example, suppose that the data consists of points (x, y) that we assume are distributed according to a straight line with i.i.d. Gaussian residuals (with zero mean): this leads to the same statistical model as was used in the example with children's heights. The dimension of the statistical model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals. (Note the set of all possible lines has dimension 2, even though geometrically, a line has dimension 1.)
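The univariate Gaussian family can be written as a function of a two-component parameter, which makes the dimension count explicit. This is a minimal sketch: the function name `gaussian_density` and the particular parameter value are hypothetical, and the standard-library `statistics.NormalDist` class is used only as a cross-check.

```python
import math
from statistics import NormalDist

def gaussian_density(x, theta):
    """Density at x of the univariate Gaussian indexed by theta = (mu, sigma), sigma > 0."""
    mu, sigma = theta
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

theta = (1.0, 2.0)   # one point of the parameter set, a subset of R^2
k = len(theta)       # dimension of the model
density = gaussian_density(0.5, theta)
reference = NormalDist(mu=1.0, sigma=2.0).pdf(0.5)   # standard-library cross-check
print(k, density, reference)
```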
Although formally $\theta \in \Theta$ is a single parameter that has dimension k, it is sometimes regarded as comprising k separate parameters. For example, with the univariate Gaussian distribution, $\theta$ is formally a single parameter with dimension 2, but it is often regarded as comprising 2 separate parameters: the mean and the standard deviation. A statistical model is nonparametric if the parameter set $\Theta$ is infinite dimensional. A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters. Formally, if k is the dimension of $\Theta$ and n is the number of samples, both semiparametric and nonparametric models have $k \rightarrow \infty$ as $n \rightarrow \infty$. If $k/n \rightarrow 0$ as $n \rightarrow \infty$, then the model is semiparametric; otherwise, the model is nonparametric.
Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".
Nested models
Two statistical models are nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. As an example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to get the zero-mean distributions. As a second example, the quadratic model
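- $y = b_0 + b_1 x + b_2 x^2 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$
has, nested within it, the linear model
- $y = b_0 + b_1 x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$
Here we constrain the parameter $b_2$ to equal 0.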
In both those examples, the first model has a higher dimension than the second model (for the first example, the zero-mean model has dimension 1). Such is often, but not always, the case. As an example where they have the same dimension, the set of positive-mean Gaussian distributions is nested within the set of all Gaussian distributions; they both have dimension 2.
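One practical consequence of nesting is that the constrained model can never achieve a higher maximised likelihood than the model that contains it. The sketch below checks this for the first example (all Gaussians versus zero-mean Gaussians). The simulated sample and the function name `max_loglik` are assumptions made for illustration.

```python
import math
import random

random.seed(1)
data = [random.gauss(0.3, 1.0) for _ in range(200)]   # hypothetical sample

def max_loglik(data, mean=None):
    """Maximised Gaussian log-likelihood; if `mean` is given, the mean is constrained to it."""
    n = len(data)
    mu = sum(data) / n if mean is None else mean
    sigma2 = sum((x - mu) ** 2 for x in data) / n      # MLE of the variance given mu
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

full = max_loglik(data)                    # all Gaussians: dimension 2
restricted = max_loglik(data, mean=0.0)    # zero-mean Gaussians: dimension 1
print(full, restricted, full >= restricted)
```

The inequality holds for any sample, because every distribution available to the zero-mean model is also available to the unconstrained one.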
Comparing models
Comparing statistical models is fundamental for much of statistical inference. Konishi & Kitagawa (2008, p. 75) state: "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling. They are typically formulated as comparisons of several statistical models." Common criteria for comparing models include the following: $R^2$, Bayes factor, Akaike information criterion, and the likelihood-ratio test together with its generalization, the relative likelihood.
Another way of comparing two statistical models is through the notion of deficiency introduced by Lucien Le Cam.
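Two of the criteria listed above can be sketched for the linear and quadratic models from the section on nested models. The code uses the standard definition of the Akaike information criterion, 2k minus twice the maximised log-likelihood, where k counts the polynomial coefficients plus the error variance. The simulated data and the helper names `solve` and `fit_polynomial` are hypothetical, and the small linear solver is a plain-Python stand-in for a numerical library.

```python
import math
import random

random.seed(2)

# Hypothetical data generated from a straight line with Gaussian noise.
xs = [i / 10 for i in range(100)]
ys = [1.0 + 0.5 * x + random.gauss(0.0, 0.3) for x in xs]

def solve(a, b):
    """Solve the linear system a @ coef = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns (coefficients, maximised log-likelihood, AIC)."""
    n, p = len(xs), degree + 1
    # Normal equations (X^T X) coef = X^T y, with design-matrix columns x^0 .. x^degree.
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(p)] for i in range(p)]
    xty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(p)]
    coef = solve(xtx, xty)
    rss = sum((y - sum(c * x ** i for i, c in enumerate(coef))) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / n                                   # MLE of the error variance
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    k = p + 1                                          # coefficients plus the variance
    return coef, loglik, 2 * k - 2 * loglik

_, loglik_lin, aic_lin = fit_polynomial(xs, ys, degree=1)
_, loglik_quad, aic_quad = fit_polynomial(xs, ys, degree=2)
lr_statistic = 2 * (loglik_quad - loglik_lin)          # likelihood-ratio statistic
print(aic_lin, aic_quad, lr_statistic)
```

By convention the model with the lower AIC is preferred. Because the quadratic model has exactly one more parameter than the linear one, the difference `aic_quad - aic_lin` equals `2 - lr_statistic`, so the two criteria are closely linked for nested models.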
See also
- All models are wrong
- Blockmodel
- Conceptual model
- Design of experiments
- Deterministic model
- Effective theory
- Predictive model
- Response modeling methodology
- SackSEER
- Scientific model
- Statistical inference
- Statistical model specification
- Statistical model validation
- Statistical theory
- Stochastic process
Notes
- Cox 2006, p. 178
- Adèr 2008, p. 280
- McCullagh 2002
- Cox 2006, p. 197
- Konishi & Kitagawa 2008, §1.1
- Friendly & Meyer 2016, §11.6
- Cox 2006, p. 2
- Le Cam, Lucien (1964). "Sufficiency and Approximate Sufficiency". Annals of Mathematical Statistics. 35 (4). Institute of Mathematical Statistics: 1429. doi:10.1214/aoms/1177700372.
References
- Adèr, H. J. (2008), "Modelling", in Adèr, H. J.; Mellenbergh, G. J. (eds.), Advising on Research Methods: A consultant's companion, Huizen, The Netherlands: Johannes van Kessel Publishing, pp. 271–304.
- Burnham, K. P.; Anderson, D. R. (2002), Model Selection and Multimodel Inference (2nd ed.), Springer-Verlag.
- Cox, D. R. (2006), Principles of Statistical Inference, Cambridge University Press.
- Friendly, M.; Meyer, D. (2016), Discrete Data Analysis with R, Chapman & Hall.
- Konishi, S.; Kitagawa, G. (2008), Information Criteria and Statistical Modeling, Springer.
- McCullagh, P. (2002), "What is a statistical model?", Annals of Statistics, 30 (5): 1225–1310, doi:10.1214/aos/1035844977.
Further reading
- Davison, A. C. (2008), Statistical Models, Cambridge University Press
- Drton, M.; Sullivant, S. (2007), "Algebraic statistical models", Statistica Sinica, 17: 1273–1297
- Freedman, D. A. (2009), Statistical Models, Cambridge University Press
- Helland, I. S. (2010), Steps Towards a Unified Basis for Scientific Models and Methods, World Scientific
- Kroese, D. P.; Chan, J. C. C. (2014), Statistical Modeling and Computation, Springer
- Shmueli, G. (2010), "To explain or to predict?", Statistical Science, 25 (3): 289–310, arXiv:1101.0891, doi:10.1214/10-STS330, S2CID 15900983