Statistics Formulas

Notation

Capitalization
In general, capital letters refer to population attributes (i.e., parameters); and lower-case letters refer to sample attributes (i.e., statistics). For example,
- P refers to a population proportion; and p, to a sample proportion.
- X refers to a set of population elements; and x, to a set of sample elements.
- N refers to population size; and n, to sample size.
Greek vs. Roman Letters
Like capital letters, Greek letters refer to population attributes. Their sample counterparts, however, are usually Roman letters. For example,
- μ refers to a population mean; and x̄, to a sample mean.
- σ refers to the standard deviation of a population; and s, to the standard deviation of a sample.
Population Parameters
- μ refers to a population mean.
- σ refers to the standard deviation of a population.
- σ² refers to the variance of a population.
- P refers to the proportion of population elements that have a particular attribute.
- Q refers to the proportion of population elements that do not have a particular attribute, so Q = 1 - P.
- ρ is the population correlation coefficient, based on all of the elements from a population.
- N is the number of elements in a population.
Sample Statistics
- x̄ refers to a sample mean.
- s refers to the standard deviation of a sample.
- s² refers to the variance of a sample.
- p refers to the proportion of sample elements that have a particular attribute.
- q refers to the proportion of sample elements that do not have a particular attribute, so q = 1 - p.
- r is the sample correlation coefficient, based on all of the elements from a sample.
- n is the number of elements in a sample.
Simple Linear Regression
- β0 is the intercept constant in a population regression line.
- β1 is the regression coefficient (i.e., slope) in a population regression line.
- R² refers to the coefficient of determination.
- b0 is the intercept constant in a sample regression line.
- b1 refers to the regression coefficient in a sample regression line (i.e., the slope).
- sb1 refers to the standard error of the slope of a regression line.
Probability
- P(A) refers to the probability that event A will occur.
- P(A|B) refers to the conditional probability that event A occurs, given that event B has occurred.
- P(A') refers to the probability of the complement of event A.
- P(A ∩ B) refers to the probability of the intersection of events A and B.
- P(A ∪ B) refers to the probability of the union of events A and B.
- E(X) refers to the expected value of random variable X.
- b(x; n, P) refers to binomial probability.
- b*(x; n, P) refers to negative binomial probability.
- g(x; P) refers to geometric probability.
- h(x; N, n, k) refers to hypergeometric probability.
Counting
- n! refers to the factorial value of n.
- nPr refers to the number of permutations of n things taken r at a time.
- nCr refers to the number of combinations of n things taken r at a time.
Set Theory
- A ∩ B refers to the intersection of events A and B.
- A ∪ B refers to the union of events A and B.
- {A, B, C} refers to the set of elements consisting of A, B, and C.
- ∅ or { } refers to the null set.
Hypothesis Testing
- H0 refers to a null hypothesis.
- H1 or Ha refers to an alternative hypothesis.
- α refers to the significance level.
- β refers to the probability of committing a Type II error.
Random Variables
- Z or z refers to a standardized score, also known as a z score.
- zα refers to the standardized score that has a cumulative probability equal to 1 - α.
- tα refers to the t score that has a cumulative probability equal to 1 - α.
- fα refers to an f statistic that has a cumulative probability equal to 1 - α.
- fα(v1, v2) is an f statistic with a cumulative probability of 1 - α, and v1 and v2 degrees of freedom.
- Χ² refers to a chi-square statistic.
Special Symbols
- Σ is the summation symbol, used to compute sums over a range of values.
- Σx or Σxi refers to the sum of a set of n observations. Thus, Σxi = Σx = x1 + x2 + . . . + xn.
- sqrt refers to the square root function. Thus, sqrt(4) = 2 and sqrt(25) = 5.
- Var(X) refers to the variance of the random variable X.
- SD(X) refers to the standard deviation of the random variable X.
- SE refers to the standard error of a statistic.
- ME refers to the margin of error.
- DF refers to the degrees of freedom.

Formulas
Parameters
- Population mean = μ = ( Σ Xi ) / N
- Population standard deviation = σ = sqrt [ Σ ( Xi - μ )² / N ]
- Population variance = σ² = Σ ( Xi - μ )² / N
- Variance of population proportion = σP² = PQ / n
- Standardized score = Z = (X - μ) / σ
- Population correlation coefficient = ρ = [ 1 / N ] * Σ { [ (Xi - μX) / σx ] * [ (Yi - μY) / σy ] }
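As a rough illustration of the parameter formulas above, the following Python sketch computes a population mean, variance, and standard deviation, and then a standardized score (the deviation from μ divided by σ). The data set and the function names are hypothetical, chosen only for this example.

```python
import math

def population_mean(values):
    # mu = (sum of Xi) / N
    return sum(values) / len(values)

def population_variance(values):
    # sigma^2 = sum of (Xi - mu)^2, divided by N (not N - 1)
    mu = population_mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def population_sd(values):
    # sigma = sqrt(sigma^2)
    return math.sqrt(population_variance(values))

def z_score(x, mu, sigma):
    # Z = (X - mu) / sigma
    return (x - mu) / sigma

# Hypothetical population of N = 8 elements.
population = [2, 4, 4, 4, 5, 5, 7, 9]
mu = population_mean(population)
sigma = population_sd(population)
z_for_nine = z_score(9, mu, sigma)
print(mu, sigma, z_for_nine)
```

For this data set the mean is 5 and the standard deviation is 2, so the element 9 lies two standard deviations above the mean.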

Statistics
Unless otherwise noted, these formulas assume simple random sampling.
- Sample mean = x̄ = ( Σ xi ) / n
- Sample standard deviation = s = sqrt [ Σ ( xi - x̄ )² / ( n - 1 ) ]
- Sample variance = s² = Σ ( xi - x̄ )² / ( n - 1 )
- Variance of sample proportion = sp² = pq / (n - 1)
- Pooled sample proportion = p = (p1 * n1 + p2 * n2) / (n1 + n2)
- Pooled sample standard deviation = sp = sqrt { [ (n1 - 1) * s1² + (n2 - 1) * s2² ] / (n1 + n2 - 2) }
- Sample correlation coefficient = r = [ 1 / (n - 1) ] * Σ { [ (xi - x̄) / sx ] * [ (yi - ȳ) / sy ] }
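The sample formulas differ from the population ones mainly in the n - 1 divisor. A minimal Python sketch, assuming two small hypothetical samples x and y and illustrative function names, might look like this:

```python
import math

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    # Divides by n - 1, unlike the population variance.
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    return math.sqrt(sample_variance(xs))

def sample_correlation(xs, ys):
    # r = [1 / (n - 1)] * sum of standardized x times standardized y
    n = len(xs)
    mx, my = sample_mean(xs), sample_mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    total = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return total / (n - 1)

def pooled_sd(s1, n1, s2, n2):
    # sqrt{ [ (n1 - 1) s1^2 + (n2 - 1) s2^2 ] / (n1 + n2 - 2) }
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = sample_correlation(x, y)
print(sample_mean(x), sample_variance(x), r)
```

When both samples have the same standard deviation, the pooled value equals that common standard deviation, which is a quick sanity check on the bracket placement.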

Simple Linear Regression
- Simple linear regression line: ŷ = b0 + b1x
- Regression coefficient = b1 = Σ [ (xi - x̄) (yi - ȳ) ] / Σ [ (xi - x̄)² ]
- Regression intercept = b0 = ȳ - b1 * x̄
- Regression coefficient = b1 = r * (sy / sx)
- Standard error of regression slope = sb1 = sqrt [ Σ(yi - ŷi)² / (n - 2) ] / sqrt [ Σ(xi - x̄)² ]
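The regression formulas can be sketched in a few lines of Python. This is only an illustration on hypothetical data; the function names fit_line and slope_standard_error are not from any library.

```python
import math

def fit_line(xs, ys):
    # b1 = sum (xi - x_bar)(yi - y_bar) / sum (xi - x_bar)^2
    # b0 = y_bar - b1 * x_bar
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def slope_standard_error(xs, ys):
    # sb1 = sqrt[ sum (yi - y_hat_i)^2 / (n - 2) ] / sqrt[ sum (xi - x_bar)^2 ]
    n = len(xs)
    b0, b1 = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
sb1 = slope_standard_error(x, y)
print(b0, b1, sb1)
```

For this data the fitted line is approximately ŷ = 2.2 + 0.6x, and the same slope results from r * (sy / sx).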

Counting
- n factorial: n! = n * (n-1) * (n - 2) * . . . * 3 * 2 * 1. By convention, 0! = 1.
- Permutations of n things, taken r at a time: nPr = n! / (n - r)!
- Combinations of n things, taken r at a time: nCr = n! / [ r! (n - r)! ] = nPr / r!
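The counting formulas translate directly into code. The sketch below builds nPr and nCr from factorials; the helper names n_p_r and n_c_r are made up for this example. Python 3.8 and later also provide math.perm and math.comb, which should give the same results.

```python
import math

def n_p_r(n, r):
    # Permutations: n! / (n - r)!
    return math.factorial(n) // math.factorial(n - r)

def n_c_r(n, r):
    # Combinations: n! / [ r! (n - r)! ]
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

# Choosing 2 items out of 5.
print(n_p_r(5, 2))  # ordered selections
print(n_c_r(5, 2))  # unordered selections
```

With n = 5 and r = 2 there are 20 permutations and 10 combinations, and nPr / r! recovers nCr.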

Probability
- Rule of addition: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
- Rule of multiplication: P(A ∩ B) = P(A) P(B|A)
- Rule of subtraction: P(A') = 1 - P(A)
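One way to check the three rules is to enumerate a small sample space. The sketch below assumes a standard 52-card deck, with A as "the card is a heart" and B as "the card is a face card"; these events and the helper prob are illustrative choices, not part of the formulas themselves.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(ranks, suits))  # 52 equally likely outcomes

def prob(event):
    # Probability of an event as (favorable outcomes) / (all outcomes).
    return Fraction(len(event), len(deck))

A = {card for card in deck if card[1] == "hearts"}
B = {card for card in deck if card[0] in {"J", "Q", "K"}}

# Rule of addition: P(A or B) = P(A) + P(B) - P(A and B)
p_union = prob(A) + prob(B) - prob(A & B)

# Rule of multiplication: P(A and B) = P(A) * P(B|A)
p_b_given_a = Fraction(len(A & B), len(A))
p_intersection = prob(A) * p_b_given_a

# Rule of subtraction: P(A') = 1 - P(A)
p_not_a = 1 - prob(A)

print(p_union, p_intersection, p_not_a)
```

Using exact fractions avoids rounding, so each rule can be compared directly against the probability counted from the deck.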

Random Variables
In the following formulas, X and Y are random variables, and a and b are constants.
- Expected value of X = E(X) = μx = Σ [ xi * P(xi) ]
- Variance of X = Var(X) = σ² = Σ [ xi - E(X) ]² * P(xi) = Σ [ xi - μx ]² * P(xi)
- Normal random variable = z-score = z = (X - μ)/σ
- Chi-square statistic = Χ² = [ ( n - 1 ) * s² ] / σ²
- f statistic = f = [ s1²/σ1² ] / [ s2²/σ2² ]
- Expected value of sum of random variables = E(X + Y) = E(X) + E(Y)
- Expected value of difference between random variables = E(X - Y) = E(X) - E(Y)
- Variance of the sum of independent random variables = Var(X + Y) = Var(X) + Var(Y)
- Variance of the difference between independent random variables = Var(X - Y) = Var(X) + Var(Y)
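The expectation and variance rules can be verified by brute force on a small discrete distribution. The sketch below assumes two independent fair six-sided dice (a hypothetical example) and uses exact fractions; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Distribution of one fair die: value -> probability.
die = {face: Fraction(1, 6) for face in range(1, 7)}

def expected_value(dist):
    # E(X) = sum of xi * P(xi)
    return sum(x * p for x, p in dist.items())

def variance(dist):
    # Var(X) = sum of (xi - E(X))^2 * P(xi)
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def combine(dist_x, dist_y, op):
    # Distribution of op(X, Y), assuming X and Y are independent.
    out = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        value = op(x, y)
        out[value] = out.get(value, Fraction(0)) + px * py
    return out

total = combine(die, die, lambda x, y: x + y)
diff = combine(die, die, lambda x, y: x - y)

print(expected_value(die), variance(die))
print(expected_value(total), variance(total))
print(expected_value(diff), variance(diff))
```

Note that the variance of the difference equals the variance of the sum: for independent variables the variances add in both cases.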

Sampling Distributions
- Mean of sampling distribution of the mean = μx̄ = μ
- Mean of sampling distribution of the proportion = μp = P
- Standard deviation of proportion = σp = sqrt[ P * (1 - P)/n ] = sqrt( PQ / n )
- Standard deviation of the mean = σx̄ = σ/sqrt(n)
- Standard deviation of difference of sample means = σd = sqrt[ (σ1² / n1) + (σ2² / n2) ]
- Standard deviation of difference of sample proportions = σd = sqrt{ [P1(1 - P1) / n1] + [P2(1 - P2) / n2] }

Standard Error
- Standard error of proportion = SEp = sp = sqrt[ p * (1 - p)/n ] = sqrt( pq / n )
- Standard error of difference for proportions = SEp = sp = sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] }
- Standard error of the mean = SEx̄ = sx̄ = s/sqrt(n)
- Standard error of difference of sample means = SEd = sd = sqrt[ (s1² / n1) + (s2² / n2) ]
- Standard error of difference of paired sample means = SEd = sd = { sqrt [ Σ(di - d̄)² / (n - 1) ] } / sqrt(n)
- Pooled sample standard error = spooled = sqrt { [ (n1 - 1) * s1² + (n2 - 1) * s2² ] / (n1 + n2 - 2) }
- Standard error of difference of sample proportions = sd = sqrt{ [p1(1 - p1) / n1] + [p2(1 - p2) / n2] }
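The standard error formulas share one pattern: a sample-based estimate of spread divided by a function of the sample size. A minimal Python sketch, with hypothetical function names and inputs, could be:

```python
import math

def se_proportion(p, n):
    # sqrt[ p * (1 - p) / n ]
    return math.sqrt(p * (1 - p) / n)

def se_mean(s, n):
    # s / sqrt(n)
    return s / math.sqrt(n)

def se_diff_means(s1, n1, s2, n2):
    # sqrt[ (s1^2 / n1) + (s2^2 / n2) ]
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def se_diff_proportions(p1, n1, p2, n2):
    # sqrt{ [p1(1 - p1) / n1] + [p2(1 - p2) / n2] }
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def se_paired(differences):
    # Standard deviation of the paired differences, divided by sqrt(n).
    n = len(differences)
    d_bar = sum(differences) / n
    sd = math.sqrt(sum((d - d_bar) ** 2 for d in differences) / (n - 1))
    return sd / math.sqrt(n)

print(se_mean(10, 25))
print(se_proportion(0.5, 100))
print(se_diff_means(3, 9, 4, 16))
print(se_paired([1, 2, 3]))
```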

Discrete Probability Distributions
- Binomial formula: P(X = x) = b(x; n, P) = nCx * P^x * (1 - P)^(n - x) = nCx * P^x * Q^(n - x)
- Mean of binomial distribution = μx = n * P
- Variance of binomial distribution = σx² = n * P * ( 1 - P )
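- Negative binomial formula: P(X = x) = b*(x; r, P) = (x - 1)C(r - 1) * P^r * (1 - P)^(x - r)
- Mean of negative binomial distribution = μx = r / P, where x counts trials as in the formula above (the expected number of failures is rQ / P)
- Variance of negative binomial distribution = σx² = r * Q / P²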
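- Geometric formula: P(X = x) = g(x; P) = P * Q^(x - 1)
- Mean of geometric distribution = μx = 1 / P, where x counts trials up to and including the first success (the expected number of failures is Q / P)
- Variance of geometric distribution = σx² = Q / P²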
- Hypergeometric formula: P(X = x) = h(x; N, n, k) = [ kCx ] [ (N - k)C(n - x) ] / [ NCn ]
- Mean of hypergeometric distribution = μx = n * k / N
- Variance of hypergeometric distribution = σx² = n * k * ( N - k ) * ( N - n ) / [ N² * ( N - 1 ) ]
- Poisson formula: P(x; μ) = (e^(-μ)) (μ^x) / x!
- Mean of Poisson distribution = μx = μ
- Variance of Poisson distribution = σx² = μ
- Multinomial formula: P = [ n! / ( n1! * n2! * . . . * nk! ) ] * ( p1^n1 * p2^n2 * . . . * pk^nk )
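The probability formulas for the discrete distributions can be written as short functions. The sketch below uses math.comb (Python 3.8+) for the combination counts; the function names and the example arguments are illustrative only. In the negative binomial and geometric functions, x is taken to be the number of trials, matching the formulas above.

```python
import math

def binomial_pmf(x, n, p):
    # nCx * P^x * (1 - P)^(n - x)
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def negative_binomial_pmf(x, r, p):
    # Probability that the r-th success occurs on trial x.
    return math.comb(x - 1, r - 1) * p ** r * (1 - p) ** (x - r)

def geometric_pmf(x, p):
    # Probability that the first success occurs on trial x.
    return p * (1 - p) ** (x - 1)

def hypergeometric_pmf(x, N, n, k):
    # x successes in a sample of n drawn without replacement from a
    # population of N elements that contains k successes.
    return math.comb(k, x) * math.comb(N - k, n - x) / math.comb(N, n)

def poisson_pmf(x, mu):
    # e^(-mu) * mu^x / x!
    return math.exp(-mu) * mu ** x / math.factorial(x)

print(binomial_pmf(2, 4, 0.5))
print(negative_binomial_pmf(3, 2, 0.5))
print(geometric_pmf(3, 0.5))
print(hypergeometric_pmf(1, 52, 5, 4))
print(poisson_pmf(0, 2.0))
```

Summing x * binomial_pmf(x, n, p) over x = 0, ..., n gives n * P, which is a simple numerical check of the binomial mean.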

Linear Transformations
For the following formulas, assume that Y is a linear transformation of the random variable X, defined by the equation: Y = aX + b.
- Mean of a linear transformation = E(Y) = Ȳ = aX̄ + b.
- Variance of a linear transformation = Var(Y) = a² * Var(X).
- Standardized score = z = (x - μx) / σx.
- t-score = t = (x̄ - μx) / [ s/sqrt(n) ].
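A small numerical check of the linear transformation rules, assuming a Celsius-to-Fahrenheit conversion (a = 1.8, b = 32) applied to a hypothetical set of three equally likely temperatures:

```python
import statistics

a, b = 1.8, 32.0
celsius = [10.0, 20.0, 30.0]               # values of X
fahrenheit = [a * c + b for c in celsius]  # values of Y = aX + b

mean_x = statistics.fmean(celsius)
var_x = statistics.pvariance(celsius)      # population variance of X
mean_y = statistics.fmean(fahrenheit)
var_y = statistics.pvariance(fahrenheit)   # population variance of Y

# Compare the directly computed values with the transformation rules.
print(mean_y, a * mean_x + b)
print(var_y, a ** 2 * var_x)
```

The additive constant b shifts the mean but has no effect on the variance, which scales by a² only.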

Estimation
- Confidence interval: Sample statistic ± Critical value * Standard error of statistic
- Margin of error = (Critical value) * (Standard deviation of statistic)
- Margin of error = (Critical value) * (Standard error of statistic)
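As an illustration of the estimation formulas, the sketch below builds a confidence interval for a mean from a z critical value. It assumes a sample large enough for the normal approximation; with small samples a t critical value would normally be used instead. The function name and the sample figures are hypothetical.

```python
import math
from statistics import NormalDist

def z_confidence_interval(statistic, standard_error, confidence=0.95):
    # Critical value: z score with cumulative probability 1 - alpha/2.
    alpha = 1 - confidence
    critical_value = NormalDist().inv_cdf(1 - alpha / 2)
    # Margin of error = critical value * standard error.
    margin = critical_value * standard_error
    return statistic - margin, statistic + margin, margin

# Hypothetical sample: mean 50, standard deviation 10, n = 100.
x_bar, s, n = 50.0, 10.0, 100
low, high, margin = z_confidence_interval(x_bar, s / math.sqrt(n))
print(low, high, margin)
```

With these numbers the standard error is 1, so the margin of error is simply the critical value, roughly 1.96.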

Hypothesis Testing
- Standardized test statistic = (Statistic - Parameter) / (Standard deviation of statistic)
- One-sample z-test for proportions: z-score = z = (p - P0) / sqrt[ P0 * (1 - P0) / n ]
- Two-sample z-test for proportions: z-score = z = [ (p1 - p2) - d ] / SE
- One-sample t-test for means: t-score = t = (x̄ - μ) / SE
- Two-sample t-test for means: t-score = t = [ (x̄1 - x̄2) - d ] / SE
- Matched-sample t-test for means: t-score = t = [ (x̄1 - x̄2) - D ] / SE = (d̄ - D) / SE
- Chi-square test statistic = Χ² = Σ[ (Observed - Expected)² / Expected ]
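Several of the test statistics above follow the same "(statistic - parameter) / standard error" pattern. The following Python sketch computes a one-sample t-score, a two-sample t-score with an unpooled standard error, and a chi-square statistic. All function names and input values are hypothetical.

```python
import math

def one_sample_t(x_bar, mu0, s, n):
    # t = (x_bar - mu) / SE, with SE = s / sqrt(n)
    return (x_bar - mu0) / (s / math.sqrt(n))

def two_sample_t(x_bar1, x_bar2, s1, n1, s2, n2, d=0.0):
    # t = [ (x_bar1 - x_bar2) - d ] / SE,
    # with SE = sqrt(s1^2 / n1 + s2^2 / n2) (unpooled)
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return ((x_bar1 - x_bar2) - d) / se

def chi_square_statistic(observed, expected):
    # Sum of (Observed - Expected)^2 / Expected over all categories.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(one_sample_t(52, 50, 10, 25))
print(two_sample_t(12, 10, 3, 9, 4, 16))
print(chi_square_statistic([30, 20], [25, 25]))
```

Turning any of these statistics into a P-value additionally requires the appropriate distribution and degrees of freedom, which this sketch does not cover.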

Degrees of Freedom

The correct formula for degrees of freedom (DF) depends on the situation (the nature of the test statistic, the number of samples, underlying assumptions, etc.).

- One-sample t-test: DF = n - 1
- Two-sample t-test: DF = (s1²/n1 + s2²/n2)² / { [ (s1² / n1)² / (n1 - 1) ] + [ (s2² / n2)² / (n2 - 1) ] }
- Two-sample t-test, pooled standard error: DF = n1 + n2 - 2
- Simple linear regression, test slope: DF = n - 2
- Chi-square goodness of fit test: DF = k - 1
- Chi-square test for homogeneity: DF = (r - 1) * (c - 1)
- Chi-square test for independence: DF = (r - 1) * (c - 1)
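The unpooled two-sample formula is the easiest one to get wrong by hand, so a short sketch may help. The function names below are illustrative, and the inputs are hypothetical sample standard deviations and sizes.

```python
def two_sample_df(s1, n1, s2, n2):
    # DF = (s1^2/n1 + s2^2/n2)^2 /
    #      { (s1^2/n1)^2 / (n1 - 1) + (s2^2/n2)^2 / (n2 - 1) }
    v1 = s1 ** 2 / n1
    v2 = s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

def pooled_df(n1, n2):
    # Two-sample t-test with pooled standard error.
    return n1 + n2 - 2

def contingency_table_df(rows, columns):
    # Chi-square tests for homogeneity and independence.
    return (rows - 1) * (columns - 1)

print(two_sample_df(2, 10, 2, 10))
print(two_sample_df(2, 10, 5, 20))
print(pooled_df(10, 20))
print(contingency_table_df(3, 4))
```

When both samples have equal size and equal standard deviation, the unpooled formula reduces to n1 + n2 - 2; otherwise it generally yields a smaller, non-integer value.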

Sample Size

Below, the first two formulas find the smallest sample size required to achieve a fixed margin of error, using simple random sampling. The third formula allocates the sample to strata, based on a proportionate design. The fourth formula, Neyman allocation, uses stratified sampling to minimize variance, given a fixed sample size. The last formula, optimum allocation, uses stratified sampling to minimize variance, given a fixed budget.

- Mean (simple random sampling): n = { z² * σ² * [ N / (N - 1) ] } / { ME² + [ z² * σ² / (N - 1) ] }
- Proportion (simple random sampling): n = [ ( z² * p * q ) + ME² ] / [ ME² + z² * p * q / N ]
- Proportionate stratified sampling: nh = ( Nh / N ) * n
- Neyman allocation (stratified sampling): nh = n * ( Nh * σh ) / [ Σ ( Ni * σi ) ]
- Optimum allocation (stratified sampling): nh = n * [ ( Nh * σh ) / sqrt( ch ) ] / Σ [ ( Ni * σi ) / sqrt( ci ) ]
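The sample size formulas can be sketched as follows. The function names, the rounding up to a whole number, and all input values (critical value, population size, stratum sizes, standard deviations, and costs) are assumptions made for this example.

```python
import math

def sample_size_mean(z, sigma, me, N):
    # n = { z^2 sigma^2 [N / (N - 1)] } / { ME^2 + z^2 sigma^2 / (N - 1) }
    numerator = z ** 2 * sigma ** 2 * (N / (N - 1))
    denominator = me ** 2 + z ** 2 * sigma ** 2 / (N - 1)
    return math.ceil(numerator / denominator)  # round up to a whole element

def sample_size_proportion(z, p, me, N):
    # n = [ z^2 p q + ME^2 ] / [ ME^2 + z^2 p q / N ]
    q = 1 - p
    return math.ceil((z ** 2 * p * q + me ** 2) / (me ** 2 + z ** 2 * p * q / N))

def proportionate_allocation(n, sizes):
    # nh = (Nh / N) * n
    N = sum(sizes)
    return [n * Nh / N for Nh in sizes]

def neyman_allocation(n, sizes, sds):
    # nh = n * (Nh * sigma_h) / sum(Ni * sigma_i)
    weights = [Nh * sd for Nh, sd in zip(sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

def optimum_allocation(n, sizes, sds, costs):
    # nh = n * [Nh * sigma_h / sqrt(ch)] / sum[Ni * sigma_i / sqrt(ci)]
    weights = [Nh * sd / math.sqrt(c) for Nh, sd, c in zip(sizes, sds, costs)]
    total = sum(weights)
    return [n * w / total for w in weights]

sizes = [500, 300, 200]   # hypothetical stratum sizes Nh
sds = [10, 20, 30]        # hypothetical stratum standard deviations
print(sample_size_mean(1.96, 15, 2, 10000))
print(sample_size_proportion(1.96, 0.5, 0.05, 10000))
print(proportionate_allocation(100, sizes))
print(neyman_allocation(170, sizes, sds))
print(optimum_allocation(170, sizes, sds, [1, 1, 1]))
```

When every stratum has the same cost per element, optimum allocation gives the same result as Neyman allocation, since the cost terms cancel.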