Definitions of key terms in statistics, probability, and data analysis.

Mean (μ / x̄)
The arithmetic average of a set of values, calculated by summing all values and dividing by the count. It represents the central tendency of the data and is the most commonly used measure of center.
Median
The middle value in a sorted data set. If there is an even number of values, the median is the average of the two middle values. It is resistant to outliers and is preferred over the mean for skewed distributions.

Mode
The value that appears most frequently in a data set. A data set can have one mode (unimodal), multiple modes (multimodal), or no mode at all. The mode is the only measure of center applicable to nominal categorical data.
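As a quick illustration, the three measures of center defined above can be computed with Python's standard statistics module (a minimal sketch; the data values are made up):

    import statistics

    data = [2, 3, 3, 5, 7, 10]

    print(statistics.mean(data))    # 5.0 -> sum of values divided by the count
    print(statistics.median(data))  # 4.0 -> average of the two middle values (3 and 5)
    print(statistics.mode(data))    # 3   -> the most frequent value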
Weighted Mean
An average in which each data point is multiplied by a weight reflecting its relative importance, then summed and divided by the total weight. Used when observations contribute unequally to the result.

Geometric Mean
The nth root of the product of n positive values. Used for averaging rates, ratios, and percentage changes; always less than or equal to the arithmetic mean for positive data.
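One possible sketch of both averages in Python (the grading scenario and growth figures are made up; statistics.geometric_mean requires Python 3.8+):

    import statistics

    # Weighted mean: course grade where the exam counts twice as much as each quiz
    values  = [80, 90, 70]          # quiz, quiz, exam
    weights = [1, 1, 2]
    weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)
    print(weighted_mean)            # 77.5

    # Geometric mean: average growth factor over three years (+10%, +20%, -5%)
    factors = [1.10, 1.20, 0.95]
    print(statistics.geometric_mean(factors))   # ~1.078, i.e. about +7.8% per year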
Expected Value (E(X))
The long-run average value of a random variable, computed as the probability-weighted sum of all possible outcomes. Symbolized E(X) and equal to the population mean μ.

Standard Deviation (σ / s)
A measure of the amount of variation or dispersion in a set of values. It is the square root of the variance and is expressed in the same units as the original data, making it the most interpretable measure of spread.

Variance (σ² / s²)
The average of the squared differences from the mean. Variance quantifies the degree of spread in a data set and is the square of the standard deviation. Useful in mathematical operations but harder to interpret than σ.
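For example, both the population and sample versions of these spread measures are available in the standard library (illustrative data; the sample versions divide by n − 1, the population versions by n):

    import statistics

    data = [2, 4, 4, 4, 5, 5, 7, 9]

    print(statistics.pstdev(data))     # 2.0   population standard deviation (divide by n)
    print(statistics.pvariance(data))  # 4.0   population variance = pstdev squared
    print(statistics.stdev(data))      # ~2.14 sample standard deviation (divide by n - 1)
    print(statistics.variance(data))   # ~4.57 sample variance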
Range
The difference between the largest and smallest values in a data set. While simple to calculate, it only considers the two extreme values and is highly sensitive to outliers.

Interquartile Range (IQR)
The difference between the 75th percentile (Q3) and the 25th percentile (Q1). The IQR measures the spread of the middle 50% of data and is resistant to outliers, making it ideal for skewed distributions.
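A small sketch of the IQR in Python (statistics.quantiles requires Python 3.8+; different quartile conventions can give slightly different cut points):

    import statistics

    data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

    q1, q2, q3 = statistics.quantiles(data, n=4)   # default "exclusive" method
    print(q1, q2, q3)        # 4.5 10.0 15.5
    print(q3 - q1)           # 11.0 -> interquartile range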
Mean Absolute Deviation
The average of the absolute differences between each data point and the mean. Less affected by extreme values than variance because deviations are not squared.

Coefficient of Variation (CV)
The ratio of the standard deviation to the mean, expressed as a percentage (CV = σ/μ × 100%). It allows comparison of variability between data sets with different units or scales.

Relative Standard Deviation (RSD)
Another name for the coefficient of variation expressed as a percentage. Common in analytical chemistry and quality control to express precision.

Standard Error (SE)
The standard deviation of the sampling distribution of a statistic, most commonly the mean. SE = σ/√n, decreasing as sample size increases. Used to construct confidence intervals.
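For instance, the standard error of the mean is simply the sample standard deviation divided by the square root of n (a minimal sketch with made-up measurements):

    import math
    import statistics

    data = [12, 15, 14, 10, 13, 16, 11, 14]

    s = statistics.stdev(data)              # sample standard deviation
    se = s / math.sqrt(len(data))           # standard error of the mean
    print(round(s, 3), round(se, 3))        # 2.031 0.718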
Pooled Standard Deviation
A combined estimate of spread from two or more groups, computed as the square root of the degrees-of-freedom-weighted average of the group variances, assuming equal population variances. Used in two-sample t-tests and effect size calculations.

Geometric Standard Deviation
A measure of spread for log-normally distributed data, defined as the exponential of the standard deviation of the logarithms. Always greater than or equal to 1.

Normal Distribution
A symmetric, bell-shaped probability distribution where the mean, median, and mode are all equal. Defined by two parameters: the mean (μ) and standard deviation (σ). Many natural phenomena follow an approximately normal distribution.

Standard Normal Distribution
A normal distribution with mean 0 and standard deviation 1, denoted N(0, 1). Any normal variable can be standardized to this distribution using a z-score transformation.

Empirical Rule (68–95–99.7 Rule)
For normally distributed data, approximately 68% of values fall within ±1σ, 95% within ±2σ, and 99.7% within ±3σ of the mean. A quick way to assess data spread.
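The empirical rule is easy to check by simulation (an illustrative sketch; the exact proportions vary slightly from run to run):

    import random

    random.seed(0)
    mu, sigma = 100, 15
    data = [random.gauss(mu, sigma) for _ in range(100_000)]

    for k in (1, 2, 3):
        within = sum(mu - k * sigma <= x <= mu + k * sigma for x in data) / len(data)
        print(f"within ±{k}σ: {within:.3f}")   # roughly 0.683, 0.954, 0.997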
Log-Normal Distribution
A continuous probability distribution of a variable whose logarithm is normally distributed. Common for variables that are products of many positive factors, such as income or stock prices.

t-Distribution
A symmetric, bell-shaped distribution similar to the normal but with heavier tails. Used for inference about the mean when the population standard deviation is unknown and the sample size is small.

Chi-Square Distribution (χ²)
A right-skewed distribution that arises in tests of independence, goodness of fit, and variance. Defined by its degrees of freedom k; it is the distribution of the sum of k squared independent standard normal variables.

F-Distribution
The distribution of the ratio of two independent chi-square variables, each divided by its degrees of freedom. Used in ANOVA and tests comparing variances.

Uniform Distribution
A distribution in which every outcome in a given range is equally likely. May be discrete (e.g., a fair die) or continuous (e.g., a random number between 0 and 1).

Binomial Distribution
The discrete probability distribution of the number of successes in n independent yes/no trials, each with the same success probability p. Parameters: n and p.
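Binomial probabilities follow directly from the definition, P(X = k) = C(n, k) * p^k * (1 − p)^(n − k). A small sketch using math.comb (Python 3.8+), with a made-up coin-flipping example:

    from math import comb

    n, p = 10, 0.5                      # 10 fair coin flips

    def binom_pmf(k):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    print(binom_pmf(5))                            # 0.24609375 -> P(exactly 5 heads)
    print(sum(binom_pmf(k) for k in range(11)))    # 1.0 -> all probabilities sum to 1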
Poisson Distribution
A discrete distribution expressing the probability of a given number of events occurring in a fixed interval of time or space, given a constant mean rate λ.

Exponential Distribution
A continuous probability distribution describing the time between events in a Poisson process. Memoryless and parameterized by rate λ.

Probability Distribution
A mathematical function that gives the probabilities of occurrence of different possible outcomes. Can be discrete (described by a probability mass function, PMF) or continuous (described by a probability density function, PDF).

Central Limit Theorem (CLT)
States that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the population's distribution. Foundational to statistical inference.
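A quick simulation illustrates the theorem: sample means drawn from a strongly skewed (exponential) population still cluster symmetrically around the population mean, with spread close to σ/√n (an illustrative sketch; exact numbers depend on the random seed):

    import random
    import statistics

    random.seed(1)
    n = 50                                   # size of each sample
    sample_means = [
        statistics.mean(random.expovariate(1.0) for _ in range(n))
        for _ in range(10_000)
    ]

    # Exponential(rate=1) has mean 1 and standard deviation 1,
    # so the sample means should average ~1 with spread ~1/sqrt(50) ≈ 0.14.
    print(statistics.mean(sample_means))     # ≈ 1.0
    print(statistics.stdev(sample_means))    # ≈ 0.14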
Bernoulli Trial
A single experiment with exactly two possible outcomes, success or failure, with a fixed probability p of success. The building block of the binomial distribution.

Hypothesis Testing
A statistical method for making decisions about a population using sample data. Involves comparing a test statistic to a critical value, or a p-value to a significance level, to decide whether to reject the null hypothesis.

Null Hypothesis (H₀)
A default statement that there is no effect or no difference. Researchers seek evidence to reject the null in favor of the alternative hypothesis.

Alternative Hypothesis (H₁)
The statement that contradicts the null hypothesis, typically representing the effect or difference the researcher hopes to demonstrate. Can be one-sided or two-sided.

p-Value
The probability of observing a test statistic at least as extreme as the one actually obtained, assuming the null hypothesis is true. Smaller p-values provide stronger evidence against the null hypothesis.

Significance Level (α)
The threshold probability at which a researcher rejects the null hypothesis, commonly set at 0.05. Equal to the probability of a Type I error.

Type I Error
Rejecting the null hypothesis when it is actually true (a false positive). Its probability equals the significance level α.

Type II Error
Failing to reject the null hypothesis when it is actually false (a false negative). Its probability is denoted β; statistical power equals 1 − β.

Statistical Power
The probability that a hypothesis test correctly rejects a false null hypothesis (1 − β). Higher with larger samples, larger effect sizes, and higher α.

Confidence Interval (CI)
A range of values, computed from sample data, likely to contain the true population parameter with a specified level of confidence (e.g., 95%). Wider intervals indicate less precision.

Confidence Level
The long-run proportion of confidence intervals (built from repeated samples) that contain the true parameter. Common values: 90%, 95%, 99%.

Margin of Error
The half-width of a confidence interval, equal to the critical value times the standard error. Indicates the maximum expected sampling error.
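As a rough illustration of the last two entries, a large-sample 95% confidence interval for a mean is the sample mean plus or minus 1.96 standard errors (made-up data; with only 12 observations a t critical value of about 2.2 would strictly be more appropriate than the normal-based 1.96):

    import math
    import statistics

    data = [12, 15, 14, 10, 13, 16, 11, 14, 13, 15, 12, 14]   # made-up measurements

    mean = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(len(data))
    margin = 1.96 * se                       # margin of error at ~95% confidence
    print(f"{mean:.2f} ± {margin:.2f}")      # 13.25 ± 1.00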
t-Test
A hypothesis test using the t-distribution to compare means. Variants include one-sample, independent two-sample, and paired t-tests.

ANOVA
Analysis of Variance. A family of methods that compare means across three or more groups by partitioning total variability into between-group and within-group components.

Effect Size
A quantitative measure of the magnitude of a phenomenon, independent of sample size. Common measures include Cohen's d, Pearson's r, and odds ratios.

Correlation Coefficient
A value between −1 and 1 that measures the strength and direction of the linear relationship between two variables. Values near ±1 indicate a strong linear relationship; 0 indicates no linear association.

Pearson Correlation (r)
The most common correlation coefficient, measuring the linear relationship between two continuous variables. Sensitive to outliers and assumes interval-level measurement.

Spearman Rank Correlation (ρ)
A non-parametric correlation based on the ranks of the data rather than raw values. Measures the strength of any monotonic relationship and is robust to outliers.

Covariance
A measure of the joint variability of two random variables. Positive when variables tend to move together, negative when they move in opposite directions; the standardized form is the Pearson correlation.
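A hand-rolled sketch of covariance and the Pearson correlation with made-up data (Python 3.10+ also provides statistics.covariance and statistics.correlation directly):

    import math

    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]

    n = len(x)
    mx, my = sum(x) / n, sum(y) / n

    # Sample covariance: average co-movement of x and y around their means (n - 1 denominator)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

    # Pearson r: covariance standardized by both sample standard deviations
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
    r = cov / (sx * sy)

    print(round(cov, 3), round(r, 3))   # 1.5 0.775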
Regression
A statistical method for modeling the relationship between a dependent variable and one or more independent variables. Linear regression fits the best straight line through the data.

R-Squared (R²)
The coefficient of determination; the proportion of variance in the dependent variable explained by the independent variables in a regression model. Ranges from 0 to 1.
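A minimal simple-linear-regression sketch reusing the same made-up x and y as the correlation example: fit the least-squares line and report R², which for a single predictor equals r² (about 0.775² ≈ 0.6 here). Python 3.10+ also offers statistics.linear_regression for the fit itself.

    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]

    n = len(x)
    mx, my = sum(x) / n, sum(y) / n

    # Least-squares slope and intercept
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    intercept = my - slope * mx

    # R²: 1 - (residual sum of squares / total sum of squares)
    predicted = [intercept + slope * xi for xi in x]
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    print(round(slope, 2), round(intercept, 2), round(1 - ss_res / ss_tot, 3))   # 0.6 2.2 0.6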
Population
The complete set of all individuals or observations of interest in a study. Population parameters are typically denoted with Greek letters (μ, σ).

Sample
A subset of a population selected for analysis. Sample statistics are typically denoted with Latin letters (x̄, s) and are used to estimate population parameters.

Sampling
The process of selecting a subset of individuals from a population for study. Methods include simple random, stratified, cluster, and systematic sampling.

Sampling Distribution
The probability distribution of a statistic obtained by drawing all possible samples of the same size from a population. The sampling distribution of the mean is approximately normal by the CLT.

Bias
A systematic error that causes the expected value of an estimator to differ from the true parameter. Examples include selection bias, response bias, and measurement bias.

Bessel's Correction
The use of n − 1 instead of n in the denominator when calculating sample variance. This correction provides an unbiased estimate of the population variance from a sample.
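A simulation shows why n − 1 is used: over many repeated samples, dividing by n systematically underestimates the population variance, while dividing by n − 1 averages out to roughly the true value (an illustrative sketch with an assumed Normal(0, 3) population):

    import random
    import statistics

    random.seed(2)
    true_var = 9.0                       # population variance of Normal(mean=0, sd=3)
    n = 5
    biased, corrected = [], []
    for _ in range(20_000):
        sample = [random.gauss(0, 3) for _ in range(n)]
        corrected.append(statistics.variance(sample))    # divides by n - 1
        biased.append(statistics.pvariance(sample))      # divides by n

    print(round(statistics.mean(biased), 2))     # ≈ 7.2, too low: about (n-1)/n * 9
    print(round(statistics.mean(corrected), 2))  # ≈ 9.0, unbiased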
Sample Size (n)
The number of observations in a sample. Larger samples generally yield more precise estimates and greater statistical power but cost more to collect.

Parameter
A numerical value that summarizes a characteristic of an entire population, such as μ or σ. Usually unknown and estimated from sample statistics.

Statistic
A numerical value computed from a sample, such as x̄ or s. Used to estimate population parameters.

Estimator
A rule or formula for calculating an estimate of a parameter from sample data. Good estimators are unbiased, consistent, and efficient.

Outlier
A data point that is markedly different from other observations. Common detection methods include flagging values beyond ±2 or ±3 standard deviations from the mean, or outside Q1 − 1.5×IQR and Q3 + 1.5×IQR.
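A quick sketch of the 1.5×IQR fence rule for flagging outliers (statistics.quantiles requires Python 3.8+; the data are made up):

    import statistics

    data = [12, 14, 14, 15, 16, 17, 18, 19, 20, 45]   # 45 looks suspicious

    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    print(outliers)    # [45]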
Skewness
A measure of the asymmetry of a probability distribution. Positive skew means the tail extends to the right; negative skew means it extends to the left; zero skew indicates symmetry.

Kurtosis
A measure of the tailedness of a probability distribution. High kurtosis (leptokurtic) indicates heavy tails and a sharp peak; low kurtosis (platykurtic) indicates light tails and a flat peak.

Z-Score
The number of standard deviations a data point is from the mean, calculated as Z = (X − μ) / σ. Z-scores allow comparison of values from different distributions and help identify outliers.
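For example, two exam scores from different classes can be compared by converting each to a z-score (made-up class means and standard deviations):

    # Class A: mean 70, sd 10; Class B: mean 80, sd 5
    def z_score(x, mu, sigma):
        return (x - mu) / sigma

    print(z_score(85, 70, 10))   # 1.5 -> 1.5 standard deviations above the class A mean
    print(z_score(86, 80, 5))    # 1.2 -> 1.2 standard deviations above the class B mean
    # Relative to its own class, the 85 in class A is the more unusual score.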
Percentile
A value below which a given percentage of observations fall. For example, the 90th percentile is the value below which 90% of the data points are found.

Quartiles
Values that divide a sorted data set into four equal parts: Q1 (25th percentile), Q2 (median, 50th percentile), and Q3 (75th percentile). Used to compute the IQR and construct box plots.

Degrees of Freedom (df)
The number of independent values that can vary in a statistical calculation. For the sample standard deviation, df = n − 1, reflecting Bessel's correction.

Frequency Distribution
A summary showing how often each value (or range of values) occurs in a data set. Often visualized with histograms or bar charts.

Robust Statistics
Statistical methods that perform well even when assumptions are violated or when outliers are present. Examples include the median, MAD, and trimmed mean.