Definitions of key terms in statistics, probability, and data analysis.

Mean (μ / x̄)
The arithmetic average of a set of values, calculated by summing all values and dividing by the count. It represents the central tendency of the data and is the most commonly used measure of center.
Median
The middle value in a sorted data set. If there is an even number of values, the median is the average of the two middle values. It is resistant to outliers and is preferred over the mean for skewed distributions.

Mode
The value that appears most frequently in a data set. A data set can have one mode (unimodal), multiple modes (multimodal), or no mode at all. The mode is the only measure of center applicable to nominal categorical data.
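As a quick illustration, the three measures of center defined above can be computed with Python's standard statistics module (a minimal sketch; the data values are made up):

    import statistics

    data = [2, 3, 3, 5, 7, 10]

    print(statistics.mean(data))    # 5.0 -> sum of values divided by the count
    print(statistics.median(data))  # 4.0 -> average of the two middle values (3 and 5)
    print(statistics.mode(data))    # 3   -> the most frequent value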
Weighted Mean
An average in which each data point is multiplied by a weight reflecting its relative importance, then summed and divided by the total weight. Used when observations contribute unequally to the result.

Geometric Mean
The nth root of the product of n positive values. Used for averaging rates, ratios, and percentage changes; always less than or equal to the arithmetic mean for positive data.
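One possible sketch of both averages in Python (the grading scenario and growth figures are made up; statistics.geometric_mean requires Python 3.8+):

    import statistics

    # Weighted mean: course grade where the exam counts twice as much as each quiz
    values  = [80, 90, 70]          # quiz, quiz, exam
    weights = [1, 1, 2]
    weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)
    print(weighted_mean)            # 77.5

    # Geometric mean: average growth factor over three years (+10%, +20%, -5%)
    factors = [1.10, 1.20, 0.95]
    print(statistics.geometric_mean(factors))   # ~1.078, i.e. about +7.8% per year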
Expected Value (E(X))
The long-run average value of a random variable, computed as the probability-weighted sum of all possible outcomes. Symbolized E(X) and equal to the population mean μ.

Standard Deviation (σ / s)
A measure of the amount of variation or dispersion in a set of values. It is the square root of the variance and is expressed in the same units as the original data, making it the most interpretable measure of spread.

Variance (σ² / s²)
The average of the squared differences from the mean. Variance quantifies the degree of spread in a data set and is the square of the standard deviation. Useful in mathematical operations but harder to interpret than σ.
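For example, both the population and sample versions of these spread measures are available in the standard library (illustrative data; the sample versions divide by n − 1, the population versions by n):

    import statistics

    data = [2, 4, 4, 4, 5, 5, 7, 9]

    print(statistics.pstdev(data))     # 2.0   population standard deviation (divide by n)
    print(statistics.pvariance(data))  # 4.0   population variance = pstdev squared
    print(statistics.stdev(data))      # ~2.14 sample standard deviation (divide by n - 1)
    print(statistics.variance(data))   # ~4.57 sample variance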
Range
The difference between the largest and smallest values in a data set. While simple to calculate, it only considers the two extreme values and is highly sensitive to outliers.

Interquartile Range (IQR)
The difference between the 75th percentile (Q3) and the 25th percentile (Q1). The IQR measures the spread of the middle 50% of data and is resistant to outliers, making it ideal for skewed distributions.
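A small sketch of the IQR in Python (statistics.quantiles requires Python 3.8+; different quartile conventions can give slightly different cut points):

    import statistics

    data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

    q1, q2, q3 = statistics.quantiles(data, n=4)   # default "exclusive" method
    print(q1, q2, q3)        # 4.5 10.0 15.5
    print(q3 - q1)           # 11.0 -> interquartile range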
Mean Absolute Deviation
The average of the absolute differences between each data point and the mean. Less affected by extreme values than variance because deviations are not squared.

Coefficient of Variation (CV)
The ratio of the standard deviation to the mean, expressed as a percentage (CV = σ/μ × 100%). It allows comparison of variability between data sets with different units or scales.

Relative Standard Deviation (RSD)
Another name for the coefficient of variation expressed as a percentage. Common in analytical chemistry and quality control to express precision.

Standard Error (SE)
The standard deviation of the sampling distribution of a statistic, most commonly the mean. SE = σ/√n, decreasing as sample size increases. Used to construct confidence intervals.
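For instance, the standard error of the mean is simply the sample standard deviation divided by the square root of n (a minimal sketch with made-up measurements):

    import math
    import statistics

    data = [12, 15, 14, 10, 13, 16, 11, 14]

    s = statistics.stdev(data)              # sample standard deviation
    se = s / math.sqrt(len(data))           # standard error of the mean
    print(round(s, 3), round(se, 3))        # 2.031 0.718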
Pooled Standard Deviation
A combined estimate of spread from two or more groups, computed as the square root of the degrees-of-freedom-weighted average of the group variances, assuming equal population variances. Used in two-sample t-tests and effect size calculations.

Geometric Standard Deviation
A measure of spread for log-normally distributed data, defined as the exponential of the standard deviation of the logarithms. Always greater than or equal to 1.

Normal Distribution
A symmetric, bell-shaped probability distribution where the mean, median, and mode are all equal. Defined by two parameters: the mean (μ) and standard deviation (σ). Many natural phenomena follow an approximately normal distribution.

Standard Normal Distribution
A normal distribution with mean 0 and standard deviation 1, denoted N(0, 1). Any normal variable can be standardized to this distribution using a z-score transformation.

Empirical Rule (68–95–99.7 Rule)
For normally distributed data, approximately 68% of values fall within ±1σ, 95% within ±2σ, and 99.7% within ±3σ of the mean. A quick way to assess data spread.
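The empirical rule is easy to check by simulation (an illustrative sketch; the exact proportions vary slightly from run to run):

    import random

    random.seed(0)
    mu, sigma = 100, 15
    data = [random.gauss(mu, sigma) for _ in range(100_000)]

    for k in (1, 2, 3):
        within = sum(mu - k * sigma <= x <= mu + k * sigma for x in data) / len(data)
        print(f"within ±{k}σ: {within:.3f}")   # roughly 0.683, 0.954, 0.997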
Log-Normal Distribution
A continuous probability distribution of a variable whose logarithm is normally distributed. Common for variables that are products of many positive factors, such as income or stock prices.

t-Distribution
A symmetric, bell-shaped distribution similar to the normal but with heavier tails. Used for inference about the mean when the population standard deviation is unknown and the sample size is small.

Chi-Square Distribution (χ²)
A right-skewed distribution that arises in tests of independence, goodness of fit, and variance. Defined by its degrees of freedom k; it is the distribution of the sum of k squared independent standard normal variables.

F-Distribution
The distribution of the ratio of two independent chi-square variables, each divided by its degrees of freedom. Used in ANOVA and tests comparing variances.

Uniform Distribution
A distribution in which every outcome in a given range is equally likely. May be discrete (e.g., a fair die) or continuous (e.g., a random number between 0 and 1).

Binomial Distribution
The discrete probability distribution of the number of successes in n independent yes/no trials, each with the same success probability p. Parameters: n and p.
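Binomial probabilities follow directly from the definition, P(X = k) = C(n, k) * p^k * (1 − p)^(n − k). A small sketch using math.comb (Python 3.8+), with a made-up coin-flipping example:

    from math import comb

    n, p = 10, 0.5                      # 10 fair coin flips

    def binom_pmf(k):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    print(binom_pmf(5))                            # 0.24609375 -> P(exactly 5 heads)
    print(sum(binom_pmf(k) for k in range(11)))    # 1.0 -> all probabilities sum to 1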
Poisson Distribution
A discrete distribution expressing the probability of a given number of events occurring in a fixed interval of time or space, given a constant mean rate λ.

Exponential Distribution
A continuous probability distribution describing the time between events in a Poisson process. Memoryless and parameterized by rate λ.

Probability Distribution
A mathematical function that gives the probabilities of occurrence of different possible outcomes. Can be discrete (described by a probability mass function, PMF) or continuous (described by a probability density function, PDF).

Central Limit Theorem (CLT)
States that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the population's distribution. Foundational to statistical inference.
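A quick simulation illustrates the theorem: sample means drawn from a strongly skewed (exponential) population still cluster symmetrically around the population mean, with spread close to σ/√n (an illustrative sketch; exact numbers depend on the random seed):

    import random
    import statistics

    random.seed(1)
    n = 50                                   # size of each sample
    sample_means = [
        statistics.mean(random.expovariate(1.0) for _ in range(n))
        for _ in range(10_000)
    ]

    # Exponential(rate=1) has mean 1 and standard deviation 1,
    # so the sample means should average ~1 with spread ~1/sqrt(50) ≈ 0.14.
    print(statistics.mean(sample_means))     # ≈ 1.0
    print(statistics.stdev(sample_means))    # ≈ 0.14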
Bernoulli Trial
A single experiment with exactly two possible outcomes, success or failure, with a fixed probability p of success. The building block of the binomial distribution.

Hypothesis Testing
A statistical method for making decisions about a population using sample data. Involves comparing a test statistic to a critical value, or a p-value to a significance level, to decide whether to reject the null hypothesis.

Null Hypothesis (H₀)
A default statement that there is no effect or no difference. Researchers seek evidence to reject the null in favor of the alternative hypothesis.

Alternative Hypothesis (H₁)
The statement that contradicts the null hypothesis, typically representing the effect or difference the researcher hopes to demonstrate. Can be one-sided or two-sided.

p-Value
The probability of observing a test statistic at least as extreme as the one actually obtained, assuming the null hypothesis is true. Smaller p-values provide stronger evidence against the null hypothesis.

Significance Level (α)
The threshold probability at which a researcher rejects the null hypothesis, commonly set at 0.05. Equal to the probability of a Type I error.

Type I Error
Rejecting the null hypothesis when it is actually true (a false positive). Its probability equals the significance level α.

Type II Error
Failing to reject the null hypothesis when it is actually false (a false negative). Its probability is denoted β; statistical power equals 1 − β.

Statistical Power
The probability that a hypothesis test correctly rejects a false null hypothesis (1 − β). Higher with larger samples, larger effect sizes, and higher α.

Confidence Interval (CI)
A range of values, computed from sample data, likely to contain the true population parameter with a specified level of confidence (e.g., 95%). Wider intervals indicate less precision.

Confidence Level
The long-run proportion of confidence intervals (built from repeated samples) that contain the true parameter. Common values: 90%, 95%, 99%.

Margin of Error
The half-width of a confidence interval, equal to the critical value times the standard error. Indicates the maximum expected sampling error.
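As a rough illustration of the last two entries, a large-sample 95% confidence interval for a mean is the sample mean plus or minus 1.96 standard errors (made-up data; with only 12 observations a t critical value of about 2.2 would strictly be more appropriate than the normal-based 1.96):

    import math
    import statistics

    data = [12, 15, 14, 10, 13, 16, 11, 14, 13, 15, 12, 14]   # made-up measurements

    mean = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(len(data))
    margin = 1.96 * se                       # margin of error at ~95% confidence
    print(f"{mean:.2f} ± {margin:.2f}")      # 13.25 ± 1.00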
t-Test
A hypothesis test using the t-distribution to compare means. Variants include one-sample, independent two-sample, and paired t-tests.

ANOVA
Analysis of Variance. A family of methods that compare means across three or more groups by partitioning total variability into between-group and within-group components.

Effect Size
A quantitative measure of the magnitude of a phenomenon, independent of sample size. Common measures include Cohen's d, Pearson's r, and odds ratios.

Correlation Coefficient
A value between −1 and 1 that measures the strength and direction of the linear relationship between two variables. Values near ±1 indicate a strong linear relationship; 0 indicates no linear association.

Pearson Correlation (r)
The most common correlation coefficient, measuring the linear relationship between two continuous variables. Sensitive to outliers and assumes interval-level measurement.

Spearman Rank Correlation (ρ)
A non-parametric correlation based on the ranks of the data rather than raw values. Measures the strength of any monotonic relationship and is robust to outliers.

Covariance
A measure of the joint variability of two random variables. Positive when variables tend to move together, negative when they move in opposite directions; the standardized form is the Pearson correlation.
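A hand-rolled sketch of covariance and the Pearson correlation with made-up data (Python 3.10+ also provides statistics.covariance and statistics.correlation directly):

    import math

    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]

    n = len(x)
    mx, my = sum(x) / n, sum(y) / n

    # Sample covariance: average co-movement of x and y around their means (n - 1 denominator)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

    # Pearson r: covariance standardized by both sample standard deviations
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
    r = cov / (sx * sy)

    print(round(cov, 3), round(r, 3))   # 1.5 0.775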
Regression
A statistical method for modeling the relationship between a dependent variable and one or more independent variables. Linear regression fits the best straight line through the data.

R-Squared (R²)
The coefficient of determination; the proportion of variance in the dependent variable explained by the independent variables in a regression model. Ranges from 0 to 1.
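A minimal simple-linear-regression sketch reusing the same made-up x and y as the correlation example: fit the least-squares line and report R², which for a single predictor equals r² (about 0.775² ≈ 0.6 here). Python 3.10+ also offers statistics.linear_regression for the fit itself.

    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]

    n = len(x)
    mx, my = sum(x) / n, sum(y) / n

    # Least-squares slope and intercept
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    intercept = my - slope * mx

    # R²: 1 - (residual sum of squares / total sum of squares)
    predicted = [intercept + slope * xi for xi in x]
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    print(round(slope, 2), round(intercept, 2), round(1 - ss_res / ss_tot, 3))   # 0.6 2.2 0.6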
Population
The complete set of all individuals or observations of interest in a study. Population parameters are typically denoted with Greek letters (μ, σ).

Sample
A subset of a population selected for analysis. Sample statistics are typically denoted with Latin letters (x̄, s) and are used to estimate population parameters.

Sampling
The process of selecting a subset of individuals from a population for study. Methods include simple random, stratified, cluster, and systematic sampling.

Sampling Distribution
The probability distribution of a statistic obtained by drawing all possible samples of the same size from a population. The sampling distribution of the mean is approximately normal by the CLT.

Bias
A systematic error that causes the expected value of an estimator to differ from the true parameter. Examples include selection bias, response bias, and measurement bias.

Bessel's Correction
The use of n − 1 instead of n in the denominator when calculating sample variance. This correction provides an unbiased estimate of the population variance from a sample.
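A simulation shows why n − 1 is used: over many repeated samples, dividing by n systematically underestimates the population variance, while dividing by n − 1 averages out to roughly the true value (an illustrative sketch with an assumed Normal(0, 3) population):

    import random
    import statistics

    random.seed(2)
    true_var = 9.0                       # population variance of Normal(mean=0, sd=3)
    n = 5
    biased, corrected = [], []
    for _ in range(20_000):
        sample = [random.gauss(0, 3) for _ in range(n)]
        corrected.append(statistics.variance(sample))    # divides by n - 1
        biased.append(statistics.pvariance(sample))      # divides by n

    print(round(statistics.mean(biased), 2))     # ≈ 7.2, too low: about (n-1)/n * 9
    print(round(statistics.mean(corrected), 2))  # ≈ 9.0, unbiased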
Sample Size (n)
The number of observations in a sample. Larger samples generally yield more precise estimates and greater statistical power but cost more to collect.

Parameter
A numerical value that summarizes a characteristic of an entire population, such as μ or σ. Usually unknown and estimated from sample statistics.

Statistic
A numerical value computed from a sample, such as x̄ or s. Used to estimate population parameters.

Estimator
A rule or formula for calculating an estimate of a parameter from sample data. Good estimators are unbiased, consistent, and efficient.

Outlier
A data point that is markedly different from other observations. Common detection methods include flagging values beyond ±2 or ±3 standard deviations from the mean, or outside Q1 − 1.5×IQR and Q3 + 1.5×IQR.
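A quick sketch of the 1.5×IQR fence rule for flagging outliers (statistics.quantiles requires Python 3.8+; the data are made up):

    import statistics

    data = [12, 14, 14, 15, 16, 17, 18, 19, 20, 45]   # 45 looks suspicious

    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    print(outliers)    # [45]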
Skewness
A measure of the asymmetry of a probability distribution. Positive skew means the tail extends to the right; negative skew means it extends to the left; zero skew indicates symmetry.

Kurtosis
A measure of the tailedness of a probability distribution. High kurtosis (leptokurtic) indicates heavy tails and a sharp peak; low kurtosis (platykurtic) indicates light tails and a flat peak.

Z-Score
The number of standard deviations a data point is from the mean, calculated as Z = (X − μ) / σ. Z-scores allow comparison of values from different distributions and help identify outliers.
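For example, two exam scores from different classes can be compared by converting each to a z-score (made-up class means and standard deviations):

    # Class A: mean 70, sd 10; Class B: mean 80, sd 5
    def z_score(x, mu, sigma):
        return (x - mu) / sigma

    print(z_score(85, 70, 10))   # 1.5 -> 1.5 standard deviations above the class A mean
    print(z_score(86, 80, 5))    # 1.2 -> 1.2 standard deviations above the class B mean
    # Relative to its own class, the 85 in class A is the more unusual score.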
Percentile
A value below which a given percentage of observations fall. For example, the 90th percentile is the value below which 90% of the data points are found.

Quartiles
Values that divide a sorted data set into four equal parts: Q1 (25th percentile), Q2 (median, 50th percentile), and Q3 (75th percentile). Used to compute the IQR and construct box plots.

Degrees of Freedom (df)
The number of independent values that can vary in a statistical calculation. For the sample standard deviation, df = n − 1, reflecting Bessel's correction.

Frequency Distribution
A summary showing how often each value (or range of values) occurs in a data set. Often visualized with histograms or bar charts.

Robust Statistics
Statistical methods that perform well even when assumptions are violated or when outliers are present. Examples include the median, MAD, and trimmed mean.