Why Robust Statistics?
Standard deviation (SD) is a powerful measure of spread, but it has a critical weakness: extreme sensitivity to outliers. A single extreme value can dramatically inflate the SD, giving a misleading picture of typical variation.
Robust statistics provide measures of spread that resist the influence of outliers, making them essential for real-world data where measurement errors, data entry mistakes, or genuine extreme cases are common.
Example: The Outlier Effect
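A minimal sketch of the effect, using a small made-up sample (the values and variable names are illustrative assumptions, not data from a real study). Appending one extreme value to six tightly clustered readings multiplies the SD many times over, while the median barely moves.

```python
import numpy as np

# Hypothetical readings clustered around 11-12 (illustrative values only)
clean = np.array([10, 12, 11, 13, 12, 11], dtype=float)
# Same readings plus one extreme value, e.g. a data entry mistake
contaminated = np.append(clean, 100.0)

sd_clean = np.std(clean, ddof=1)                # sample SD without the outlier
sd_contaminated = np.std(contaminated, ddof=1)  # sample SD with the outlier

print(f"SD without outlier: {sd_clean:.2f}")         # 1.05
print(f"SD with outlier:    {sd_contaminated:.2f}")  # 33.46
print(f"Median without outlier: {np.median(clean):.1f}")         # 11.5
print(f"Median with outlier:    {np.median(contaminated):.1f}")  # 12.0
```

One added value out of seven raises the SD roughly thirtyfold, even though six of the seven readings still lie between 10 and 13.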
Breakdown Point
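The breakdown point of an estimator is the smallest fraction of the data that, if replaced by arbitrarily extreme values, can make the estimate arbitrarily large. It is 0% for the SD (one bad value is enough), 25% for the IQR, and 50% for the MAD. The sketch below illustrates this on a made-up sample of 20 values; the helper `spread_after_corruption` and the corruption value `1e6` are assumptions chosen for illustration.

```python
import numpy as np

def spread_after_corruption(k, n=20, bad_value=1e6):
    """Return (SD, IQR, MAD) after replacing the k largest of n values with bad_value."""
    data = np.arange(1, n + 1, dtype=float)  # clean sample: 1, 2, ..., n
    if k > 0:
        data[-k:] = bad_value                # corrupt the k largest observations
    sd = np.std(data, ddof=1)
    q1, q3 = np.percentile(data, [25, 75])
    med = np.median(data)
    mad = np.median(np.abs(data - med))
    return sd, q3 - q1, mad

# Corrupt 0%, 5%, 20%, 25%, 45% and 50% of the sample
for k in (0, 1, 4, 5, 9, 10):
    sd, iqr, mad = spread_after_corruption(k)
    print(f"{k:2d}/20 corrupted  SD={sd:12.1f}  IQR={iqr:12.1f}  MAD={mad:12.1f}")
```

In this sample the SD explodes as soon as one value is corrupted, the IQR holds until 5 of 20 values (25%) are corrupted, and the MAD holds until 10 of 20 (50%) are.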
Median Absolute Deviation (MAD)
MAD is among the most robust measures of spread: its breakdown point is 50%, the highest possible. It is the median of the absolute deviations from the median:
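MAD Formula
MAD = median(|xᵢ − median(x)|)
Find Median: Compute the median of the data.
Calculate Deviations: Take the absolute difference between each value and that median.
Find MAD: Take the median of those absolute deviations.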
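The three steps can be traced on a small sample. The values below are illustrative assumptions, and the variable names are hypothetical.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 100], dtype=float)

# Step 1: find the median of the data
median = np.median(data)         # 12.0

# Step 2: absolute deviation of every value from that median
abs_dev = np.abs(data - median)  # 2, 0, 1, 1, 0, 1, 88

# Step 3: the MAD is the median of those absolute deviations
mad_value = np.median(abs_dev)   # 1.0

print(f"median = {median}, MAD = {mad_value}")
```

The outlier contributes a deviation of 88, but because only the middle of the sorted deviations matters, the MAD stays at 1.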
Scaling MAD to estimate σ: For normally distributed data, MAD ≈ 0.6745 × σ. To estimate SD from MAD, multiply by 1.4826:
SD Estimate from MAD
σ ≈ 1.4826 × MAD
Why 1.4826?
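For a normal distribution, half of the probability mass lies within Φ⁻¹(0.75) ≈ 0.6745 standard deviations of the center, so the population MAD equals 0.6745 × σ and the correction factor is its reciprocal. A quick numerical check, assuming SciPy is available (`norm.ppf` is the inverse of the standard normal CDF):

```python
from scipy.stats import norm

# 75th percentile of the standard normal: half the mass lies within +/- this value
z75 = norm.ppf(0.75)
# The reciprocal converts a MAD into an estimate of sigma for normal data
scale = 1.0 / z75

print(f"{z75:.4f}")    # 0.6745
print(f"{scale:.4f}")  # 1.4826
```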
Interquartile Range (IQR)
IQR measures the spread of the middle 50% of the data, that is, the distance between the 25th percentile (Q1) and the 75th percentile (Q3):
IQR Formula
IQR = Q3 − Q1
IQR is widely used because it's simple to understand, easy to visualize in box plots, and forms the basis of the common "1.5×IQR rule" for outlier detection.
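A minimal sketch of the 1.5×IQR rule (Tukey's fences): values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are flagged. The function name `iqr_outliers` and the sample data are assumptions for illustration. `np.percentile` uses linear interpolation by default, and other quartile conventions can give slightly different fences.

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] and the two fences."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower = float(q1 - k * iqr)
    upper = float(q3 + k * iqr)
    flagged = data[(data < lower) | (data > upper)]
    return flagged, (lower, upper)

flagged, fences = iqr_outliers([10, 12, 11, 13, 12, 11, 100])
print(fences)   # (8.75, 14.75)
print(flagged)  # [100.]
```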
Scaling IQR to estimate σ: For normally distributed data, IQR ≈ 1.349 × σ. To estimate SD from IQR, divide by 1.349:
SD Estimate from IQR
σ ≈ IQR / 1.349
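The scaling factor is the width of the central 50% of a standard normal distribution, 2 × Φ⁻¹(0.75). A small check, assuming SciPy is available; the sample is simulated with an arbitrary seed and a true σ of 2, both chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

# Width of the middle 50% of a standard normal distribution
iqr_factor = 2 * norm.ppf(0.75)  # approximately 1.349

# Simulated normal sample with known sigma = 2 (illustrative assumption)
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=2.0, size=10_000)

q1, q3 = np.percentile(sample, [25, 75])
sd_from_iqr = (q3 - q1) / iqr_factor  # rescale the IQR to the SD scale

print(f"IQR factor: {iqr_factor:.3f}")
print(f"SD estimate from IQR: {sd_from_iqr:.2f}")
print(f"Sample SD:            {np.std(sample, ddof=1):.2f}")
```

On clean normal data the two estimates should land close to each other and to the true value of 2.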
Comparing Robust Measures
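Standard Deviation: Breakdown point 0%. A single extreme value can inflate it without bound, but it is the natural choice when the data are clean and close to normal.
MAD: Breakdown point 50%. Multiply by 1.4826 to estimate σ for normal data.
IQR: Breakdown point 25%. Divide by 1.349 to estimate σ for normal data.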
When to Use Robust Statistics
- Exploratory analysis: When you don't know if outliers exist, start with robust measures
- Data quality issues: When data may contain errors or measurement problems
- Heavy-tailed distributions: When extreme values are expected (financial returns, insurance claims)
- Small samples: When outliers have outsized impact due to few observations
- Outlier detection: Using SD to detect outliers is circular, because the outliers inflate the very SD used to flag them; use IQR or MAD instead
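To illustrate the circularity in the last point, the sketch below compares classic z-scores with a MAD-based robust z-score on a small made-up sample. The cutoff of 3 and the function names are assumptions for illustration, and cutoffs used in practice vary. The outlier inflates the SD so much that its own z-score stays below 3, while the robust score flags it clearly.

```python
import numpy as np

def z_scores(data):
    """Classic z-scores: distance from the mean in units of sample SD."""
    data = np.asarray(data, dtype=float)
    return (data - data.mean()) / data.std(ddof=1)

def robust_z_scores(data):
    """Distance from the median in units of scaled MAD (assumes MAD > 0)."""
    data = np.asarray(data, dtype=float)
    med = np.median(data)
    scaled_mad = 1.4826 * np.median(np.abs(data - med))
    return (data - med) / scaled_mad

data = [10, 12, 11, 13, 12, 11, 100]
cutoff = 3.0  # illustrative threshold, not a universal standard

z = z_scores(data)
rz = robust_z_scores(data)

print(f"z-score of 100:        {z[-1]:.2f}")   # 2.27, below the cutoff
print(f"robust z-score of 100: {rz[-1]:.1f}")  # 59.4, far above the cutoff
print("Flagged by SD: ", np.asarray(data)[np.abs(z) > cutoff])
print("Flagged by MAD:", np.asarray(data)[np.abs(rz) > cutoff])
```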
Implementation Examples
import numpy as np

def mad(data):
    """Median Absolute Deviation (unscaled)."""
    data = np.asarray(data, dtype=float)  # accept plain lists as well as arrays
    median = np.median(data)
    return np.median(np.abs(data - median))

def scaled_mad(data):
    """MAD scaled to estimate SD (for normal data)."""
    return 1.4826 * mad(data)

def iqr(data):
    """Interquartile Range: 75th percentile minus 25th percentile."""
    q1, q3 = np.percentile(data, [25, 75])
    return q3 - q1

# Compare on data with one outlier (the value 100)
data = [10, 12, 11, 13, 12, 11, 100]
print(f"SD: {np.std(data, ddof=1):.2f}")      # SD: 33.46
print(f"MAD: {mad(data):.2f}")                # MAD: 1.00
print(f"Scaled MAD: {scaled_mad(data):.2f}")  # Scaled MAD: 1.48
print(f"IQR: {iqr(data):.2f}")                # IQR: 1.50
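SciPy ships equivalents of these helpers, which can serve as a cross-check. A short sketch, assuming SciPy 1.5 or later (where `scipy.stats.median_abs_deviation` was introduced); passing `scale="normal"` applies the normal-consistency scaling discussed earlier.

```python
from scipy import stats

data = [10, 12, 11, 13, 12, 11, 100]

mad_raw = stats.median_abs_deviation(data)                  # unscaled MAD
mad_sd = stats.median_abs_deviation(data, scale="normal")   # MAD scaled to estimate SD
iqr_raw = stats.iqr(data)                                   # interquartile range
iqr_sd = stats.iqr(data, scale="normal")                    # IQR scaled to estimate SD

print(f"MAD: {mad_raw:.2f}, scaled: {mad_sd:.2f}")
print(f"IQR: {iqr_raw:.2f}, scaled: {iqr_sd:.2f}")
```

The two scaled values are both estimates of σ under a normality assumption, so they are directly comparable to each other and to the SD.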