SDCalc
Intermediate · Core Concepts · 12 min

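Robust Statistics: MAD, IQR, and Outlier-Resistant Methods

A complete guide to robust statistics, covering the median absolute deviation (MAD) and the interquartile range (IQR). Learn when to use outlier-resistant measures of spread, with examples and Python code.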

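Why Do We Need Robust Statistics?

The standard deviation is a powerful measure of spread, but it has a critical weakness: it is extremely sensitive to outliers. Because each deviation from the mean is squared, a single extreme value can inflate the standard deviation dramatically and give a distorted picture of typical variability.

Robust statistics provide measures of spread that resist the influence of outliers. They are indispensable for real-world data, which commonly contain measurement errors, data-entry mistakes, or genuine extreme cases.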

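Example: The Effect of an Outlier

Data: 10, 12, 11, 13, 12, 11, 100 (one outlier)
Sample standard deviation: 33.5 (dominated by the outlier)
MAD: 1.0 (ignores the outlier)
IQR: 1.5 (ignores the outlier; quartiles computed by linear interpolation)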
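To make the effect concrete, the short sketch below computes the sample standard deviation and the median for the same values with and without the outlier. It uses only Python's standard library; the variable names are illustrative and not part of the original article.

```python
import statistics

clean = [10, 12, 11, 13, 12, 11]
with_outlier = clean + [100]

# Sample SD uses the n - 1 denominator
sd_clean = statistics.stdev(clean)
sd_outlier = statistics.stdev(with_outlier)

# The median barely moves when the outlier is added
med_clean = statistics.median(clean)
med_outlier = statistics.median(with_outlier)

print(f"without outlier: SD = {sd_clean:.2f}, median = {med_clean}")
print(f"with outlier:    SD = {sd_outlier:.2f}, median = {med_outlier}")
```

One added value changes the standard deviation by more than an order of magnitude, while the median shifts by only half a unit.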

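Breakdown Point

The "breakdown point" of a statistic is the largest fraction of the data that can be replaced by arbitrarily extreme values before the statistic becomes meaningless. The standard deviation has a breakdown point of 0%: a single outlier can ruin it. MAD has a breakdown point of 50%, the highest possible, so it stays informative as long as fewer than half of the observations are outliers. IQR has a breakdown point of 25%, because once a quarter or more of the data are outliers on one side, a quartile itself lands among the outliers.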
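The breakdown points can be checked empirically. The sketch below is a minimal illustration, assuming 20 made-up baseline values and an arbitrary outlier value of 1e6; `spread_measures` is a hypothetical helper. It replaces a growing number of the largest values with the outlier and prints each measure.

```python
import numpy as np

def spread_measures(values):
    """Return (sample SD, MAD, IQR) for a 1-D array."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    q1, q3 = np.percentile(values, [25, 75])
    return np.std(values, ddof=1), mad, q3 - q1

base = np.linspace(10.0, 14.0, 20)   # 20 well-behaved values (illustrative)
results = {}
for k in [0, 1, 4, 6, 9, 10]:        # number of values replaced by an outlier
    contaminated = base.copy()
    if k:
        contaminated[-k:] = 1e6      # overwrite the k largest values
    results[k] = spread_measures(contaminated)
    sd, mad, iqr = results[k]
    print(f"{k / base.size:4.0%} outliers: SD={sd:12.1f}  MAD={mad:12.2f}  IQR={iqr:12.2f}")
```

In this setup the standard deviation explodes with a single outlier, IQR holds at 20% contamination but fails at 30%, and MAD holds at 45% but fails at 50%.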

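Median Absolute Deviation (MAD)

MAD is among the most robust measures of spread. It is the median of the absolute deviations of each value from the median:

MAD Formula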

MAD = median(|xᵢ - median(x)|)
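1. Compute the median

Compute the median of the dataset.

2. Compute the deviations

Subtract the median from each value and take the absolute value.

3. Compute MAD

Compute the median of these absolute deviations.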
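The three steps can be followed one at a time in code. This is a minimal sketch using the standard library and the example data from earlier in the article; the variable names are illustrative.

```python
import statistics

data = [10, 12, 11, 13, 12, 11, 100]

# Step 1: median of the data
center = statistics.median(data)

# Step 2: absolute deviation of each value from the median
abs_devs = [abs(x - center) for x in data]

# Step 3: median of those absolute deviations
mad = statistics.median(abs_devs)

print("median:", center)
print("absolute deviations:", abs_devs)
print("MAD:", mad)
```

The outlier contributes one very large absolute deviation (88), but because step 3 takes a median rather than a mean, that value does not move the result.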

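Estimating σ from MAD: For normally distributed data, MAD ≈ 0.6745 × σ. To estimate the standard deviation from MAD, multiply by 1.4826:

Standard Deviation Estimated from MAD

σ̂ = 1.4826 × MAD

Why 1.4826?

This scaling factor comes from the relationship between MAD and the standard deviation under a normal distribution: 0.6745 is the 75th percentile of the standard normal distribution, and 1.4826 ≈ 1 / 0.6745. It makes the scaled MAD a consistent estimator of the true standard deviation when the data are normal; in small samples it can still be slightly biased.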
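The constant can be reproduced numerically. For a normal distribution, half of the probability lies within ±0.6745σ of the median, so MAD equals σ times the 75th percentile of the standard normal. The sketch below uses `statistics.NormalDist` from the standard library (Python 3.8+); the variable names are illustrative.

```python
from statistics import NormalDist

# 75th percentile of the standard normal distribution
z75 = NormalDist().inv_cdf(0.75)

mad_over_sigma = z75   # MAD = z75 * sigma for normal data
scale = 1 / z75        # therefore sigma = scale * MAD

print(f"MAD / sigma  = {mad_over_sigma:.4f}")
print(f"scale factor = {scale:.4f}")
```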

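Interquartile Range (IQR)

IQR measures the spread of the middle 50% of the data, that is, the range between the 25th and 75th percentiles:

IQR Formula

IQR = Q3 - Q1 = 75th percentile - 25th percentile

IQR is widely used because it is easy to understand, easy to visualize in a box plot, and the basis of the common "1.5 × IQR rule" for detecting outliers: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are flagged.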
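A minimal sketch of the 1.5 × IQR rule follows. The helper name `iqr_outliers` and its `k` parameter are hypothetical, and the quartiles use NumPy's default linear interpolation; other quartile conventions can give slightly different fences.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] and the fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return values[mask], (lower, upper)

data = [10, 12, 11, 13, 12, 11, 100]
flagged, (lower, upper) = iqr_outliers(data)
print(f"fences: [{lower}, {upper}]")
print("flagged:", flagged)
```

On the example data the fences are 8.75 and 14.75, so only the value 100 is flagged.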

Estimating σ from IQR: For normal data, IQR ≈ 1.349 × σ. To estimate the standard deviation from IQR:

Standard Deviation Estimated from IQR

σ̂ = IQR / 1.349 ≈ 0.7413 × IQR
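The IQR constant follows from the same normal quantile: Q1 and Q3 sit about 0.6745σ below and above the mean, so IQR ≈ 2 × 0.6745 × σ. The sketch below derives the constant and checks it on simulated normal data. The seed, sample size, and true σ of 4.0 are arbitrary choices made for illustration.

```python
from statistics import NormalDist
import numpy as np

# IQR = 2 * z75 * sigma for normal data
iqr_over_sigma = 2 * NormalDist().inv_cdf(0.75)

rng = np.random.default_rng(0)   # fixed seed for repeatability
sample = rng.normal(loc=50.0, scale=4.0, size=100_000)
q1, q3 = np.percentile(sample, [25, 75])
sigma_hat = (q3 - q1) / iqr_over_sigma

print(f"IQR / sigma = {iqr_over_sigma:.4f}")
print(f"estimated sigma = {sigma_hat:.2f} (true value 4.0)")
```

The estimate only approximates σ when the data are close to normal; for skewed or heavy-tailed data the conversion factor does not apply.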

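Comparison of Robust Measures

Standard Deviation

Uses every data point · Most efficient for normal data · Very sensitive to outliers · Breakdown point: 0%

MAD

Most robust of the three · Uses the median (not the mean) · Unaffected by outliers as long as they make up less than half of the data · Breakdown point: 50%

IQR

Easy to understand · Used in box plots · Ignores the lowest 25% and highest 25% of values · Breakdown point: 25%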

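When to Use Robust Statistics

  • Exploratory analysis: when you do not know whether outliers are present, start with robust measures
  • Data quality problems: when the data may contain errors or measurement problems
  • Heavy-tailed distributions: when extreme values are expected (financial returns, insurance claims)
  • Small samples: when a single outlier has outsized influence because there are few observations
  • Outlier detection: using the standard deviation to detect outliers is circular, because the outliers inflate the very threshold meant to catch them; use IQR or MAD instead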
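The circularity in the last point can be seen on the article's own example data. The sketch below compares classical z-scores with a MAD-based robust score. The cutoff of 3.0 is an illustrative convention, not a universal rule, and the variable names are assumptions.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 100], dtype=float)

# Classical z-scores: the outlier inflates both the mean and the SD
z = (data - data.mean()) / data.std(ddof=1)

# Robust scores: center on the median, scale by 1.4826 * MAD
med = np.median(data)
mad = np.median(np.abs(data - med))
robust_z = (data - med) / (1.4826 * mad)

threshold = 3.0  # illustrative cutoff
classical_flags = data[np.abs(z) > threshold]
robust_flags = data[np.abs(robust_z) > threshold]

print("largest classical |z|:", round(float(np.abs(z).max()), 2))
print("classical flags:", classical_flags)
print("robust flags:   ", robust_flags)
```

The value 100 gets a classical z-score of only about 2.27, so a 3-SD rule misses it, while the MAD-based score flags it clearly. Note that this sketch divides by MAD, which is zero when more than half of the values are identical; real code would need to handle that case.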

Implementation Example

Python
import numpy as np

def mad(data):
    """Median absolute deviation (unscaled)."""
    data = np.asarray(data, dtype=float)  # accept lists as well as arrays
    median = np.median(data)
    return np.median(np.abs(data - median))

def scaled_mad(data):
    """MAD scaled to estimate the SD (valid for normal data)."""
    return 1.4826 * mad(data)

def iqr(data):
    """Interquartile range, using NumPy's default linear interpolation."""
    q1, q3 = np.percentile(data, [25, 75])
    return q3 - q1

# Compare the measures on data containing one outlier
data = [10, 12, 11, 13, 12, 11, 100]
print(f"SD: {np.std(data, ddof=1):.2f}")  # sample SD (n - 1 denominator)
print(f"MAD: {mad(data):.2f}")
print(f"Scaled MAD: {scaled_mad(data):.2f}")
print(f"IQR: {iqr(data):.2f}")
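If SciPy is available, the same quantities can be obtained from library functions instead of hand-written helpers. The sketch below assumes SciPy 1.5 or later, where `scipy.stats.median_abs_deviation` was introduced; it is offered as an alternative, not as the article's own method.

```python
import numpy as np
from scipy import stats

data = np.array([10, 12, 11, 13, 12, 11, 100], dtype=float)

mad_raw = stats.median_abs_deviation(data)                    # unscaled MAD
mad_sigma = stats.median_abs_deviation(data, scale="normal")  # MAD scaled to estimate sigma
iqr_raw = stats.iqr(data)                                     # Q3 - Q1

print(f"MAD: {mad_raw:.2f}")
print(f"Scaled MAD: {mad_sigma:.2f}")
print(f"IQR: {iqr_raw:.2f}")
```

With `scale="normal"`, SciPy divides MAD by the normal-consistency constant (about 0.6745), which matches multiplying by 1.4826.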

How to Read This Article

A statistics tutorial is a practical interpretation guide, not just a formula dump. It refers to the assumptions, notation, and reporting language that analysts need when they explain a result to a teacher, manager, client, or reviewer. The article body covers the specific topic, while the sections below create a common interpretation frame that readers can reuse across related metrics.

Reading goal | What to focus on | Common mistake
Definition | What the metric is and what quantity it summarizes | Treating the formula as self-explanatory
Formula choice | Sample versus population assumptions and notation | Using n when n-1 is required or vice versa
Interpretation | Whether the result indicates concentration, spread, or risk | Calling a large value good or bad without context

Frequently Asked Questions

How should I interpret a high standard deviation?

A high standard deviation means the observations are spread farther from the mean on average. Whether that spread is acceptable depends on the context: wide dispersion might signal risk in finance, instability in manufacturing, or genuine natural variation in scientific data.

Why do some articles mention n while others mention n-1?

The denominator reflects the difference between population and sample formulas. Population variance and population standard deviation use N because the full dataset is known. Sample variance and sample standard deviation often use n-1 because Bessel’s correction reduces bias when estimating population spread from a sample.

What is a statistical interpretation guide?

A statistical interpretation guide is a page that moves beyond arithmetic and explains meaning. It tells you what a metric is, when the formula applies, and how to describe the result in plain English without overstating certainty.

Can I cite this article in a report?

You should cite the underlying authoritative reference for formal work whenever possible. This page is best used as an explanatory bridge that helps you understand the concept before quoting the original standard or handbook.

Why include direct citations on every article page?

Direct citations give readers a route to verify the definition, notation, and assumptions. That improves trust and reduces the chance that a simplified explanation is mistaken for the entire technical standard.

Authoritative References

These sources define the concepts referenced most often across our articles. Bessel's correction is a sample adjustment, variance is a squared measure of spread, and standard deviation is the square root of variance expressed in the same units as the data.