Robust Statistics: MAD, IQR, and Outlier-Resistant Methods

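Why Do We Need Robust Statistics?

The standard deviation is a powerful measure of spread, but it has a serious weakness: extreme sensitivity to outliers. A single extreme value can inflate the SD dramatically and give a misleading picture of typical variability.

Robust statistics provide measures of spread that resist the influence of outliers. They are essential for real-world data, which may contain measurement errors, data-entry mistakes, or genuine extreme cases.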

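Example: The Effect of an Outlier

Data: 10, 12, 11, 13, 12, 11, 100 (one outlier)
Standard deviation: 33.5 (sample SD with the n-1 denominator; dominated by the outlier)
MAD: 1.0 (unaffected by the outlier)
IQR: 1.5 (unaffected by the outlier; quartiles computed by linear interpolation)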

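Breakdown Point

The breakdown point of a statistic is the largest fraction of the data that can be replaced with arbitrarily extreme values before the statistic itself can be pushed arbitrarily far off. The SD has a breakdown point of 0%: a single outlier is enough to ruin it. The MAD has a breakdown point of 50%, the highest attainable, so it stays bounded as long as fewer than half of the observations are outliers. The IQR has a breakdown point of 25%, because contaminating more than a quarter of the data on one side moves a quartile.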
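The breakdown behavior can be checked with a small experiment. The sketch below is illustrative only: the clean data (the integers 1 to 100), the outlier value 1000, and the helper names `mad`, `iqr`, and `contaminate` are assumptions made for this example, and it uses only Python's standard library.

```python
import statistics

def mad(values):
    """Median of the absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

def iqr(values):
    """Q3 - Q1 using linearly interpolated ("inclusive") quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def contaminate(clean, fraction, outlier=1000.0):
    """Replace the largest `fraction` of the points with one extreme value."""
    k = round(len(clean) * fraction)
    return clean[: len(clean) - k] + [outlier] * k

# Hypothetical clean data: 1.0, 2.0, ..., 100.0
clean = [float(v) for v in range(1, 101)]

for fraction in (0.0, 0.01, 0.20, 0.30, 0.45):
    sample = contaminate(clean, fraction)
    print(
        f"{fraction:>4.0%} outliers  "
        f"SD={statistics.stdev(sample):7.1f}  "
        f"MAD={mad(sample):5.1f}  "
        f"IQR={iqr(sample):7.2f}"
    )
```

In this setup a single outlier (1%) more than triples the SD. The IQR holds at 49.5 through 20% contamination and jumps to 974.25 at 30%, once the upper quartile lands on an outlier. The MAD drifts from 25 to 45 but stays within the scale of the clean data even at 45% contamination.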

Median Absolute Deviation (MAD)

The MAD is one of the most robust measures of spread. It is the median of the absolute deviations from the median.

MAD Formula

MAD = median(|xᵢ - median(x)|)

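Find the median

Compute the median of the dataset.

Compute the deviations

Subtract the median from each value and take the absolute value.

Find the MAD

Compute the median of these absolute deviations.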
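The three steps can be traced on the example dataset from earlier in the article. This is a minimal sketch using the standard-library `statistics` module; the variable names are chosen for illustration.

```python
import statistics

data = [10, 12, 11, 13, 12, 11, 100]

# Step 1: median of the data
center = statistics.median(data)

# Step 2: absolute deviation of each value from that median
abs_dev = [abs(x - center) for x in data]

# Step 3: median of the absolute deviations
mad_value = statistics.median(abs_dev)

print("median:", center)                 # 12
print("absolute deviations:", abs_dev)   # [2, 0, 1, 1, 0, 1, 88]
print("MAD:", mad_value)                 # 1
```

The outlier contributes a deviation of 88, but because the final step takes a median rather than a mean, that single large value does not move the result.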

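Converting MAD to an estimate of σ: For normally distributed data, MAD ≈ 0.6745 × σ. To estimate the SD from the MAD, multiply by 1.4826 (≈ 1/0.6745).

Estimating SD from MAD

σ̂ = 1.4826 × MAD

Why 1.4826?

This scaling factor comes from the relationship between the MAD and the SD under a normal distribution: half of a normal population lies within 0.6745σ of its center, so MAD ≈ 0.6745σ and 1/0.6745 ≈ 1.4826. With this factor, the scaled MAD is a consistent estimator of the true standard deviation when the data are normally distributed. It is not exactly unbiased in small samples.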
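The constant can be reproduced numerically. For a normal distribution, half of the probability mass lies within one MAD of the center, so MAD/σ equals the 75th percentile of the standard normal. The sketch below computes that percentile with the standard library's `NormalDist` and then checks the scaled MAD on simulated data; the sample size, seed, mean, and σ = 2.0 are arbitrary assumptions for illustration.

```python
import random
from statistics import NormalDist, median

# 75th percentile of the standard normal: the ratio MAD / sigma
q75 = NormalDist().inv_cdf(0.75)
scale = 1 / q75

print(f"MAD / sigma = {q75:.4f}")    # 0.6745
print(f"sigma / MAD = {scale:.4f}")  # 1.4826

# Empirical check on simulated normal data (parameters chosen arbitrarily)
random.seed(0)
sigma = 2.0
sample = [random.gauss(10.0, sigma) for _ in range(20_000)]

center = median(sample)
mad_value = median([abs(x - center) for x in sample])
sigma_hat = scale * mad_value

print(f"scaled MAD = {sigma_hat:.3f} (true sigma = {sigma})")
```

The scaled MAD should land close to the true σ for a large normal sample, though it fluctuates more from sample to sample than the ordinary SD does.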

Interquartile Range (IQR)

The IQR measures the spread of the middle 50% of the data: the range between the 25th and 75th percentiles.

IQR Formula

IQR = Q3 - Q1 = 75th percentile - 25th percentile

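The IQR is widely used because it is easy to understand, easy to visualize in a box plot, and the basis of the common 1.5×IQR rule for outlier detection.

Converting IQR to an estimate of σ: For normal data, IQR ≈ 1.349 × σ. To estimate the SD from the IQR:

Estimating SD from IQR

σ̂ = IQR / 1.349 ≈ 0.7413 × IQR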
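As a rough illustration, the IQR and the corresponding σ estimate can be computed with the standard library. One assumption matters here: quartile conventions differ between tools, and this sketch uses `statistics.quantiles` with `method="inclusive"` (linear interpolation), which gives quartiles of 11 and 12.5 for the example data. The helper names `iqr` and `sigma_from_iqr` are illustrative.

```python
import statistics
from statistics import NormalDist

def iqr(values):
    """Q3 - Q1 using linearly interpolated ("inclusive") quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Width of the middle 50% of a standard normal distribution (about 1.349)
NORMAL_IQR = 2 * NormalDist().inv_cdf(0.75)

def sigma_from_iqr(values):
    """Estimate sigma from the IQR, assuming roughly normal data."""
    return iqr(values) / NORMAL_IQR

data = [10, 12, 11, 13, 12, 11, 100]
print(f"IQR: {iqr(data):.2f}")                    # 1.50
print(f"sigma estimate: {sigma_from_iqr(data):.3f}")  # 1.112
```

With the default `method="exclusive"`, the same data would give a different IQR, so it is worth stating which quartile definition was used when reporting the result.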

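Comparing Measures of Spread

Standard Deviation

Uses every data point · Most efficient for normal data · Highly sensitive to outliers · Breakdown point: 0%

MAD

Among the most robust measures · Uses the median (not the mean) · Resists outliers as long as they make up less than half of the data · Breakdown point: 50%

IQR

Easy to understand · Used in box plots · Ignores the outermost 50% of the data (25% in each tail) · Breakdown point: 25%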

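When to Use Robust Statistics

Exploratory analysis: Start with robust measures when you do not know whether outliers are present
Data quality concerns: Data that may contain errors or measurement problems
Heavy-tailed distributions: When extreme values are expected (financial returns, insurance claims)
Small samples: When there are few observations, so each outlier has more influence
Outlier detection: Detecting outliers with the SD is circular, because the outliers inflate the very SD used to flag them. Use the IQR or MAD instead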
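The circularity in the last point can be seen on the example dataset. The sketch below compares three flagging rules: the 1.5×IQR fences, a scaled-MAD score, and the usual mean and SD z-score. The cutoff of 3 for both score-based rules is a common convention assumed here, not a value taken from this article, and the function names are illustrative.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

def mad_outliers(values, threshold=3.0):
    """Flag points more than `threshold` scaled MADs from the median."""
    center = statistics.median(values)
    scaled_mad = 1.4826 * statistics.median([abs(v - center) for v in values])
    if scaled_mad == 0:
        return []  # more than half the values are identical; the rule is undefined
    return [v for v in values if abs(v - center) / scaled_mad > threshold]

def sd_outliers(values, threshold=3.0):
    """Flag points more than `threshold` sample SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

data = [10, 12, 11, 13, 12, 11, 100]
print("IQR rule:", iqr_outliers(data))  # [100]
print("MAD rule:", mad_outliers(data))  # [100]
print("SD rule: ", sd_outliers(data))   # []
```

The SD rule misses the value 100 because that value pulls the mean up to about 24.1 and inflates the SD to about 33.5, leaving it only about 2.3 SDs from the mean. The two robust rules measure distance against a center and spread that the outlier barely affects.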

Implementation Example

Python

import numpy as np

def mad(data):
    """Median absolute deviation from the median."""
    data = np.asarray(data, dtype=float)  # accept plain lists as well as arrays
    median = np.median(data)
    return np.median(np.abs(data - median))

def scaled_mad(data):
    """MAD scaled to estimate SD (for normal data)."""
    return 1.4826 * mad(data)

def iqr(data):
    """Interquartile range, using NumPy's default linear-interpolation percentiles."""
    q1, q3 = np.percentile(data, [25, 75])
    return q3 - q1

# Compare on data with one outlier
data = [10, 12, 11, 13, 12, 11, 100]
print(f"SD: {np.std(data, ddof=1):.2f}")        # sample SD: 33.46
print(f"MAD: {mad(data):.2f}")                  # 1.00
print(f"Scaled MAD: {scaled_mad(data):.2f}")    # 1.48
print(f"IQR: {iqr(data):.2f}")                  # 1.50

Frequently Asked Questions

How should I interpret a high standard deviation?

A high standard deviation means the observations are spread farther from the mean on average. Whether that spread is acceptable depends on the context: wide dispersion might signal risk in finance, instability in manufacturing, or genuine natural variation in scientific data.

Why do some articles mention n while others mention n-1?

The denominator reflects the difference between population and sample formulas. Population variance and population standard deviation use N because the full dataset is known. Sample variance and sample standard deviation often use n-1 because Bessel’s correction reduces bias when estimating population spread from a sample.
