Robust Statistics: MAD, IQR, and Outlier-Resistant Methods

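Why Robust Statistics?

The standard deviation is a powerful measure of spread, but it has a critical weakness: extreme sensitivity to outliers. A single extreme value can inflate the SD sharply and give a misleading picture of typical variation.

Robust statistics provide measures of spread that resist the influence of outliers. They are essential for real-world data, where measurement errors, data-entry mistakes, and genuine extreme cases are common.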

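Example: The Impact of an Outlier

Data: 10, 12, 11, 13, 12, 11, 100 (one outlier)
Standard deviation: 33.5 (sample SD; dominated by the outlier)
MAD: 1.0 (unaffected by the outlier's magnitude)
IQR: 1.5 (unaffected by the outlier's magnitude)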
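To see the effect concretely, here is a minimal sketch that uses only Python's standard `statistics` module. The helper names `mad` and `iqr` are illustrative. The quartiles use `method="inclusive"`, an assumed convention based on linear interpolation; other quartile conventions give slightly different IQR values on small samples.

```python
import statistics

def mad(values):
    """Median of the absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

def iqr(values):
    """Q3 - Q1, with quartiles from linear interpolation (an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

base = [10, 12, 11, 13, 12, 11]
datasets = {
    "no outlier": base,
    "outlier = 100": base + [100],
    "outlier = 10000": base + [10000],
}

for label, values in datasets.items():
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    print(f"{label:>16}: SD={sd:9.2f}  MAD={mad(values):.2f}  IQR={iqr(values):.2f}")
```

Making the outlier a hundred times larger inflates the SD roughly a hundredfold, while the MAD and IQR do not change at all, because they depend only on the ordering of the central values.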

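Breakdown Point

A statistic's breakdown point is the largest fraction of the data that can be replaced by arbitrarily extreme values before the statistic becomes meaningless. The SD has a breakdown point of 0%: a single outlier is enough to break it. The MAD has a breakdown point of 50%, so it still works when nearly half of the data are outliers. The IQR has a breakdown point of 25%, because each quartile is protected by only a quarter of the data.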
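The breakdown points can be checked empirically. The sketch below is an illustration under assumed settings: 100 evenly spaced values between 40 and 60, with a chosen fraction overwritten by the value 1,000,000. The function name `spread_under_contamination` is hypothetical.

```python
import numpy as np

def mad(x):
    """Median absolute deviation from the median."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

def iqr(x):
    """Interquartile range with NumPy's default linear interpolation."""
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1

def spread_under_contamination(fraction, n=100, bad_value=1e6):
    """Return (SD, MAD, IQR) after overwriting a fraction of the data."""
    data = np.linspace(40.0, 60.0, n)   # well-behaved baseline data
    k = int(round(fraction * n))        # number of contaminated points
    data[:k] = bad_value                # replace them with an extreme value
    return np.std(data, ddof=1), mad(data), iqr(data)

for fraction in (0.0, 0.10, 0.30, 0.45):
    sd, m, q = spread_under_contamination(fraction)
    print(f"{fraction:4.0%} contaminated: SD={sd:11.1f}  MAD={m:9.1f}  IQR={q:11.1f}")
```

With 10% contamination only the SD explodes. At 30% the IQR fails as well, because the upper quartile now falls among the contaminated points. The MAD stays bounded even at 45%.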

Median Absolute Deviation (MAD)

The MAD is among the most robust measures of spread. It is the median of the absolute deviations from the median:

MAD Formula

MAD = median(|xᵢ - median(x)|)

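Find the median

Compute the median of the data.

Compute the deviations

Subtract the median from each value and take the absolute value.

Find the MAD

Compute the median of these absolute deviations.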
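The three steps can be traced on the example data from earlier (10, 12, 11, 13, 12, 11, 100). This is a minimal sketch using Python's standard library; the variable names are illustrative.

```python
import statistics

data = [10, 12, 11, 13, 12, 11, 100]

# Step 1: median of the data
center = statistics.median(data)

# Step 2: absolute deviation of each value from the median
abs_deviations = [abs(x - center) for x in data]

# Step 3: median of those absolute deviations
mad = statistics.median(abs_deviations)

print("median:", center)                       # median: 12
print("absolute deviations:", abs_deviations)  # [2, 0, 1, 1, 0, 1, 88]
print("MAD:", mad)                             # MAD: 1
```

The outlier contributes a deviation of 88, but because the final step takes a median rather than a mean, that single large deviation does not move the result.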

Using the MAD to estimate σ: for normally distributed data, MAD ≈ 0.6745 × σ. To estimate the SD from the MAD, multiply by 1.4826 (≈ 1/0.6745):

Estimating SD from MAD

σ̂ = 1.4826 × MAD

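Why 1.4826?

This scaling factor comes from the relationship between the MAD and the SD under a normal distribution: it equals 1/0.6745, the reciprocal of the 75th percentile of the standard normal distribution. It makes the scaled MAD a consistent estimator of the true standard deviation when the data are normally distributed.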
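The constant can be verified numerically. For a normal distribution, half of the observations lie within Φ⁻¹(0.75) standard deviations of the median, so MAD = Φ⁻¹(0.75) × σ. The sketch below computes that quantile with the standard library's `NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# 75th percentile of the standard normal: P(|Z| <= z75) = 0.5
z75 = NormalDist().inv_cdf(0.75)

mad_over_sigma = z75      # the MAD expressed in units of sigma
scale_factor = 1 / z75    # multiply the MAD by this to estimate sigma

print(f"MAD / sigma  = {mad_over_sigma:.4f}")  # 0.6745
print(f"scale factor = {scale_factor:.4f}")    # 1.4826
```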

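Interquartile Range (IQR)

The IQR measures the spread of the middle 50% of the data, that is, the range between the 25th and 75th percentiles:

IQR Formula

IQR = Q3 - Q1 = 75th percentile - 25th percentile

The IQR is widely used because it is easy to understand, easy to visualize in a box plot, and the basis of the common "1.5×IQR rule" for outlier detection.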
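A minimal sketch of the 1.5×IQR rule: values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR are flagged as potential outliers. The function name `iqr_outliers` is hypothetical, and the quartiles use the standard library's `method="inclusive"` convention as an assumption; other quartile definitions shift the fences slightly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)

data = [10, 12, 11, 13, 12, 11, 100]
flagged, (lower, upper) = iqr_outliers(data)
print(f"fences: [{lower}, {upper}]")  # fences: [8.75, 14.75]
print(f"flagged: {flagged}")          # flagged: [100]
```

Because the fences are built from quartiles, the outlier itself has no influence on where they are placed.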

Estimating σ from the IQR: for normally distributed data, IQR ≈ 1.349 × σ. To estimate the SD from the IQR:

Estimating SD from IQR

σ̂ = IQR / 1.349 ≈ 0.7413 × IQR
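As a rough check, both scaled estimators should land near the true σ on normally distributed data. The sketch below draws a simulated sample and compares the three estimates. The mean of 100, σ of 15, sample size, and seed are arbitrary assumptions.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(42)
true_sigma = 15.0
sample = rng.normal(loc=100.0, scale=true_sigma, size=100_000)

# Normal-theory constants from the 75th percentile of N(0, 1)
z75 = NormalDist().inv_cdf(0.75)
mad_over_sigma = z75        # about 0.6745
iqr_over_sigma = 2 * z75    # about 1.349

q1, q3 = np.percentile(sample, [25, 75])
mad = np.median(np.abs(sample - np.median(sample)))

sd_estimate = np.std(sample, ddof=1)
sigma_from_mad = mad / mad_over_sigma
sigma_from_iqr = (q3 - q1) / iqr_over_sigma

print(f"sample SD        : {sd_estimate:.2f}")
print(f"sigma from MAD   : {sigma_from_mad:.2f}")
print(f"sigma from IQR   : {sigma_from_iqr:.2f}")
print(f"1 / {iqr_over_sigma:.4f} = {1 / iqr_over_sigma:.4f}")
```

On clean normal data the SD is the most precise of the three; the robust estimates give up some of that efficiency in exchange for protection against outliers.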

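Comparing Robust Measures

Standard Deviation

Uses all the data · Most efficient for normal distributions · Very sensitive to outliers · Breakdown point: 0%

MAD

Among the most robust measures · Uses the median (not the mean) · Largely unaffected by outliers · Breakdown point: 50%

IQR

Easy to understand · Used in box plots · Ignores the most extreme 50% (lowest 25% and highest 25%) · Breakdown point: 25%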

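When to Use Robust Statistics

Exploratory analysis: start with robust measures when you do not yet know whether outliers are present
Data quality problems: when the data may contain errors or measurement problems
Heavy-tailed distributions: when extreme values are expected (financial returns, insurance claims)
Small samples: when having few observations gives any single outlier excessive influence
Outlier detection: detecting outliers with the SD is circular, because the outliers inflate the very SD used to judge them; use the IQR or MAD instead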
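The circularity in SD-based outlier detection is easy to demonstrate on the example data. The sketch below compares a classical z-score with a MAD-based robust score. The cutoff of 3 and the function names are illustrative assumptions, not a fixed standard.

```python
import statistics

def z_scores(values):
    """Classical z-scores: distance from the mean in sample SDs."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def robust_scores(values):
    """Distance from the median in units of the scaled MAD (assumes MAD > 0)."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    sigma_hat = 1.4826 * mad
    return [(v - center) / sigma_hat for v in values]

data = [10, 12, 11, 13, 12, 11, 100]
cutoff = 3.0  # assumed threshold for illustration

classical = [v for v, z in zip(data, z_scores(data)) if abs(z) > cutoff]
robust = [v for v, z in zip(data, robust_scores(data)) if abs(z) > cutoff]

print("flagged by z-score:     ", classical)  # []
print("flagged by robust score:", robust)     # [100]
```

The value 100 reaches a z-score of only about 2.27, because it inflates both the mean and the SD it is measured against. In a sample of 7, no z-score based on the sample SD can exceed (7 - 1)/√7 ≈ 2.27, so a cutoff of 3 can never fire. The robust score for the same value is about 59.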

Implementation Example

Python

import numpy as np

def mad(data):
    """Median Absolute Deviation: median of |x - median(x)|"""
    data = np.asarray(data, dtype=float)  # accept lists as well as arrays
    median = np.median(data)
    return np.median(np.abs(data - median))

def scaled_mad(data):
    """MAD scaled to estimate SD (for normal data)"""
    return 1.4826 * mad(data)

def iqr(data):
    """Interquartile Range (NumPy's default linear interpolation)"""
    q1, q3 = np.percentile(data, [25, 75])
    return q3 - q1

# Compare on data with one outlier
data = [10, 12, 11, 13, 12, 11, 100]
print(f"SD: {np.std(data, ddof=1):.2f}")        # sample SD -> 33.46
print(f"MAD: {mad(data):.2f}")                  # 1.00
print(f"Scaled MAD: {scaled_mad(data):.2f}")    # 1.48
print(f"IQR: {iqr(data):.2f}")                  # 1.50

A statistics tutorial is a practical interpretation guide, not just a formula dump. It refers to the assumptions, notation, and reporting language that analysts need when they explain a result to a teacher, manager, client, or reviewer. The article body covers the specific topic, while the sections below create a common interpretation frame that readers can reuse across related metrics.

Reading goal	What to focus on	Common mistake
Definition	What the metric is and what quantity it summarizes	Treating the formula as self-explanatory
Formula choice	Sample versus population assumptions and notation	Using n when n-1 is required or vice versa
Interpretation	Whether the result indicates concentration, spread, or risk	Calling a large value good or bad without context

Frequently Asked Questions

How should I interpret a high standard deviation?

A high standard deviation means the observations are spread farther from the mean on average. Whether that spread is acceptable depends on the context: wide dispersion might signal risk in finance, instability in manufacturing, or genuine natural variation in scientific data.

Why do some articles mention n while others mention n-1?

The denominator reflects the difference between population and sample formulas. Population variance and population standard deviation use N because the full dataset is known. Sample variance and sample standard deviation often use n-1 because Bessel’s correction reduces bias when estimating population spread from a sample.
