Robust Statistics: MAD, IQR, and Outlier-Resistant Methods

Why use robust statistics?

Standard deviation is a powerful measure of spread, but it has one serious weakness: it is extremely sensitive to outliers. A single extreme value can inflate the SD substantially and give a misleading picture of typical variability.

Robust statistics provide measures of spread that resist the influence of outliers, which makes them essential for real-world data where measurement errors, data-entry mistakes, or genuinely extreme cases are common.

Example: The effect of an outlier

Data: 10, 12, 11, 13, 12, 11, 100 (one outlier)
Sample standard deviation: 33.5 (dominated by the outlier)
MAD: 1.0 (barely affected by the outlier)
IQR: 1.5 (barely affected by the outlier; quartiles computed by linear interpolation)
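
To see how much of each number is due to the single extreme value, the same three measures can be computed with and without it. This is a minimal sketch, assuming NumPy is available; the helper name spread_summary is made up for this illustration, and the quartiles use NumPy's default linear interpolation (other quartile conventions give slightly different IQR values).

```python
import numpy as np

def spread_summary(values):
    """Return (sample SD, MAD, IQR) for a 1-D sequence of numbers."""
    x = np.asarray(values, dtype=float)
    sd = np.std(x, ddof=1)                      # sample standard deviation (n - 1)
    mad = np.median(np.abs(x - np.median(x)))   # median absolute deviation
    q1, q3 = np.percentile(x, [25, 75])         # default linear interpolation
    return sd, mad, q3 - q1

clean = [10, 12, 11, 13, 12, 11]
with_outlier = clean + [100]

for label, values in [("without outlier", clean), ("with outlier", with_outlier)]:
    sd, mad, iqr = spread_summary(values)
    print(f"{label:>16}: SD={sd:.2f}  MAD={mad:.2f}  IQR={iqr:.2f}")
```

Adding the value 100 moves the SD from about 1 to about 33, while the MAD and IQR each change by only 0.5.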

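Breakdown point

The "breakdown point" of a statistic is the fraction of the data that can be arbitrarily extreme before the statistic becomes meaningless. The SD has a breakdown point of 0% (a single outlier can ruin it). The MAD has a breakdown point of 50%: nearly half the data can be outliers and it still works. The IQR has a breakdown point of 25%, because each quartile only tolerates contamination in the outer quarter of the data.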
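
The breakdown points can be observed empirically by corrupting a growing share of a small data set. The sketch below is illustrative only: the 20-value base sample, the corrupting value of 1e6, and the helper names are all assumptions chosen for the demonstration, and the exact fraction at which a statistic fails depends on the sample size and the quartile convention.

```python
import numpy as np

def mad(x):
    """Median absolute deviation (unscaled)."""
    return np.median(np.abs(x - np.median(x)))

def iqr(x):
    """Interquartile range with NumPy's default linear interpolation."""
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1

def contaminate(x, k, bad_value=1e6):
    """Return a sorted copy of x with its k largest values replaced by bad_value."""
    out = np.sort(x).copy()
    if k > 0:
        out[-k:] = bad_value
    return out

base = np.arange(1.0, 21.0)  # 20 well-behaved values: 1, 2, ..., 20

for k in [0, 1, 4, 6, 9, 10]:
    x = contaminate(base, k)
    print(f"{k / len(base):>4.0%} corrupted: "
          f"SD={np.std(x, ddof=1):12.1f}  MAD={mad(x):10.1f}  IQR={iqr(x):10.1f}")
```

In this setup the SD explodes as soon as one value is corrupted, the IQR stays small at 20% corruption but fails at 30%, and the MAD stays small at 45% and fails at 50%.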

Median absolute deviation (MAD)

The MAD is the most robust of the common measures of spread. It is the median of the absolute deviations from the median:

MAD formula

MAD = median(|xᵢ - median(x)|)

Find the median

Compute the median of the data set.

Compute the deviations

Subtract the median from each value and take the absolute value.

Find the MAD

Compute the median of these absolute deviations.
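
The three steps above can be traced on the example data set. This is a small sketch assuming NumPy; the variable names are chosen for illustration.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 100], dtype=float)

# Step 1: find the median of the data.
center = np.median(data)            # 12.0

# Step 2: absolute deviation of every value from that median.
abs_dev = np.abs(data - center)     # [2, 0, 1, 1, 0, 1, 88]

# Step 3: the MAD is the median of those absolute deviations.
mad_value = np.median(abs_dev)      # 1.0

print(f"median={center}, deviations={abs_dev.tolist()}, MAD={mad_value}")
```

The outlier contributes a deviation of 88, but because only the middle of the sorted deviations matters, it has no effect on the result.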

Scaling the MAD to estimate σ: For normally distributed data, MAD ≈ 0.6745 × σ. To estimate the SD from the MAD, multiply by 1.4826:

Estimating SD from MAD

σ̂ = 1.4826 × MAD

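Why 1.4826?

This scaling factor comes from the relationship between the MAD and the SD under a normal distribution: the MAD of a normal variable equals σ times the 75th percentile of the standard normal distribution, about 0.6745, and 1 / 0.6745 ≈ 1.4826. The factor makes the scaled MAD a consistent estimator of the true standard deviation when the data are normally distributed.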
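
The constant can be reproduced from the standard normal quantile function. The sketch below uses NormalDist from Python's standard statistics module; the variable names are illustrative.

```python
from statistics import NormalDist

# For a normal variable, half of the absolute deviations |X - median| fall
# below sigma * z75, where z75 is the 75th percentile of the standard normal.
z75 = NormalDist().inv_cdf(0.75)

# MAD = z75 * sigma, so sigma = MAD / z75.
mad_scale = 1 / z75

print(f"z75 = {z75:.4f}")              # about 0.6745
print(f"1 / z75 = {mad_scale:.4f}")    # about 1.4826
```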

Interquartile range (IQR)

The IQR measures the spread of the middle 50% of the data, that is, the distance between the 25th and 75th percentiles:

IQR formula

IQR = Q3 - Q1 = 75th percentile - 25th percentile

The IQR is widely used because it is simple, easy to visualize in a box plot, and the basis of the popular "1.5×IQR" rule for detecting outliers.

Scaling the IQR to estimate σ: For normal data, IQR ≈ 1.349 × σ. To estimate the SD from the IQR:

Estimating SD from IQR

σ̂ = IQR / 1.349 ≈ 0.7413 × IQR
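
Both scaling rules can be checked on simulated normal data, where the true σ is known. This is a sketch under stated assumptions: the mean of 50, σ of 4, sample size, and seed are arbitrary choices for the illustration, and the estimates only agree with σ because the sample really is normal.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
true_sigma = 4.0
sample = rng.normal(loc=50.0, scale=true_sigma, size=100_000)

sd = np.std(sample, ddof=1)
mad = np.median(np.abs(sample - np.median(sample)))
q1, q3 = np.percentile(sample, [25, 75])

sigma_from_mad = 1.4826 * mad        # scaled MAD
sigma_from_iqr = (q3 - q1) / 1.349   # scaled IQR

print(f"sample SD:      {sd:.3f}")
print(f"1.4826 x MAD:   {sigma_from_mad:.3f}")
print(f"IQR / 1.349:    {sigma_from_iqr:.3f}")
```

All three values should land close to 4. On data that are not normal, the scaled MAD and scaled IQR are still valid measures of spread, but they no longer estimate the standard deviation.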

Comparing the measures of spread

Standard deviation

Uses every data point · Most efficient for normal data · Very sensitive to outliers · Breakdown point: 0%

MAD

The most robust of the three · Uses the median (not the mean) · Unaffected by outliers as long as they make up less than half the data · Breakdown point: 50%

IQR

Easy to understand · Used in box plots · Ignores the most extreme 25% on each side · Breakdown point: 25%

When to use robust statistics

Exploratory analysis: When you do not know whether outliers are present, start with robust measures
Data quality problems: When the data may contain errors or measurement problems
Heavy-tailed distributions: When extreme values are expected (financial returns, insurance claims)
Small samples: When outliers have an outsized effect because there are few observations
Outlier detection: Using the SD to detect outliers is circular, because the outliers inflate the very SD that is supposed to flag them; use the IQR or MAD instead
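
The circularity in the last point can be made concrete by applying three detection rules to the example data. This is a hedged sketch: the function names are hypothetical, the 1.5×IQR fences follow the rule mentioned earlier, and the cutoffs of 3 SDs and 3 scaled MADs are illustrative assumptions rather than fixed standards.

```python
import numpy as np

def sd_outliers(values, cutoff=3.0):
    """Flag values more than `cutoff` sample SDs from the mean."""
    x = np.asarray(values, dtype=float)
    return x[np.abs(x - x.mean()) > cutoff * x.std(ddof=1)]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    spread = q3 - q1
    return x[(x < q1 - k * spread) | (x > q3 + k * spread)]

def mad_outliers(values, cutoff=3.0):
    """Flag values more than `cutoff` scaled MADs from the median."""
    x = np.asarray(values, dtype=float)
    center = np.median(x)
    sigma_hat = 1.4826 * np.median(np.abs(x - center))
    # If sigma_hat is 0 (more than half the values are identical),
    # every value that differs from the median is flagged.
    return x[np.abs(x - center) > cutoff * sigma_hat]

data = [10, 12, 11, 13, 12, 11, 100]
print("SD rule: ", sd_outliers(data).tolist())
print("IQR rule:", iqr_outliers(data).tolist())
print("MAD rule:", mad_outliers(data).tolist())
```

On this data the SD rule flags nothing, because the value 100 inflates the SD to about 33.5 and ends up only about 2.3 SDs from the mean. The IQR and MAD rules both flag 100.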

Implementation example

Python

import numpy as np

def mad(data):
    """Median Absolute Deviation"""
    data = np.asarray(data, dtype=float)  # accept plain lists as well as arrays
    median = np.median(data)
    return np.median(np.abs(data - median))

def scaled_mad(data):
    """MAD scaled to estimate SD (for normal data)"""
    return 1.4826 * mad(data)

def iqr(data):
    """Interquartile Range"""
    return np.percentile(data, 75) - np.percentile(data, 25)

# Compare on data with outlier
data = [10, 12, 11, 13, 12, 11, 100]
print(f"SD: {np.std(data, ddof=1):.2f}")
print(f"MAD: {mad(data):.2f}")
print(f"Scaled MAD: {scaled_mad(data):.2f}")
print(f"IQR: {iqr(data):.2f}")
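
SciPy also ships ready-made versions of these measures. The sketch below assumes a SciPy release that provides scipy.stats.median_abs_deviation (added in SciPy 1.5); passing scale="normal" divides the raw MAD by roughly 0.6745, which is equivalent to multiplying by 1.4826.

```python
import numpy as np
from scipy import stats

data = np.array([10, 12, 11, 13, 12, 11, 100], dtype=float)

raw_mad = stats.median_abs_deviation(data)                     # unscaled MAD
scaled_mad = stats.median_abs_deviation(data, scale="normal")  # estimates sigma
iqr_value = stats.iqr(data)                                    # Q3 - Q1

print(f"MAD: {raw_mad:.2f}")
print(f"Scaled MAD: {scaled_mad:.2f}")
print(f"IQR: {iqr_value:.2f}")
```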

Frequently Asked Questions

How should I interpret a high standard deviation?

A high standard deviation means the observations are spread farther from the mean on average. Whether that spread is acceptable depends on the context: wide dispersion might signal risk in finance, instability in manufacturing, or genuine natural variation in scientific data.

Why do some articles mention n while others mention n-1?

The denominator reflects the difference between population and sample formulas. Population variance and population standard deviation use N because the full dataset is known. Sample variance and sample standard deviation often use n-1 because Bessel’s correction reduces bias when estimating population spread from a sample.
