Quick Answer
Use a 3 standard deviation threshold for conservative outlier screening when the data are stable and roughly normal. Use 2.5 standard deviations when missing a true anomaly is costly and you can review extra flags. Use 2 standard deviations only for early warning, monitoring, or triage, because normal data will naturally produce many values beyond that line.
Standard-deviation outlier rule
The cutoff is the decision. The z-score formula only converts a raw value into standard deviation units. If you need to compute z-scores directly, use the Z-Score Calculator. If you need the mean and sample standard deviation first, use the Sample Standard Deviation Calculator or Descriptive Statistics Calculator.
| Cutoff | Best use | Main risk |
|---|---|---|
| \|z\| >= 2 | Early warning screen or queue for review | Many normal observations are flagged |
| \|z\| >= 2.5 | Moderate anomaly screening with human review | Still sensitive to skew and inflated standard deviation |
| \|z\| >= 3 | Conservative outlier investigation for near-normal data | May miss smaller but operationally important anomalies |
| \|Modified z\| >= 3.5 | Data with suspected outliers already present | Requires median absolute deviation, not standard deviation |
Scenario: Analyst Choosing a Cutoff
A student or analyst often arrives here with a practical question: "I calculated the mean and standard deviation, but where should I draw the outlier line?" A senior statistician would choose the cutoff from three things: the cost of a false alarm, the cost of missing a real issue, and the shape of the distribution.
The objective is not to delete values automatically. The objective is to produce a defensible review list: values beyond the cutoff should be investigated for data entry errors, measurement faults, special-cause events, or genuinely rare cases.
Role-based rule
For the broader method, see Detecting Outliers with Standard Deviation. If the data are skewed, heavy-tailed, or already contaminated by extreme values, compare this rule with Modified Z-Score Outlier Detection and Robust Statistics.
Formula and Threshold Choices
Sample-based z-score
`z = (x - xbar) / s`
Here, x is the value being screened, xbar is the sample mean, and s is the sample standard deviation. If you have the complete population rather than a sample, use the population mean and population standard deviation instead. The interpretation is still distance from the mean in standard deviation units.
- 2 sigma: Good for attention signals, not final outlier claims. In a normal distribution, values beyond 2 standard deviations are uncommon but not rare.
- 2.5 sigma: A useful compromise when a team can inspect a larger review queue and wants to catch moderate anomalies.
- 3 sigma: The standard starting point for conservative outlier investigation under an approximately normal, stable process.
- No fixed cutoff: Use a domain limit instead when a regulatory, clinical, manufacturing, or financial threshold already defines the decision.
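The z-score and cutoff choices above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `z_scores` and `review_list`, the `readings` data, and the chosen `CUTOFF` are all hypothetical. It uses only the standard library and returns a review list rather than deleting anything.

```python
from statistics import mean, stdev, pstdev

def z_scores(values, population=False):
    """Return (value, z) pairs; z is the distance from the mean in SD units."""
    center = mean(values)
    # Sample SD (n - 1 denominator) by default; population SD on request.
    spread = pstdev(values) if population else stdev(values)
    if spread == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(x, (x - center) / spread) for x in values]

def review_list(values, cutoff, population=False):
    """Values with |z| >= cutoff, kept for investigation rather than removed."""
    return [(x, round(z, 2))
            for x, z in z_scores(values, population)
            if abs(z) >= cutoff]

# Hypothetical measurements, for illustration only.
readings = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 10.0, 9.7, 10.1, 13.0]
CUTOFF = 2.5  # fixed before looking at individual rows

print(review_list(readings, CUTOFF))
```

Passing `population=True` switches to the population standard deviation, which matches the case where the data are the complete population rather than a sample.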
Do not tune the cutoff after seeing the weird value
Worked Example: Support Ticket Times
A support analyst reviews same-day resolution times in minutes for 12 tickets: `41, 44, 45, 47, 48, 49, 50, 51, 52, 54, 55, 82`. The team wants to decide whether the 82-minute ticket should be investigated as an unusual case.
The sample mean is 51.5 minutes. The sample standard deviation is about 10.44 minutes (the squared deviations sum to 1199, and 1199 / 11 = 109, whose square root is about 10.44). For the 82-minute ticket, `z = (82 - 51.5) / 10.44 = 2.92`.
| Ticket time | Difference from mean | z-score | Flag at 2? | Flag at 2.5? | Flag at 3? |
|---|---|---|---|---|---|
| 41 | -10.5 | -1.01 | No | No | No |
| 44 | -7.5 | -0.72 | No | No | No |
| 55 | 3.5 | 0.34 | No | No | No |
| 82 | 30.5 | 2.92 | Yes | Yes | No |
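The worked example can be checked with a short script. This sketch recomputes the sample mean, the sample standard deviation, and the z-score for the 82-minute ticket from the twelve values listed above, then tests it against each cutoff. The helper name `flags` is an assumption made for illustration.

```python
from statistics import mean, stdev

# Resolution times in minutes, copied from the worked example.
times = [41, 44, 45, 47, 48, 49, 50, 51, 52, 54, 55, 82]

xbar = mean(times)  # sample mean
s = stdev(times)    # sample standard deviation (n - 1 denominator)

def flags(x, cutoffs=(2, 2.5, 3)):
    """Return the z-score of x and whether |z| reaches each cutoff."""
    z = (x - xbar) / s
    return z, {k: abs(z) >= k for k in cutoffs}

z_82, flagged_82 = flags(82)
print(f"mean={xbar:.2f}  s={s:.2f}  z(82)={z_82:.2f}")
print(flagged_82)
```

The 82-minute ticket is flagged at the 2 and 2.5 cutoffs but not at 3, so the cutoff chosen in advance determines whether it enters the review queue.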
Decision from the example
This example also shows a circular problem: the 82-minute value increases the mean and standard deviation used to judge it. If the same data are screened with a robust method, the modified z-score is much larger. That is why standard-deviation cutoffs work best when extreme contamination is not already driving the spread estimate.
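To make the robust comparison concrete, here is a minimal sketch of a modified z-score for the same ticket data. It assumes the common definition based on the median and the median absolute deviation (MAD), scaled by the constant 0.6745; that constant and the function name `modified_z` are not taken from this article.

```python
from statistics import median

times = [41, 44, 45, 47, 48, 49, 50, 51, 52, 54, 55, 82]

def modified_z(x, values):
    """0.6745 * (x - median) / MAD, where MAD is the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return 0.6745 * (x - med) / mad

print(round(modified_z(82, times), 2))  # 6.26
```

Here the median is 49.5 and the MAD is 3.5, so the 82-minute ticket barely moves the yardstick it is measured against, and its score lands well past a 3.5 cutoff.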
Expected False Flags Under Normality
Under a normal model, the cutoff controls how many ordinary observations you should expect to flag by chance. This is why a 2-sigma screen feels useful on a dashboard but becomes noisy in a large dataset.
| Two-sided cutoff | Approximate normal probability outside cutoff | Expected flags in 100 observations | Expected flags in 10,000 observations |
|---|---|---|---|
| \|z\| >= 2 | About 4.55% | About 5 | About 455 |
| \|z\| >= 2.5 | About 1.24% | About 1 | About 124 |
| \|z\| >= 3 | About 0.27% | About 0 to 1 | About 27 |
| \|z\| >= 3.5 | About 0.047% | About 0 | About 5 |
These probabilities assume a stable normal distribution. If the data are skewed, clustered, autocorrelated, rounded, censored, or mixed from multiple groups, the false-flag table can be misleading. For distribution-based interpretation, compare Standard Deviation and Normal Distribution and Empirical Rule vs Chebyshev's Theorem.
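The false-flag figures in the table can be reproduced from the standard normal distribution. The sketch below uses the identity P(|Z| >= k) = erfc(k / sqrt(2)) with Python's `math.erfc`. The function names are illustrative, and the results hold only under the stable-normal assumption described above.

```python
import math

def two_sided_tail(k):
    """P(|Z| >= k) for a standard normal Z."""
    return math.erfc(k / math.sqrt(2))

def expected_flags(k, n):
    """Expected count of ordinary observations beyond the cutoff among n."""
    return n * two_sided_tail(k)

for k in (2, 2.5, 3, 3.5):
    print(f"|z| >= {k}: {two_sided_tail(k):.3%}  "
          f"per 100: {expected_flags(k, 100):.2f}  "
          f"per 10,000: {expected_flags(k, 10_000):.1f}")
```

Calling `expected_flags` with your own dataset size before choosing a cutoff gives a rough baseline for how long the review queue would be even if nothing were wrong.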
Decision Criteria
Choose the cutoff by matching the statistical flag to the cost of action. A low cutoff creates more review work; a high cutoff misses more borderline anomalies. The right cutoff is the one your team can explain before looking at the flagged rows.
| Situation | Recommended starting point | Reason |
|---|---|---|
| Homework or textbook normal-data exercise | 3 sigma | Matches the common three-sigma outlier rule and keeps false flags low |
| Operations dashboard with analyst review | 2.5 sigma | Catches more candidates while keeping the queue manageable |
| Safety, fraud, or data-quality triage | 2 or 2.5 sigma plus domain rules | Missing a true signal may be more expensive than reviewing extra flags |
| Small sample with one extreme value | Modified z-score or IQR rule | The extreme value can inflate the mean and standard deviation |
| Known specification limit | Use the specification first | A real tolerance, service-level, or regulatory limit outranks a generic sigma rule |
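For the small-sample row in the table above, an IQR rule might look like the sketch below. It assumes the widely used 1.5 x IQR fences, a multiplier this article does not specify, and it relies on the default quartile method of `statistics.quantiles`. Other quartile conventions give slightly different fences. The ticket times are reused as sample data.

```python
from statistics import quantiles

times = [41, 44, 45, 47, 48, 49, 50, 51, 52, 54, 55, 82]

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # default 'exclusive' quartile method
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(times)
flagged = [x for x in times if x < lower or x > upper]
print(lower, upper, flagged)
```

Because quartiles depend on the middle of the data, a single extreme value has little effect on the fences, unlike the mean and standard deviation.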
Threshold Checklist
- State whether the data are a sample or a complete population before calculating the standard deviation.
- Plot or summarize the distribution shape before trusting a normal-based cutoff.
- Pick the cutoff before inspecting individual flagged rows.
- Estimate the expected number of false flags for the dataset size.
- Use robust methods when one or two values already dominate the mean or standard deviation.
- Investigate flagged values; do not delete them without a measurement, entry, or process explanation.
- Document the business or scientific reason for the cutoff, especially when using 2 or 2.5 sigma.
Pre-publish self-check
Weakest Section Rewrite
Weak version: "Use 3 standard deviations because it is a common rule for finding outliers."
Concrete substitution: "Use 3 standard deviations when the process is stable, roughly normal, and the cost of a false alarm is meaningful. In 10,000 normal observations, a 3-sigma two-sided rule still flags about 27 values by chance; a 2-sigma rule flags about 455. That difference changes staffing, alert fatigue, and the credibility of the review queue."