Stat Trek Teach yourself statistics

*	AP and Advanced Placement Program are registered trademarks of the College Board, which was not involved in the production of, and does not endorse this web site.

AP Statistics:
Residuals, Outliers, and Influential Points

A linear regression model is not always appropriate for the data. You can assess the appropriateness of the model by examining residuals, outliers, and influential points.

Residuals

The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual.

Residual = Observed value - Predicted value
e = y - ŷ

For a least-squares regression line, both the sum and the mean of the residuals are equal to zero. That is, Σ e = 0 and ē = 0.
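The zero-sum property can be checked numerically. The following is a minimal sketch, assuming a least-squares line that includes an intercept; the data values and the helper name fit_line are made up for illustration and do not come from this lesson.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)

# Residual = observed value - predicted value.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

total = sum(residuals)
mean = total / len(residuals)
print(f"sum of residuals  = {total:.6f}")
print(f"mean of residuals = {mean:.6f}")
```

Because of floating-point rounding, the printed sum may differ from zero by a tiny amount.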

Residual Plots

A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.

The table below presents results from a hypothetical regression analysis, and the accompanying chart displays those results as a residual plot. In the chart, the independent variable (x) is math aptitude.

The residual plot shows a non-random pattern: negative residuals on the low end of the X axis and positive residuals on the high end. This suggests that a non-linear model would provide a better fit to the data. Alternatively, it may be possible to "transform" the data so that a linear model can be used. We discuss linear transformations in the next lesson.

x	95	85	80	70	60
y	85	95	70	65	70
ŷ	74.05	90.49	68.71	70.159	81.59
e	10.95	4.51	1.29	-5.159	-11.59
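As a quick check, the residual row can be recomputed from the y and ŷ rows. This is only a sketch using the values printed in the table; the variable names are chosen here for illustration.

```python
# Values copied from the table above.
x = [95, 85, 80, 70, 60]
y = [85, 95, 70, 65, 70]
y_hat = [74.05, 90.49, 68.71, 70.159, 81.59]

# Residual = observed value - predicted value, rounded to 3 decimals
# to match the precision of the table.
e = [round(obs - pred, 3) for obs, pred in zip(y, y_hat)]
print(e)  # [10.95, 4.51, 1.29, -5.159, -11.59]

# Pair each residual with its x value and sort by x to see the sign pattern.
for x_value, residual in sorted(zip(x, e)):
    print(x_value, residual)

residual_sum = round(sum(e), 3)
print(residual_sum)
```

Sorted by x, the residuals run from negative at the low end to positive at the high end, and they sum to about 0.001, which is zero up to the rounding of the tabulated values.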

Below, the residual plots show three typical patterns. The first plot shows a random pattern, indicating a good fit for a linear model. The other plot patterns are non-random (U-shaped and inverted U), suggesting a better fit for a non-linear model.


[Figure: three residual plots. Panels: Random pattern; Non-random: U-shaped curve; Non-random: Inverted U]
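To see how a U-shaped pattern can arise, the sketch below fits a straight line to data that actually follow a curve (y = x²) and lists the residuals. The data and the helper name fit_line are assumptions made for illustration, not part of the lesson.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


# Hypothetical curved data: y = x squared.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle, which is the U shape that signals a poor fit for a linear model, even though they still sum to zero.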

Outliers

Data points that diverge from the overall pattern and have large residuals are called outliers.

Outliers limit the fit of the regression equation to the data. This is illustrated in the scatterplots below. The coefficient of determination is bigger when the outlier is not present.

	Without Outlier	With Outlier
Regression equation	ŷ = 104.78 - 4.10x	ŷ = 97.51 - 3.32x
Coefficient of determination	R² = 0.94	R² = 0.55
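The effect of an outlier on the coefficient of determination can be sketched with a small calculation. The data below are hypothetical and do not reproduce the scatterplots in this lesson; the helper name r_squared is likewise an assumption. For simple linear regression, R² equals the squared correlation between x and y, which is what the helper computes.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Hypothetical data that fall close to a straight line.
xs = [1, 2, 3, 4, 5, 6]
ys = [100, 97, 93, 88, 85, 80]

# The same data plus one hypothetical point far below the overall pattern.
xs_outlier = xs + [3.5]
ys_outlier = ys + [60]

r2_without = r_squared(xs, ys)
r2_with = r_squared(xs_outlier, ys_outlier)
print(f"R^2 without outlier: {r2_without:.2f}")
print(f"R^2 with outlier:    {r2_with:.2f}")
```

In this made-up example, a single point with a large residual lowers R² sharply, in line with the comparison shown above.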

Influential Points

Influential points are data points with extreme values that greatly affect the slope of the regression line.

The charts below compare regression statistics for a data set with and without an influential point. The chart on the right has a single influential point, located at the high end of the X axis (where x = 24). As a result of that single influential point, the slope of the regression line changes substantially, from -2.5 to -1.6, so the line becomes noticeably flatter.

Note that this influential point, unlike the outliers discussed above, did not reduce the coefficient of determination. In fact, the coefficient of determination was bigger when the influential point was present.

	Without Influential Point	With Influential Point
Regression equation	ŷ = 92.54 - 2.5x	ŷ = 87.59 - 1.6x
Slope	b₁ = -2.5	b₁ = -1.6
Coefficient of determination	R² = 0.46	R² = 0.52
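The pull of a single extreme point on the slope can also be sketched numerically. The data here are hypothetical and are not the data behind the charts in this lesson; only the idea of adding one point at x = 24 is borrowed from the text. The helper name slope is an assumption.

```python
def slope(xs, ys):
    """Slope of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data clustered at low x values.
xs = [1, 2, 3, 4, 5, 6]
ys = [10, 6, 9, 3, 6, 2]

# The same data plus one hypothetical point far out on the X axis.
xs_influential = xs + [24]
ys_influential = ys + [0]

slope_without = slope(xs, ys)
slope_with = slope(xs_influential, ys_influential)
print(f"slope without the extreme point: {slope_without:.2f}")
print(f"slope with the extreme point:    {slope_with:.2f}")
```

In this made-up example, one point far from the rest of the x values is enough to make the fitted line much flatter.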

Test Your Understanding of This Lesson

In the context of regression analysis, which of the following statements are true?

I. When the sum of the residuals is greater than zero, the model is nonlinear.
II. Outliers reduce the coefficient of determination.
III. Influential points reduce the correlation coefficient.

(A) I only
(B) II only
(C) III only
(D) I and II only
(E) I, II, and III

Solution

The correct answer is (B). Outliers reduce the ability of a regression model to fit the data, and thus reduce the coefficient of determination. Statement I is false because the residuals from a least-squares regression line always sum to zero, whether or not the underlying relationship is linear. Statement III is false because influential points often increase the correlation coefficient.