AP Statistics: Residuals, Outliers, and Influential Points
A linear regression model is not always appropriate for the data.
You can assess the appropriateness of the model by examining
residuals, outliers, and influential points.
Residuals
The difference between the observed value of the dependent variable
(y) and the predicted value (ŷ) is called the
residual (e). Each data point has one
residual.
Residual = Observed value - Predicted value
e = y - ŷ
For a least-squares regression line, both the sum and the mean of the
residuals are equal to zero. That is,
Σ e = 0 and ē = 0.
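The zero-sum property can be checked numerically. The following is a minimal sketch, assuming a small hypothetical data set and a hand-written helper named fit_line (neither comes from the lesson); it fits a least-squares line and then sums the residuals.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)

# Residual = observed value - predicted value (e = y - y_hat)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

residual_sum = sum(residuals)
residual_mean = residual_sum / len(residuals)

print("sum of residuals: ", residual_sum)
print("mean of residuals:", residual_mean)
```

Both printed values should be zero up to floating-point rounding error.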
Residual Plots
A residual plot is a graph that shows the
residuals on the vertical axis and the independent variable
on the horizontal axis. If the points in a residual plot
are randomly dispersed
around the horizontal axis, a linear regression model is
appropriate for the data; otherwise, a non-linear model is more
appropriate.
The table below presents results from a hypothetical regression
analysis, and the accompanying chart displays those
results as a residual plot. In the chart, the independent variable (x)
is math aptitude.
The residual plot shows a non-random pattern: negative residuals
on the low end of the X axis and positive residuals on the high
end. This suggests that a non-linear model may provide a
better fit to the data. Alternatively, it may be possible to transform
the data so that a linear model can be used. We discuss
linear transformations in the
next lesson.
x | 95 | 85 | 80 | 70 | 60
y | 85 | 95 | 70 | 65 | 70
ŷ | 74.05 | 90.49 | 68.71 | 70.159 | 81.59
e | 10.95 | 4.51 | 1.29 | -5.159 | -11.59

[Residual plot: residuals (e) on the vertical axis, math aptitude (x) on the horizontal axis]
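The residuals in the table can be recomputed directly from e = y - ŷ. The sketch below copies the x, y, and ŷ values from the table; the variable names and the rounding to three decimals are illustrative choices, not part of the lesson.

```python
# Values copied from the table in this lesson.
x = [95, 85, 80, 70, 60]
y = [85, 95, 70, 65, 70]
y_hat = [74.05, 90.49, 68.71, 70.159, 81.59]

# Residual = observed value - predicted value, rounded to the table's precision.
e = [round(obs - pred, 3) for obs, pred in zip(y, y_hat)]

# List the residuals from the low end of the X axis to the high end.
for xi, ei in sorted(zip(x, e)):
    print(f"x = {xi}: e = {ei}")

# The residuals should sum to (approximately) zero.
total = round(sum(e), 3)
print("sum of residuals:", total)
```

The recomputed residuals match the e row of the table, running from negative at low x to positive at high x. Their sum comes out to 0.001 rather than exactly zero, presumably because the tabulated values are rounded.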
Below, the residual plots show three typical patterns. The
first plot shows a random pattern, indicating a good
fit for a linear model. The other plot patterns are
non-random (U-shaped and inverted U), suggesting a better fit
for a non-linear model.
[Residual plots, left to right: Random pattern | Non-random: U-shaped curve | Non-random: Inverted U]
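To see how a U-shaped residual pattern can arise, the sketch below fits a least-squares line to data that actually follow a curve. The data (y = (x - 3)^2) and the helper fit_line are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical curved data: y = (x - 3)^2
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [(x - 3) ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print(residuals)  # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

The residuals are positive at both ends of the X axis and negative in the middle, which is the U-shaped pattern. They still sum to zero, so a zero sum alone says nothing about whether a linear model is appropriate.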
Outliers
Data points that diverge from the overall pattern and have
large residuals are called
outliers.
Outliers weaken the fit of the regression
equation to the data. This is illustrated in the comparison
below. The
coefficient of determination is larger when the outlier
is not present (0.94 versus 0.55).
Without Outlier | With Outlier
Regression equation: ŷ = 104.78 - 4.10x | Regression equation: ŷ = 97.51 - 3.32x
Coefficient of determination: R² = 0.94 | Coefficient of determination: R² = 0.55
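The effect of an outlier on the coefficient of determination can be reproduced on a small scale. The sketch below uses hypothetical data (not the data behind the lesson's scatterplots) and a hand-written helper r_squared; it adds one point far above an otherwise tight downward trend and compares R² before and after.

```python
def r_squared(xs, ys):
    """Coefficient of determination for the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    # For simple linear regression, R^2 is the squared correlation coefficient.
    return sxy ** 2 / (sxx * syy)


# Hypothetical data with a strong negative linear trend.
xs = [1, 2, 3, 4, 5, 6]
ys = [20, 18, 15, 13, 10, 8]

# One hypothetical extra point that sits far above the overall pattern.
outlier_x, outlier_y = 3, 30

r2_without = r_squared(xs, ys)
r2_with = r_squared(xs + [outlier_x], ys + [outlier_y])

print(f"R^2 without outlier: {r2_without:.2f}")
print(f"R^2 with outlier:    {r2_with:.2f}")
```

With these made-up values, R² drops sharply once the outlier is included, which mirrors the direction of the change reported in the comparison, though not its exact size.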
Influential Points
Influential points are data points with extreme values that greatly
affect the
slope
of the regression line.
The charts below compare
regression statistics for a data set with and without an
influential point. The chart on the right has a single influential
point, located at the high end of the X axis (where x = 24).
As a result of that single influential point, the slope of the
regression line changes markedly, from -2.5 to -1.6, so the line
becomes noticeably flatter.
Note that this influential point, unlike the outliers discussed
above, did not reduce the coefficient of determination. In fact,
the coefficient of determination was larger when the influential
point was present (0.52 versus 0.46).
Without Influential Point | With Influential Point
Regression equation: ŷ = 92.54 - 2.5x | Regression equation: ŷ = 87.59 - 1.6x
Slope: b1 = -2.5 | Slope: b1 = -1.6
Coefficient of determination: R² = 0.46 | Coefficient of determination: R² = 0.52
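A single point with an extreme x value can pull the slope of the fitted line substantially. The sketch below is an illustration on hypothetical data: the six base points, the added point at x = 24, and the helper slope are all assumptions and do not reproduce the data behind the charts.

```python
def slope(xs, ys):
    """Slope of the least-squares regression line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data with a steep negative trend.
xs = [1, 2, 3, 4, 5, 6]
ys = [20, 18, 15, 13, 10, 8]

# One hypothetical point far out on the high end of the X axis.
far_x, far_y = 24, 12

slope_without = slope(xs, ys)
slope_with = slope(xs + [far_x], ys + [far_y])

print(f"slope without influential point: {slope_without:.2f}")
print(f"slope with influential point:    {slope_with:.2f}")
```

Because the added point lies so far from the other x values, it has a large effect on the fitted slope, and the line becomes much flatter when it is included.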
Test Your Understanding of This Lesson
In the context of regression analysis,
which of the following statements are true?
I. When the sum of the residuals is greater than zero, the model is
nonlinear.
II. Outliers reduce the coefficient of determination.
III. Influential points reduce the correlation coefficient.
(A) I only
(B) II only
(C) III only
(D) I and II only
(E) I, II, and III
Solution
The correct answer is (B).
Outliers reduce the ability of a regression model to fit the
data, and thus reduce the
coefficient of determination, so statement II is true.
For a least-squares regression line, the sum of the residuals is always
zero, whether the underlying relationship is linear or nonlinear, so
statement I is false. And
influential points often strengthen the
correlation rather than reduce it, so statement III is false.