AP Statistics: Transformations to Achieve Linearity
When a residual plot reveals that a data set is nonlinear, it is often
possible to "transform" the raw data to make it linear. This allows us to
use linear regression techniques appropriately with nonlinear data.
What is a Transformation to Achieve Linearity?
Transforming a variable involves using a mathematical operation to
change its measurement scale. Broadly speaking, there are two
kinds of transformations.
- Linear transformation. A linear transformation preserves linear
relationships between variables. Therefore, the correlation between
x and y would be unchanged after a linear transformation. Examples of a
linear transformation to variable x would be multiplying x by a constant,
dividing x by a constant, or adding a constant to x.
- Nonlinear transformation. A nonlinear transformation changes
(increases or decreases) linear relationships between variables and,
thus, changes the correlation between variables. Examples of a nonlinear
transformation of variable x would be taking the square root of x or
the reciprocal of x.
In regression, a transformation to achieve linearity is a
special kind of nonlinear transformation. It is a nonlinear
transformation that increases the linear
relationship between two variables.
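The distinction can be checked numerically. The sketch below uses a small hypothetical data set (y equal to the square of x) and a hand-written `pearson_r` helper; neither comes from the lesson. It assumes a positive multiplier for the linear transformation, since multiplying by a negative constant would flip the sign of the correlation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y grows as the square of x, so the raw relationship is curved.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 2 for v in x]

r_raw = pearson_r(x, y)
# Linear transformation of x (multiply by 3, add 7): correlation is unchanged.
r_linear = pearson_r([3 * v + 7 for v in x], y)
# Nonlinear transformation of y (square root): correlation changes.
r_sqrt_y = pearson_r(x, [math.sqrt(v) for v in y])

print(r_raw, r_linear, r_sqrt_y)
```

Here the square root happens to straighten the relationship completely because the data were constructed that way; with real data the improvement is usually partial.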
Methods of Transforming Variables to Achieve Linearity
There are many ways to transform variables to achieve linearity
for regression analysis. Some common methods are summarized below.
Method | Transformation(s) | Regression equation | Predicted value (ŷ)
Standard linear regression | None | y = b0 + b1x | ŷ = b0 + b1x
Exponential model | Dependent variable = log(y) | log(y) = b0 + b1x | ŷ = 10^(b0 + b1x)
Quadratic model | Dependent variable = sqrt(y) | sqrt(y) = b0 + b1x | ŷ = (b0 + b1x)^2
Reciprocal model | Dependent variable = 1/y | 1/y = b0 + b1x | ŷ = 1 / (b0 + b1x)
Logarithmic model | Independent variable = log(x) | y = b0 + b1log(x) | ŷ = b0 + b1log(x)
Power model | Dependent variable = log(y); Independent variable = log(x) | log(y) = b0 + b1log(x) | ŷ = 10^(b0 + b1log(x))
Each row shows a different nonlinear transformation method. The
second column shows the specific transformation applied to
dependent and/or independent variables. The third column shows
the regression equation used in the analysis. And the last
column shows the "back transformation" equation used to
restore the dependent variable to its original, non-transformed
measurement scale.
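As a rough illustration of the back transformations, the sketch below computes a predicted value on the original scale of y for each method in the table. The function name `predict` and the model labels are hypothetical, and base-10 logarithms are assumed, which matches the powers of 10 in the last column.

```python
import math

def predict(model, b0, b1, x):
    """Return y-hat on the original scale of y, given fitted b0 and b1."""
    # Logarithmic and power models regress on log10(x) instead of x.
    uses_log_x = model in ("logarithmic", "power")
    fitted = b0 + b1 * (math.log10(x) if uses_log_x else x)

    if model in ("linear", "logarithmic"):
        return fitted          # y was not transformed, nothing to undo
    if model in ("exponential", "power"):
        return 10 ** fitted    # undo log10(y)
    if model == "quadratic":
        return fitted ** 2     # undo sqrt(y)
    if model == "reciprocal":
        return 1 / fitted      # undo 1/y
    raise ValueError(f"unknown model: {model}")

# Example: a quadratic model with b0 = 1 and b1 = 2, evaluated at x = 3.
print(predict("quadratic", 1, 2, 3))
```

The coefficients passed in are the ones estimated from the transformed data; only the final step returns the prediction to the original units.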
In practice, these methods need to be tested on the data to which they
are applied to be sure that they increase rather than decrease the
linearity of the relationship. Testing the effect of a transformation
method involves looking at residual plots and correlation coefficients,
as described in the following sections.
Note: The exponential model, the logarithmic model, and the power model
require the ability to work with logarithms. Use a graphing calculator
to obtain the log of a number or to transform back from the logarithm
to the original number. If you need it, the Stat Trek glossary has a brief
refresher on logarithms.
How to Perform a Transformation to Achieve Linearity
Transforming a data set to achieve linearity is a multi-step,
trial-and-error process.
1. Choose a transformation method (see the table above).
2. Transform the independent variable, dependent variable, or both.
3. Plot the independent variable against the dependent variable, using
   the transformed data.
   - If the scatterplot is linear, proceed to the next step.
   - If the plot is not linear, return to Step 1 and try a different
     approach. Choose a different transformation method and/or transform
     a different variable.
4. Conduct a regression analysis, using the transformed variables.
5. Create a residual plot, based on regression results.
   - If the residual plot shows a random pattern, the transformation
     was successful. Congratulations!
   - If the plot pattern is not random, return to Step 1 and try a
     different approach.
The best transformation method (exponential model, quadratic model,
reciprocal model, etc.) will depend on the nature of the original data.
The only way to determine which method is best is to try each and
compare the results (i.e., residual plots, correlation coefficients).
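The trial-and-error comparison can be sketched in a few lines. The example below is a minimal illustration, not a prescribed procedure: the data are hypothetical (generated from y = 3 · 2^x), the `r_squared` helper is hand-written, and only transformations of the dependent variable are tried. It ranks the candidates by the coefficient of determination on the transformed scale; a residual plot for the winner should still be inspected.

```python
import math

def r_squared(xs, ys):
    """Coefficient of determination for a simple linear regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Hypothetical data that follow an exponential curve exactly.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3 * 2 ** v for v in x]

# Candidate transformations of the dependent variable.
candidates = {
    "standard linear regression": lambda v: v,
    "exponential model": math.log10,
    "quadratic model": math.sqrt,
    "reciprocal model": lambda v: 1 / v,
}

# Fit each candidate and record how linear the transformed relationship is.
scores = {name: r_squared(x, [f(v) for v in y]) for name, f in candidates.items()}
best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: r^2 = {score:.4f}")
print("best:", best)
```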
A Transformation Example
The table below shows data for the independent and dependent variables,
x and y, respectively. When we fit a linear regression to the raw data,
the residual plot shows a non-random pattern (a U-shaped curve), which
suggests that the relationship is nonlinear.
x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
y | 2 | 1 | 6 | 14 | 15 | 30 | 40 | 74 | 75
Suppose we repeat the analysis, using a quadratic model to transform
the dependent variable. For a quadratic model, we use the square
root of y, rather than y, as the dependent variable.
The table below shows the data we analyzed.
x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
sqrt(y) | 1.41 | 1.00 | 2.45 | 3.74 | 3.87 | 5.48 | 6.32 | 8.60 | 8.66
The residual plot for the transformed data suggests that the transformation
to achieve linearity was successful. The pattern of residuals is
random, suggesting that the relationship between the independent
variable (x) and the transformed dependent variable (square root of y)
is linear. And the coefficient
of determination was 0.96 with the transformed data versus only
0.88 with the raw data. The transformed data resulted in a better
model.
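The figures in this example can be reproduced with a short script. The sketch below uses the x and y values from the lesson; the `fit_line` helper is a hand-written least-squares fit introduced here for illustration. It also prints the signs of the raw-data residuals, which show the U-shaped pattern (positive at both ends, negative in the middle).

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of ys on xs; returns (b0, b1, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1, sxy ** 2 / (sxx * syy)

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 1, 6, 14, 15, 30, 40, 74, 75]

# Regression on the raw data, with residuals (observed minus predicted).
b0_raw, b1_raw, r2_raw = fit_line(x, y)
raw_residuals = [yi - (b0_raw + b1_raw * xi) for xi, yi in zip(x, y)]
raw_signs = "".join("+" if r > 0 else "-" for r in raw_residuals)

# Quadratic model: regress sqrt(y) on x.
b0, b1, r2_sqrt = fit_line(x, [math.sqrt(v) for v in y])

# Back-transform predictions to the original scale of y.
y_hat = [(b0 + b1 * xi) ** 2 for xi in x]

print(round(r2_raw, 2), round(r2_sqrt, 2))  # 0.88 0.96
print(raw_signs)                            # ++-----++
```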
Test Your Understanding of This Lesson
Problem
In the context of regression analysis, which of the following
statements is true?
I. A linear transformation increases the linear relationship
between variables.
II. A logarithmic model is the most effective transformation method.
III. A residual plot reveals departures from linearity.
(A) I only
(B) II only
(C) III only
(D) I and II only
(E) I, II, and III
Solution
The correct answer is (C). A linear transformation neither increases nor
decreases the linear relationship between variables; it preserves the
relationship. A nonlinear transformation is used to
increase the linear relationship between variables.
The most effective transformation method depends on the data
being transformed. In some cases, a logarithmic model may be more
effective than other methods; but in other cases it may be less
effective.
Non-random patterns in a residual plot suggest a departure from
linearity in the data being plotted.