Overview
The multiple regression equation takes the form y = b1x1 + b2x2 + ... + bnxn + c. The b's are the regression coefficients, each representing the amount the dependent variable y changes when the corresponding independent variable changes one unit. The c is the constant, the point where the regression line intercepts the y axis, representing the value the dependent y takes when all the independent variables are 0. The standardized versions of the b coefficients are the beta weights, and the ratio of the beta coefficients is the ratio of the relative predictive power of the independent variables. Associated with multiple regression is R2, the squared multiple correlation, which is the percent of variance in the dependent variable explained collectively by all of the independent variables. Multiple regression shares all the assumptions of correlation: linearity of relationships, the same level of relationship throughout the range of the independent variable ("homoscedasticity"), interval or near-interval data, absence of outliers, and data whose range is not truncated. In addition, it is important that the model being tested is correctly specified: the exclusion of important causal variables or the inclusion of extraneous variables can change the beta weights markedly and hence the interpretation of the importance of the independent variables. A variety of alternatives related to OLS regression are treated in separate sections.
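As a minimal sketch of what fitting this equation involves, the b coefficients and constant c can be obtained by solving the normal equations. The data below are hypothetical, generated exactly from y = 2x1 + 3x2 + 1 so the fitted coefficients are known in advance; the function names are my own.

```python
def solve(A, v):
    """Solve the linear system A x = v by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(predictors, y):
    """Return [b1, ..., bn, c] for y = b1*x1 + ... + bn*xn + c via the normal equations."""
    rows = [list(vals) + [1.0] for vals in zip(*predictors)]  # design matrix with constant column
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

# Hypothetical data generated exactly as y = 2*x1 + 3*x2 + 1
x1 = [0, 1, 2, 0, 1]
x2 = [0, 0, 1, 2, 2]
y = [2 * a + 3 * b + 1 for a, b in zip(x1, x2)]
b1, b2, c = ols([x1, x2], y)   # recovers b1 ≈ 2, b2 ≈ 3, c ≈ 1
```

Statistical packages such as SPSS perform this same least-squares fit, along with the standard errors and tests discussed below.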

Note: t-tests are not used for dummy variables, even though SPSS and other statistical packages output them -- see the Frequently Asked Questions section below. Note also that the t-test tests only the unique variance an independent variable accounts for, not the shared variance it may also explain: shared variance, while incorporated in R2, is not reflected in the b coefficient.
One- vs. two-tailed t-tests. Note also that t-tests in SPSS and SAS are two-tailed, meaning they test the hypothesis that the b coefficient is either significantly higher or significantly lower than zero. If our model is such that we can rule out one direction (ex., negative coefficients) and thus should test only whether the b coefficient is greater than zero, we want a one-tailed test. The two-tailed significance level will be twice the one-tailed probability level: if SPSS reports .10 for a two-tailed test, for instance, then the one-tailed equivalent significance level is .05.
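The conversion above can be sketched in a few lines. This is a hypothetical helper of my own naming; note that halving applies only when the coefficient's observed sign agrees with the hypothesized direction.

```python
def one_tailed_p(p_two_tailed, b, expected_sign=1):
    """Convert a two-tailed p-value to one-tailed, given a directional hypothesis."""
    if (b > 0) == (expected_sign > 0):
        return p_two_tailed / 2      # observed direction agrees with the hypothesis
    return 1 - p_two_tailed / 2      # observed direction contradicts the hypothesis

p = one_tailed_p(0.10, b=1.4)        # 0.05, as in the example in the text
```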
For large samples, SEE approximates the standard error of a predicted value. SEE is the standard deviation of the residuals. In a good model, SEE will be markedly less than the standard deviation of the dependent variable. In a good model, the mean of the dependent variable will be greater than 1.96 times SEE.
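The computation behind SEE can be sketched as follows; function names are my own, and the rule of thumb is simply that SEE should be markedly smaller than the standard deviation of the dependent variable.

```python
import math

def see(residuals, k):
    """Standard error of estimate: sqrt(SSE / (n - k - 1)), where k = number of predictors."""
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    return math.sqrt(sse / (n - k - 1))

def sample_sd(values):
    """Sample standard deviation, for comparison against SEE."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

# In a good model, see(residuals, k) is markedly smaller than sample_sd(y).
```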

In SPSS, the F test appears in the ANOVA table, shown above for the example of number of auto accidents predicted from gender and age. Note that the F test is too lenient for the stepwise method of estimating regression coefficients and an adjustment to F is recommended (see Tabachnick and Fidell, 2001: 143 and Table C.5). In SPSS, select Analyze, Regression, Linear; click Statistics; make sure Model fit is checked to get the ANOVA table and the F test. Here the model is significant at the .001 level, which is the same as shown in the Model Summary table.
Generally, the beta method and the model comparison method will show the same IVs to be most important, but it easily can happen that an IV will have a beta approaching zero but still have an appreciable effect on R2 when it is dropped from the model, because its joint effects are appreciable even if its unique effect is not. The beta method is related to partial correlation, which is relative to the variability of the dependent variable after partialling out from the dependent variable the common variance associated with the control IVs (the IVs other than the one being considered in partial correlation). The model comparison method is related to part correlation, which is relative to the total variability of the dependent variable.

Mathematically, R2 = (1 - (SSE/SST)), where SSE = error sum of squares = SUM((Yi - EstYi)squared), where Yi is the actual value of Y for the ith case and EstYi is the regression prediction for the ith case; and where SST = total sum of squares = SUM((Yi - MeanY)squared). Sums of squares are shown in the ANOVA table in SPSS output, where the example computes R2 = (1-(1140.1/1172.358)) = 0.028. The "residual sum of squares" in SPSS output is SSE and reflects regression error. Thus R-square is 1 minus regression error as a percent of total error and will be 0 when regression error is as large as it would be if you simply guessed the mean for all cases of Y. Put another way, the regression sum of squares/total sum of squares = R-square, where the regression sum of squares = total sum of squares - residual sum of squares. In SPSS, Analyze, Regression, Linear; click the Statistics button; make sure Model fit is checked to get R2.
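The arithmetic above can be verified directly. The function is a generic sketch; the second computation simply reproduces the sums of squares quoted from the SPSS ANOVA table.

```python
def r_squared(y, y_hat):
    """R2 = 1 - SSE/SST, computed from actual and predicted values."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

# Reproducing the figures quoted from the SPSS ANOVA table:
r2 = 1 - 1140.1 / 1172.358   # ≈ 0.0275, reported as 0.028
```

Note that predicting the mean for every case gives R2 = 0, and perfect prediction gives R2 = 1, as the text states.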

The Model Summary table in SPSS output, shown above, gives R, R2, adjusted R2, the standard error of estimate (SEE), R2 and F change and the corresponding significance level, and the Durbin-Watson statistic. In the example above, number of accidents is predicted from age and gender. This output shows age and gender together explain only 2.4% of the variance in number of accidents for this sample. R2 is close to adjusted R2 because there are only two independent variables (adjusted R2 is discussed below). R2 change is the same as R2 because the variables were entered at the same time (not stepwise or in blocks), so there is only one regression model to report, and R2 change is the change from the intercept-only model, which is also what R2 is. R2 change is discussed below. Since there is only one model, "Sig F Change" is the overall significance of the model, which for one model is also the significance of adding sex and age to the model in addition to the intercept. The Durbin-Watson statistic is a test to see if the assumption of independent observations is met, which is the same as testing to see if autocorrelation is present. As a rule of thumb, a Durbin-Watson statistic in the range of 1.5 to 2.5 means the researcher may reject the notion that data are autocorrelated (serially dependent) and instead may assume independence of observations, as is the case here. The Durbin-Watson test is discussed below.
Below, SPSS prints out R2 change and its corresponding significance level. For Block/Model 1, R2 change is .05 and is significant at better than the .001 level. This is the change from the intercept-only model to the model with years of education as a predictor. Because there is only one independent (number of grandparents born abroad) added in Block 2, the significance of the R2 change is the same as the significance of the b coefficient for that variable, namely a non-significant .397 in this example. While for this simple example use of the R2 change method is not necessary, for more complex tests, such as testing sets of independents or testing dummy variables as discussed below, the R2 change method is standard.
F-incremental = [(R2with - R2without)/m] / [(1 - R2with)/df]
where m = number of IVs in the new block which is added; and df = N - k - 1 (where N is sample size and k is the number of independent variables in the "with" model). F is read with m and df degrees of freedom to obtain a p (probability) value. Note the without model is nested within the with model. In SPSS, Analyze, Regression, Linear; click the Statistics button; make sure R squared change is checked to get "Sig F Change".
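The formula above translates directly into code. The values plugged in below are hypothetical, chosen only to illustrate the arithmetic.

```python
def f_incremental(r2_with, r2_without, m, n, k):
    """F for the R2 change when a block of m predictors is added.
    n = sample size; k = number of predictors in the larger ('with') model."""
    df = n - k - 1
    return ((r2_with - r2_without) / m) / ((1 - r2_with) / df)

# Hypothetical: adding one predictor raises R2 from .40 to .50 with n = 103, k = 2
F = f_incremental(r2_with=0.5, r2_without=0.4, m=1, n=103, k=2)   # F ≈ 20, read with 1 and 100 df
```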
The beta weights for the equation in the final step of stepwise regression do not partition R2 into increments associated with each independent because beta weights are affected by which variables are in the equation. The beta weights estimate the relative predictive power of each independent, controlling for all other independent variables in the equation for a given model. The R2 increments estimate the predictive power an independent brings to the analysis when it is added to the regression model, as compared to a model without that variable. Beta weights compare independents in one model, whereas R2 increments compare independents in two or more models.
This means that assessing a variable's importance using R2 increments is very different from assessing its importance using beta weights. The magnitude of a variable's beta weight reflects its relative explanatory importance controlling for other independents in the equation. The magnitude of a variable's R2 increment reflects its additional explanatory importance given that common variance it shares with other independents entered in earlier steps has been absorbed by these variables. For causal assessments, beta weights are better (though see the discussion of corresponding regressions for causal analysis). For purposes of sheer prediction, R2 increments are better.
The removal of outliers from the data set under analysis can at times dramatically affect the performance of a regression model. Outliers should be removed if there is reason to believe that other variables not in the model explain why the outlier cases are unusual -- that is, outliers may well be cases which need a separate model. Alternatively, outliers may suggest that additional explanatory variables need to be brought into the model (that is, the model needs respecification). Another alternative is to use robust regression, whose algorithm gives less weight to outliers but does not discard them.

In the figure above, for the example of predicting auto accidents from sex and age, the Casewise Diagnostics table shows two outliers: cases 166 and 244.
There will be a t value for each studentized deleted residual, with df = n - k - 1, where k is the number of independent variables. When t exceeds the critical value for a given alpha level (ex., .05) then the case is considered an outlier. In a plot of deleted studentized residuals versus ordinary residuals, one may draw lines at plus and minus two standard units to highlight cases outside the range where 95% of the cases normally lie. Points substantially off the straight line are potential leverage problems.


In the partial regression plot above, for the example of sex and age predicting car accidents, sex is being used first to predict accidents, then to predict age. Since sex does not predict age at all and predicts only a very small percentage of accidents, the pattern of residuals in the partial regression plot forms a random cloud. This case, where the dots do not form a line, does not indicate lack of linearity of age with accidents but rather correlation approaching zero.
When tolerance is close to 0 there is high multicollinearity of that variable with other independents and the b and beta coefficients will be unstable. The more the multicollinearity, the lower the tolerance and the larger the standard error of the regression coefficients. Tolerance is part of the denominator in the formula for calculating the confidence limits on the b (partial regression) coefficient.
| Rj | Tolerance | VIF | Impact on SEb |
|---|---|---|---|
| 0 | 1 | 1 | 1.0 |
| .4 | .84 | 1.19 | 1.09 |
| .6 | .64 | 1.56 | 1.25 |
| .75 | .44 | 2.25 | 1.5 |
| .8 | .36 | 2.78 | 1.67 |
| .87 | .25 | 4.0 | 2.0 |
| .9 | .19 | 5.26 | 2.29 |
Standard error is doubled when VIF is 4.0 and tolerance is .25, corresponding to Rj = .87. Therefore VIF >= 4 is an arbitrary but common cut-off criterion for deciding when a given independent variable displays "too much" multicollinearity: values above 4 suggest a multicollinearity problem. Some researchers use the more lenient cutoff of 5.0 or even 10.0 to signal when multicollinearity is a problem. The researcher may wish to drop the variable with the highest VIF if multicollinearity is indicated and theory warrants.
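The table above follows from three simple relationships: tolerance = 1 - Rj², VIF = 1/tolerance, and the standard-error inflation factor = sqrt(VIF). A sketch (function name my own):

```python
import math

def collinearity_stats(rj):
    """Tolerance, VIF, and SE inflation factor for a predictor whose multiple
    correlation with the other predictors is rj."""
    tolerance = 1 - rj ** 2
    vif = 1 / tolerance
    return tolerance, vif, math.sqrt(vif)

# Reproduces the rows of the table above, e.g. Rj = .6:
tol, vif, se_mult = collinearity_stats(0.6)   # ≈ (0.64, 1.56, 1.25)
```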

The figure above is output for the example of predicting accidents from gender and age. This further confirms that this example has no collinearity problem since no condition index approaches 30, making it unnecessary to examine variance proportions.
Panel data regression models may be in one of three types: fixed, between, or random effects.
Note: Adding variables to the model will always improve R2 at least a little for the current data, but it risks misspecification and does not necessarily improve R2 for other datasets examined later on. That is, it can overfit the regression model to noise in the current dataset and actually reduce the reliability of the model.
Sometimes specification is phrased as the assumption that "independent variables are measured without error." Error attributable to omitting causally important variables means that, to the extent that these unmeasured variables are correlated with the measured variables which are in the model, the b coefficients will be biased. If the correlation is positive, the b coefficients will be too high; if negative, too low. That is, when a causally important variable is added to the model, the b coefficients will all change, assuming that variable is correlated with existing measured variables in the model (usually the case).
The population error term, which is the difference between the actual values of the dependent and those estimated by the population regression equation, should be uncorrelated with each of the independent variables. Since the population regression line is not known for sample data, the assumption must be assessed by theory. One common type of correlated error occurs due to selection bias with regard to membership in the independent variable "group" (representing membership in a treatment vs. a comparison group): measured factors such as gender, race, education, etc., may cause differential selection into the two groups and also can be correlated with the dependent variable.
In practical terms, the researcher must be confident that (1) the variables not included in the equation are indeed not causes of Y and are not correlated with the variables which are included; and (2) that the dependent is not also a cause of one or more of the independents. Either circumstance would violate the assumption of uncorrelated error. When there is correlated error, conventional computation of standard deviations, t-tests, and significance are biased and yield invalid results. Coefficients may be over- or under-estimated.
Note that residual error -- the difference between observed values and those estimated by the sample regression equation -- will always be uncorrelated with the independents and therefore the lack of correlation of the residuals with the independents is not a valid test of this assumption.
That is, the problem with multicollinearity is that it increases the standard errors of the b coefficients and corresponding beta weights. Beta weights are used to assess the importance of the predictors. If the beta weights have larger standard errors, it is harder to distinguish among them. This poses the greatest problem when referring to multicollinearity among the independents, and most multicollinearity warnings in textbooks are referring to this type of multicollinearity. If there are two sets of variables on the predictor side of the equation, (1) independents of interest and (2) control variables, and if there is no multicollinearity within set (1) but there is multicollinearity between the two sets, one can still use beta weights to distinguish within set (1) but the control variables may reduce the effect sizes of the set (1) variables (except in the case of suppression, when the effect size may actually increase).
In regression, as a rule of thumb, nonlinearity is generally not a problem when the standard deviation of the dependent is more than the standard deviation of the residuals. Linearity is further discussed in the section on data assumptions. Note also that regression smoothing techniques and nonparametric regression exist to fit smoothed curves in a nonlinear manner.

An alternative for the same purpose is the normal probability plot, with the observed cumulative probabilities of occurrence of the standardized residuals on the Y axis and of expected normal probabilities of occurrence on the X axis, such that a 45-degree line will appear when observed conforms to normally expected. For the same example, the P-P plot below shows the same moderate departure from normality.

The F test is relatively robust in the face of small to medium violations of the normality assumption. By the central limit theorem, even when error is not normally distributed, when sample size is large the sampling distribution of the b coefficient will still be approximately normal. Therefore violations of this assumption usually have little or no impact on substantive conclusions for large samples, but when sample size is small, tests of normality are important. That is, for small samples where error is not normally distributed, t-tests of regression coefficients are not accurate.
Histograms and P-P plots may be selected under the Plots button in the SPSS regression dialog. Alternatively, in SPSS, select Graphs, Histogram; specify sre_1 as the variable (this is the studentized residual, previously saved with the Save button in the regression dialog). One can also test the residuals for normality using a Q-Q plot: in SPSS, select Graphs, Q-Q; specify the studentized residual (sre_1) in the Variables list; click OK. Dots should approximate a 45 degree line when residuals are normally distributed.
Note that regression is relatively robust regarding homoscedasticity and small to moderate violations of homoscedasticity have only minor impact on regression estimates (Fox, 2005: 516). When homoscedasticity is a problem, weighted least squares regression and quantile regression are commonly recommended. WLS regression causes cases with smaller residuals to be weighted more in calculating the b coefficients, diminishing the impact of heteroscedasticity. Square root, log, and reciprocal transformations of the dependent may also reduce or eliminate lack of homoscedasticity but should only be undertaken if such transforms are sensibly interpretable and do not upset the relationship of the dependent to the independents. Finally, robust estimation of standard errors is an increasingly popular option for dealing with heteroscedasticity as such estimators are less sensitive to error variance misspecification. Robust estimators are available in SPSS when regression (or other models) is accomplished through the generalized linear models module, discussed in a separate section of Statnotes.
When there is lack of homoscedasticity, the regression model will work better for some ranges of the dependent variable than for others (ex., better for high values than for low values). If homoscedasticity is violated, separate models may be required for the different ranges. Also, when the homoscedasticity assumption is violated "conventionally computed confidence intervals and conventional t-tests for OLS estimators can no longer be justified" (Berry, 1993: 81). While lack of homoscedasticity does not bias the b coefficient estimates, computed standard errors will be too high if error variance is positively correlated with the independents and too low if negatively correlated.
Nonconstant error variance can indicate the need to respecify the model to include omitted independent variables. Nonconstant error variance can be observed by requesting simple residual plots, as in the illustrations below, where "Training" as independent is used to predict "Score" as dependent:
As illustrated in the admittedly extreme case comparison below, in which Training causes Score, altering two independent variable values to be outliers can change a strong R-square to a null one. The influence of these two outliers, cases 2 and 3 in the illustration below, is measured by dfFit and dfBeta, discussed below.
Outliers may be flagged by use of residuals, influence statistics, and distance measures.
A rule of thumb is that cases with leverage under .2 are not a problem, but if a case has leverage over .5, the case has undue leverage and should be examined for the possibility of measurement error or the need to model such cases separately. The minimum, maximum, and mean leverage are displayed by SPSS in the "Residuals Statistics" table when "Casewise diagnostics" is checked under the Statistics button in the Regression dialog. Also, select Analyze, Regression, Linear; click Save; check Leverage to add these values to your data set as an additional column.
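For the one-predictor case, the leverage (hat value) of each case has a closed form, and the rule of thumb above can be applied directly. The data and function name below are hypothetical.

```python
def leverage(x):
    """Hat values for a single-predictor regression:
    h_i = 1/n + (x_i - mean)**2 / sum((x_j - mean)**2)."""
    n = len(x)
    mean = sum(x) / n
    ssx = sum((v - mean) ** 2 for v in x)
    return [1 / n + (v - mean) ** 2 / ssx for v in x]

h = leverage([1, 2, 3, 4, 100])
flagged = [i for i, hv in enumerate(h) if hv > 0.5]   # the case with x = 100 has undue leverage
```

A useful check: the hat values always sum to the number of estimated parameters (here 2, for slope and intercept).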
Graphical method with data labels: Influential cases with high leverage can be spotted graphically. Save lev_1 in the SPSS Save procedure above, then select Graphs, Scatter/Dot; select Simple Scatter; click Define; make lev_1 the Y axis and caseid the X axis; be sure to make an appropriate variable (like Name) the "Label cases by" variable; OK. Then double-click on the plot to bring up the Chart Editor; select Elements, Data Label Mode; click on cases high on the Y axis.
As a rule of thumb, the maximum Mahalanobis distance should not exceed the critical chi-squared value with degrees of freedom equal to number of predictors and alpha =.001, or else outliers may be a problem in the data. The minimum, maximum, and mean Mahalanobis distances are displayed by SPSS in the "Residuals Statistics" table when "Casewise diagnostics" is checked under the Statistics button in the Regression dialog.
Fox (1991: 34) suggests as a cut-off for detecting influential cases, values of D greater than 4/(N - k - 1), where N is sample size and k is the number of independents. Others suggest D > 1 as the criterion to constitute a strong indication of an outlier problem, with D > 4/n the criterion to indicate a possible problem. In SPSS, the minimum, maximum, and mean Cook's D are displayed in the "Residuals Statistics" table when "Casewise diagnostics" is checked under the Statistics button in the Regression dialog. Also, select Analyze, Regression, Linear; click Save; check Cook's to add these values to your data set as an additional column.
One can also spot outliers graphically using Cook's distance, which highlights very (unduly) influential cases. In SPSS, save Cook's distance (coo_1) using the Save button in the Regression dialog. Then select Graphs, Scatter/Dot, Simple Scatter; click Define; let coo_1 be the Y axis and case number be the X axis; click OK. If the graph shows any points far off the line, you can label them by case number. Double-click in the chart to bring up the Chart Editor, then select Elements, Data Label Mode, then click on the outlying dot(s) to make the label(s) appear.
Alternatively, the computed d value can be compared to tabled critical values for various significance cutoffs (ex., .05). For a given level of significance such as .05, there is an upper and a lower d value limit. If the computed Durbin-Watson d value for a given series is more than the upper limit, the null hypothesis of no autocorrelation is not rejected and it is assumed that errors are serially uncorrelated. If the computed d value is less than the lower limit, the null hypothesis is rejected and it is assumed that errors are serially correlated. If the computed value is in between the two limits, the result is inconclusive. In SPSS, one can obtain the Durbin-Watson coefficient for a set of residuals by opening the syntax window and running the command FIT RES_1, assuming the residual variable is named RES_1.
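The Durbin-Watson statistic itself is simple to compute from the residual series; the residuals below are hypothetical.

```python
def durbin_watson(residuals):
    """d = sum((e_t - e_{t-1})**2) / sum(e_t**2). Values near 2 suggest no autocorrelation;
    near 0, positive autocorrelation; near 4, negative autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Runs of same-signed residuals (positive autocorrelation) pull d well below 2:
d = durbin_watson([1.0, 1.2, 0.9, -1.1, -0.8, -1.0])
```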
For a graphical test of serial independence, a plot of studentized residuals on the Y axis against the sequence of cases (the caseid variable) on the X axis should show no pattern, indicating independence of errors. In SPSS, select Graphs, Scatter/Dot, Simple Scatter; specify sre_1 (the studentized residual, previously saved with the Save button in the regression dialog) as the Y axis and caseid as the X Axis; OK; double-click the graph to bring up the Chart Editor; select Options, Y Axis Reference Line; click Properties, specify 0 for the Y axis position; click Apply; Close.
In the illustration below, bank data are used to see if credit card debt predicts age. The plot of the ID variable on the X axis against studentized residuals on the Y axis shows no pattern, indicating data are not autocorrelated.
When autocorrelation is present, one may choose to use generalized least-squares (GLS) estimation rather than the usual ordinary least-squares (OLS). In iteration 0 of GLS, the estimated OLS residuals are used to estimate the error covariance matrix. Then in iteration 1, GLS estimation minimizes the sum of squares of the residuals weighted by the inverse of the sample covariance matrix.
As an independent: The regression model makes no distributional assumptions about the independents, which may be discrete variables as long as other regression assumptions are met. The discreteness of ordinal variables is thus not a problem, but do ordinal variables approach intervalness? Ordinal variables must be interpreted with great care when there are known large violations of intervalness, such as where it is known that rankings obscure large gaps between, say, the top three ranks and all the others. In most cases, however, methodologists simply use a rule-of-thumb that there must be a certain minimum number of classes in the ordinal independent (Achen, 1991, argues for at least 5; Berry (1993: 47) states five or fewer is "clearly inappropriate"; others have insisted on 7 or more). However, it must be noted that use of 5-point Likert scales in regression is extremely common in the literature.
As a dependent: Ordinal dependents are more problematic because their discreteness violates the regression assumptions of normal distribution of error with constant variance. A conservative method is to test to see if there are significant differences in the regression equation when computed separately for each value class of the ordinal dependent. If the independents seem to operate equally across each of the ordinal levels of the dependent, then use of an ordinal dependent is considered acceptable. The more liberal and much more common approach is to allow use of ordinal dependents as long as the number of response categories is not very small (at least 5 or 7, see above) and the responses are not highly concentrated in a very small number of response categories.
Three considerations govern which category to leave out. Since the b coefficients for dummy variables will reflect changes in the dependent with respect to the reference group (which is the left-out group), it is best if the reference group is clearly defined. Thus leaving out the "Other" or "Miscellaneous" category is not a good idea since the reference comparisons will be unclear, though leaving out "North" in the example above would be acceptable since the reference is well defined. Second, the left-out reference group should not be one with only a small number of cases, as that will not lead to stable reference comparisons. Third, some researchers prefer to leave out a "middle" category when transforming ordinal categories into dummy variables, feeling that reference comparisons with median groups are better than comparisons with extremes.
Regression coefficients should be assessed for the entire set of dummy variables for an original variable like "Region" (as opposed to separate t-tests for b coefficients as is done for interval variables). For a regression model in which all the independents are dummies for one original ordinal or nominal variable, the test is the F test for R-squared. Otherwise the appropriate test is the F test for the difference of R-squareds for the model with the set of dummies and the model without the set.
F = [(R22 - R12)/(k2 - k1)] / [(1 - R22)/(n - k2 - 1)], where R22 and k2 are the R2 and number of predictors for the model including the dummy set, and R12 and k1 are those for the model without it.
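A worked numeric example of this test, using hypothetical values for a set of three dummies:

```python
# Hypothetical: model with three dummies added (k2 = 5) vs. without (k1 = 2), n = 106
r2_full, r2_reduced = 0.30, 0.25
k2, k1, n = 5, 2, 106
F = ((r2_full - r2_reduced) / (k2 - k1)) / ((1 - r2_full) / (n - k2 - 1))
# F ≈ 2.38, read against the F distribution with 3 and 100 degrees of freedom
```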
There are three methods of coding dummy variables. Coding greatly affects the magnitude and meaning of the b and beta coefficients, but not their significance. Coding does not affect the R-squared for the model or the significance of R-squared, as long as all dummy variables save the reference category are included in the model.
In general, the b coefficients are the distances from the dummy values to the reference value, controlling for other variables in the equation, and the distance from the reference category to the other dummy variables will be the same in a model in which the reference (omitted) categories are switched. Another implication is that the distance from one included dummy value to another included value (ex., from East to West in the example in which North is the omitted reference category) is simply the difference in their b coefficients. Thus if the b coefficient for East is 2.1 and the b coefficient for West is 1.6, then we may say that the effect of East is .5 units more (2.1 - 1.6 = .5) than the West effect, where the effect is still gauged in terms of unit increases in the dependent variable compared to being in the North. For "Region," assuming "North" is the reference category and education level is the dependent, a b of -1.5 for the dummy "South" means that the expected education level for the South is 1.5 years less than the average of "North" respondents.
Some textbooks say the b coefficient for a dummy variable is the difference in means between the two values of the dummy (0,1) variable. This is true only if the variable is a dichotomy. In general, the b coefficient for a given dummy variable is the difference in means between the given dummy variable and omitted reference dummy variable. For dichotomies, there will be only one given dummy variable and the other value will be the omitted reference category and so it is a special case in which the b coefficient is the difference in means between the two values of the dummy variable.
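The dichotomy case can be verified directly: fitting a one-predictor regression on a 0/1 dummy recovers the difference in group means as b and the reference-group mean as the intercept. The education scores below are hypothetical.

```python
def simple_ols(x, y):
    """Slope and intercept for one-predictor OLS."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return b, my - b * mx

# Hypothetical education scores: first three cases are the reference group (dummy = 0)
x = [0, 0, 0, 1, 1, 1]
y = [12, 14, 13, 10, 11, 12]
b, c = simple_ols(x, y)
# b = mean(group 1) - mean(group 0) = 11 - 13 = -2; c = mean(group 0) = 13
```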
In an experimental context, the omitted reference group would ordinarily be the control group.
For this example, a regression coefficient (b) of -1.5 for the South effect variable means that the intercept is reduced by 1.5, meaning that the expected education level for the South is 1.5 years less than the unweighted mean of the expected values for all subgroups. That is, where binary coding interprets b for the dummy category (South) relative to the reference group (the left-out category), effects coding interprets it relative to the entire set of groups. A positive b coefficient for any included group (other than the -1 group, West) means it scored higher on the response variable than the grand mean for all subgroups, or if negative, then lower. A significant b coefficient for any included group means that group is significantly different on the response variable from the grand mean. Under effect coding there is no comparison between the group coded -1 and the grand mean.
To compare the first cluster with the second, the cluster of interest (managers and white-collar) would thus be coded +.5 each (1 divided by the 2 categories in the cluster), and the categories of the reference cluster -.33 each (-1 divided by the 3 categories). Contrast code(s) will sum to zero across all categories. To contrast managers vs. white-collar only, code managers as the category of interest (+1), white-collar as the reference category (-1), and all others as the third cluster (0). The group contrast is the b coefficient times (nint + nref)/(nint * nref), where n is the number of categories in the cluster of interest (int) or the reference cluster (ref).
A significant b coefficient means the variables or clusters of variables being contrasted are significantly different on the response variable. Under contrast coding, the b coefficients do not have a clear interpretation in terms of group means on the response variable.
In SPSS, categorical regression is invoked from the menus by selecting Analyze, Regression, Optimal Scaling; then specifying the dependent variable and independent variable(s). Optionally, one may change the scaling level for each variable. Scaling level choices are nominal, ordinal, or numeric (interval), plus spline nominal and spline ordinal (the spline choices create a smoother but less well-fitting curve).
CATREG output includes frequencies, regression coefficients, the ANOVA table, the iteration history, category quantifications, correlations between untransformed predictors, correlations between transformed predictors, residual plots, and transformation plots. Selecting the Coefficients option gives three tables: a Coefficients table that includes betas, standard error of the betas, t values, and significance; a Coefficients-Optimal Scaling table with the standard error of the betas taking the optimal scaling degrees of freedom into account; and a table with the zero-order, part, and partial correlation, Pratt’s relative importance measure for the transformed predictors, and the tolerance before and after transformation.
CATREG assumes category indicators are positive integers. There is a Discretize button in the categorical regression dialog box to convert fractional-value variables and string variables into positive integers. There may be only one dependent variable and up to 200 predictors (in SPSS 12), and the number of valid cases must exceed the number of predictor variables plus one. Note that CATREG is equivalent to categorical canonical correlation analysis with optimal scaling (OVERALS) with two sets, one of which (the dependent or response set) contains only one variable. Scaling all variables at the numerical level corresponds to standard multiple regression analysis.
Warning: Optimal scaling recodes values on the fly to maximize goodness of fit for the given data. As with any atheoretical, post-hoc data mining procedure, there is a danger of overfitting the model to the given data. Therefore, it is particularly appropriate to employ cross-validation, developing the model for a training dataset and then assessing its generalizability by running the model on a separate validation dataset.
See Fuller (1987), who for a not untypical data set estimated attenuation coefficients of .98 for gender, .88 for education level, and .58 for poverty status. That is, attenuation is a non-trivial problem which can lead to serious underestimation of regression coefficients. The variance of the residuals is the estimate of error variance, assuming all relevant variables are in the equation and all irrelevant variables are omitted.
The method of causal inference through corresponding regressions was subsequently set out by Chambers (1991). Consider a bivariate regression of y on x where it is uncertain whether the causal direction might be the reverse. In corresponding regressions, y is regressed on x, and the absolute values of the deviations (predicted minus actual values of y) are noted as a measure of the extremity of prediction errors. Next the absolute deviations of the x values from the mean of x are taken to give a measure of the extremity of the predictor values. The two columns of deviations are correlated, giving the deviation correlation for y, labeled rde(y). The deviation correlation will be negative, since when predictor values are extreme, errors should be smaller: high values of the predictor lead to high values of the dependent, and low values to low values. The regression is then repeated for the regression of x on y, giving rde(x).
When the real independent variable serves as the predictor, the deviation correlation should be stronger (more negative) than when the real dependent serves as predictor. This is because mid-range predictor values (as measured by low extremity of predictor values) should be associated with mid-range dependent values (as measured by the extremity of errors) only when the true independent is used as the predictor of the true dependent. Chambers' D is rde(y) - rde(x). When the true independent is x and the true dependent is y, D will be negative. That is, only if x is the true independent and y is the true dependent will rde(y) be more negative than rde(x), making D negative after subtraction. If it is not, Chambers recommends assuming no correlation of the two variables (1991: 12).
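The mechanics of rde(y), rde(x), and Chambers' D can be sketched as follows (the data are invented for illustration; whether D actually comes out negative depends on the causal structure of the real data):

```python
import statistics

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) *
           sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def rde(pred, dep):
    """Deviation correlation: correlate the extremity of the predictor
    values, |pred - mean(pred)|, with the absolute errors from the
    bivariate OLS regression of dep on pred."""
    mp, md = statistics.mean(pred), statistics.mean(dep)
    slope = (sum((p - mp) * (d - md) for p, d in zip(pred, dep)) /
             sum((p - mp) ** 2 for p in pred))
    intercept = md - slope * mp
    abs_err = [abs((intercept + slope * p) - d) for p, d in zip(pred, dep)]
    extremity = [abs(p - mp) for p in pred]
    return pearson(extremity, abs_err)

# Illustrative data only: x is constructed here as the cause of y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.2, 15.9]
# rde(y) uses x as predictor; rde(x) uses y as predictor.
D = rde(x, y) - rde(y, x)
print(round(D, 3))
```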
Assumptions of corresponding regressions
Note that corresponding regressions is a controversial method not yet widely accepted and applied in the social science literature.
Note that ANOVA is not interchangeable with regression for two reasons: (1) ANOVA cannot handle continuous variables, as it is a grouped procedure. While continuous variables can be coded into categories, this loses information and attenuates correlation; and (2) ANOVA normally requires approximately equal n's in each group formed by the intersection of the independent variables. Equality of group sizes is equivalent to orthogonality among the independent variables. Regression allows correlation among the IVs (up to a point short of multicollinearity) and thus is more suitable to non-experimental data. Methods exist in ANOVA to adjust for unequal n's, but all are problematic.
F = [(R2_2 - R2_1) / (k_2 - k_1)] / [(1 - R2_2) / (n - k_2 - 1)]
Where
R2_2 = R-square for the second model (ex., one with interactions or with an added independent)
R2_1 = R-square for the first, restricted model (ex., without interactions or without an added independent)
n = total sample size
k_2 = number of predictors in the second model
k_1 = number of predictors in the first, restricted model
F has (k_2 - k_1) and (n - k_2 - 1) degrees of freedom and tests the null hypothesis that the population R2 increment between the two models is zero.
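A quick numeric check of this F formula (the R-square values, predictor counts, and sample size below are hypothetical):

```python
def incremental_f(r2_full, r2_restricted, k2, k1, n):
    """F test for the R-square increment between two nested models."""
    num = (r2_full - r2_restricted) / (k2 - k1)
    den = (1 - r2_full) / (n - k2 - 1)
    return num / den, (k2 - k1, n - k2 - 1)

# Hypothetical case: adding 2 predictors raises R-square from .40 to .45
# in a sample of n = 100.
f, df = incremental_f(0.45, 0.40, k2=5, k1=3, n=100)
print(round(f, 3), df)  # 4.273 (2, 94)
```

The resulting F would then be compared to the critical value of F with 2 and 94 degrees of freedom.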
Instead Iverson proposes a "relative effects model" in which the individual-level measure would be, for this example, the individual's ability score minus the group (team) mean, and the group-level measure would be the group team mean minus the overall mean on ability for all teams. This transformation, which must be warranted by the theory of the model being assessed, usually eliminates or greatly reduces the multicollinearity problem. In the relative effects model, one would then regress performance on the relative individual ability measures, employing a separate regression for each team. The constant is the value of performance when the individual ability is the same as the team mean (not zero, as in the absolute effects model). If the b coefficients vary from team to team, this indicates a group effect.
To investigate the group effect using the single-equation method, one regresses performance on the relative individual, group, and interaction variables, generating coefficients corresponding to the individual, group, and interaction effects. (Iverson also describes a separate-equations method which generates the same estimates, but the single-equation method usually has smaller standard errors.) The standardized coefficients (beta weights) in this regression allow comparison of the relative importance of the individual, group, and interaction effects. This comparison does not suffer from multicollinearity because the relative effects transformations leave the variables with little or no correlation in most cases. Iverson (pp. 64-66) also describes an alternative of partitioning the sums of squares to assess individual vs. group vs. interaction effects.
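The relative effects transformation itself is simple centering, sketched below with hypothetical team ability scores:

```python
from statistics import mean

# Hypothetical ability scores for three teams.
teams = {"A": [3.0, 4.0, 5.0], "B": [6.0, 7.0, 8.0], "C": [1.0, 2.0, 3.0]}

grand_mean = mean(s for scores in teams.values() for s in scores)
team_means = {t: mean(s) for t, s in teams.items()}

# Relative individual measure: individual score minus team mean.
relative_individual = {t: [x - team_means[t] for x in s]
                       for t, s in teams.items()}
# Relative group measure: team mean minus the grand mean.
relative_group = {t: team_means[t] - grand_mean for t in teams}

print(relative_individual["A"])       # [-1.0, 0.0, 1.0]
print(round(relative_group["B"], 2))  # 2.67

# The two measures are exactly uncorrelated across individuals, which is
# why the transformation removes the multicollinearity problem.
cross = sum((x - team_means[t]) * relative_group[t]
            for t, s in teams.items() for x in s)
print(cross)  # 0.0
```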
The models Iverson discusses can be done in SPSS or SAS, but one must compute the relative individual, group, and interaction variables manually. This can become tedious or nearly impossible in large models. Consequently, various packages for contextual analysis have been created, including GENMOD (Population Studies Center, University of Michigan), ML3 (Multilevel Models Project, Institute of Education, University of London), and the most popular, HLM (see Bryk et al., 1988). Iverson briefly mentions these packages but provides no discussion of the steps involved in their use.
The researcher can set the level of exponentiation (including 1 = the linear case), but cubic polynomial fitting is typical. Thus, for simple one-independent variable models, let x0 be the value of x at the bin focal point, and let xi be the value of x at any of i other points within the bin. In the cubic case, the polynomial regression equation would be yi = b1(xi - x0) + b2(xi - x0)^2 + b3(xi - x0)^3 + c.
The span, s, for the bandwidth can also be set by the researcher. One method is simply visual trial and error using various values of s, seeking the smallest s which still generates a smooth curve.
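A much-simplified sketch of the idea (using a local mean within the span's nearest neighbors rather than a fitted local cubic, and with invented data) is:

```python
def local_mean_smooth(xs, ys, span):
    """Simplified local smoother: at each focal point, average the y values
    of the nearest span-fraction of the data points. Full loess would
    instead fit a weighted cubic polynomial within each neighborhood."""
    n = len(xs)
    k = max(2, int(span * n))  # neighborhood size from the span
    fitted = []
    for x0 in xs:
        nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        fitted.append(sum(ys[i] for i in nearest) / k)
    return fitted

xs = [i / 10 for i in range(21)]  # 0.0, 0.1, ..., 2.0
ys = [x * x for x in xs]          # a smooth nonlinear relation
smooth = local_mean_smooth(xs, ys, span=0.2)
print(len(smooth) == len(xs))     # True
```

Decreasing the span makes the curve follow the data (including its noise) more closely; increasing it gives a smoother but potentially worse-fitting curve, which is the trade-off behind the visual trial-and-error method just described.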
A purpose of using quantile regression is to analyze these plots to better understand how outliers or other violations of OLS assumptions affect the distribution of y estimates at different quantile values. Quantile regression is preferred when heteroscedasticity is present. For heteroscedastic data, quantile regression will generate different beta weights for different quantiles. The single OLS beta weight for a given independent variable may be revealed to be a poor average of the beta weights at different quantiles. These beta weights may reveal that the effect, or even the direction of effect, of a given independent variable varies by quantile, giving a more complex and sophisticated understanding of its importance.
Starting with SPSS 17, the R-language SPSS extension QUANTREG implemented quantile regression. See documentation at http://en.wikibooks.org/wiki/Statistics:Numerical_Methods/Quantile_Regression. Documentation states one must install R 2.7 or higher, the R plug-in, the Python plug-in, the R quantreg package, and the SPSSINC QUANTREG extension package. To install an R package, start R and use the Packages>Install Packages menu item.
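The objective that quantile regression minimizes, the "check" (or pinball) loss, is easy to state. The toy grid search below is only a sketch of that objective for one predictor; production implementations such as the R quantreg package solve a linear program instead. All data and grid values here are invented:

```python
def check_loss(u, tau):
    """Check (pinball) loss: tau*u for u >= 0, (tau - 1)*u for u < 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def fit_quantile_line(xs, ys, tau, slopes, intercepts):
    """Pick the (slope, intercept) pair on a grid minimizing total check loss."""
    best = None
    for b in slopes:
        for c in intercepts:
            loss = sum(check_loss(y - (b * x + c), tau)
                       for x, y in zip(xs, ys))
            if best is None or loss < best[0]:
                best = (loss, b, c)
    return best[1], best[2]

# Heteroscedastic toy data: for each x, one point at y = 2x and one at y = 0,
# so the spread of y grows with x.
xs = [float(i) for i in range(1, 11)] * 2
ys = [2.0 * i for i in range(1, 11)] + [0.0] * 10
b_lo, _ = fit_quantile_line(xs, ys, 0.1, [0.0, 0.5, 1.0, 1.5, 2.0], [0.0])
b_hi, _ = fit_quantile_line(xs, ys, 0.9, [0.0, 0.5, 1.0, 1.5, 2.0], [0.0])
print(b_lo, b_hi)  # 0.0 2.0
```

The 10th-percentile slope (0) and the 90th-percentile slope (2) differ sharply, exactly the situation in which the single OLS slope (about 1 here) would be a misleading average.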
For handling nonlinear relationships in a regression context, nonparametric regression is now considered preferable to simply adding polynomial terms to the regression equation (as is done, for instance, in SPSS through Analyze, Regression, Nonlinear menu choice). Nonparametric regression methods allow the data to influence the shape (curve) of a regression line. Note that this means that nonparametric regression is usually an atheoretical method, not involving positing the model in advance, but instead deriving it from the data. Consequently, fitting a curve to noise in the data is a critical concern in nonparametric regression. Nonparametric regression is treated in Fan and Gijbels (1996) and Fox (2000b), who covers local polynomial multiple regression, additive regression models, projection-pursuit regression, regression trees, and GLM nonparametric regression.
Local polynomial multiple regression makes the dependent variable a single nonlinear function of the independent variables. Local regression fits a regression surface not for all the data points as in traditional regression, but for the data points in a "neighborhood." Researchers determine the "smoothing parameter," which is a specified percentage of the sample size, and neighborhoods are the points within the corresponding radius. In the loess method, weighted least squares is used to fit the regression surface for each neighborhood, with data points in a neighborhood weighted in a smooth decreasing inverse function of their distance from the center of the neighborhood. As an alternative to this nearest-neighbor smoothing, one may define bands instead of neighborhood spans, with bandwidths being segments of the range of the independent variable(s). The fitting of surfaces to neighborhoods may be done at a sample of points in predictor space, or at all points. Regardless, the surfaces are then blended together to form the curved line or curved surface characteristic of nonparametric regression. SAS implements local regression starting in Version 8 in its proc loess procedure. As of version 10, SPSS does not directly implement nonparametric regression, though its website does provide a Java applet demo. See Fox (2000b: 8-26).
Problems of local regression. Fox (2000b: 20) refers to "the curse of dimensionality" in local regression, noting that as the number of predictor variables increases, the number of data points in the local neighborhood of a focal point tends to decline rapidly. This means that to obtain a given percentage of data points, the smoothing parameter radius must become less and less local. Other problems of local regression include (1) its post-hoc atheoretical approach to defining the regression curve; (2) the fact that dynamic inference from the b coefficients is no longer possible due to nonlinearity, requiring graphical inference instead; and (3) graphical display becomes difficult to comprehend when more than three independent variables are in the model (Fox, 2000b: 26, recommends coplots as the best display alternative).
Additive regression models allow the dependent variable to be the additive sum of nonlinear functions which are different for each of the independent variables. This means that the dependent variable equals the sum of a series of two-dimensional partial regressions. For the dependent y and each independent x, one can predict adjusted y as a local regression function of x. The adjustment has to control y for other independents in the equation. An iterative method called backfitting simultaneously solves the nonlinear functions for each independent (x) term, and the dependent is the additive sum of these terms. See Fox (2000b: 27-37).
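A bare-bones sketch of backfitting for two predictors, using a nearest-neighbor mean smoother for each term (all data invented; real additive-model software uses better smoothers and convergence checks):

```python
import random

def knn_smooth(x, r, k=3):
    """Nearest-neighbor mean smoother for one partial regression."""
    n = len(x)
    out = []
    for x0 in x:
        nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        out.append(sum(r[i] for i in nearest) / k)
    return out

def backfit(x1, x2, y, iters=10):
    """Backfitting: alternately smooth each term against the partial
    residuals left by the other, fitting y as mean(y) + f1(x1) + f2(x2)."""
    n = len(y)
    ybar = sum(y) / n
    f1 = [0.0] * n
    f2 = [0.0] * n
    for _ in range(iters):
        f1 = knn_smooth(x1, [y[i] - ybar - f2[i] for i in range(n)])
        m1 = sum(f1) / n
        f1 = [v - m1 for v in f1]  # keep each term centered
        f2 = knn_smooth(x2, [y[i] - ybar - f1[i] for i in range(n)])
        m2 = sum(f2) / n
        f2 = [v - m2 for v in f2]
    return [ybar + f1[i] + f2[i] for i in range(n)]

# Invented additive data: y = x1^2 + 3*x2.
random.seed(0)
x1 = [i / 3 for i in range(30)]
x2 = [float(v) for v in random.sample(range(30), 30)]
y = [x1[i] ** 2 + 3 * x2[i] for i in range(30)]
fitted = backfit(x1, x2, y)
rss = sum((y[i] - fitted[i]) ** 2 for i in range(30))
tss = sum((v - sum(y) / 30) ** 2 for v in y)
print(rss < tss)  # True
```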
Note it is also possible to have a semi-parametric regression model in which some of the independent variables have nonparametric functions as described above, while others have conventional regression coefficients. In particular, a semi-parametric model would be appropriate if dummy variable terms were present: dummy variables would be entered as linear terms. Additive models have the same problems of interpretation as local regression.
Projection-pursuit regression first reduces attribute space by creating latent variables which are regression functions of the raw independent variables, then makes the dependent the additive sum of nonlinear functions which are different for each of these latent variables. A purpose of projection-pursuit regression is that by reducing the number of variables in local regression and by making the dependent an additive function of a series of bivariate partial regressions, the "curse of dimensionality" problem mentioned above is mitigated. The price paid, however, is that, as Fox notes, "arbitrary linear combinations of predictors do not usually correspond to substantively meaningful variables" and difficulty in interpreting the resulting nonparametric regression is multiplied. At least with additive regression models, for instance, one can interpret the partial regression coefficient signs as indications of the direction of effect of individual predictor variables. See Fox (2000b: 37-47).
Regression trees employ successive binary divisions of predictor attribute space, making the dependent variable a function of a binning and averaging process. Also called the AID (automatic interaction detection) method, regression trees are classification trees for continuous data. There are several different algorithms for creating regression trees, but they all involve successive partitioning of cases into smaller and smaller bins based on one or more independent variables. A stopping criterion for the partitioning might be when the bins have 10 cases or fewer. For instance, one branch might be: if income < 32564 then if education < 14.2 then job satisfaction = 88.9, where 88.9 is the mean for the cases in that bin. Cutting points for branching the tree are set to minimize classification error as reflected in the residual sum of squares. As algorithms may produce an over-complex tree attuned to noise in the data, the researcher may "prune" the tree, trading off some increase in error to obtain a less complex tree. SPSS supports regression trees in its AnswerTree product. See Fox (2000b: 47-58).
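The core step, choosing a cutting point to minimize the residual sum of squares, can be sketched for a single predictor (the income and satisfaction figures are invented):

```python
def best_split(x, y):
    """Return the cut on x minimizing the residual sum of squares around
    the two resulting bins' means (the criterion described above)."""
    def rss(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    best = None
    for cut in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        total = rss(left) + rss(right)
        if best is None or total < best[0]:
            best = (total, cut)
    return best[1]

# Hypothetical income values predicting job satisfaction, with an
# obvious break between 25000 and 40000.
income = [10000, 15000, 20000, 25000, 40000, 45000, 50000, 55000]
satisf = [60.0, 62.0, 61.0, 63.0, 88.0, 90.0, 89.0, 87.0]
print(best_split(income, satisf))  # 40000
```

A full tree algorithm would apply this search recursively within each bin, over all predictors, until a stopping criterion (such as the minimum bin size mentioned above) is met.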
Problems of regression trees. Use of automated tree algorithms commonly results in overfitting of trees (too much complexity, such that the branching rules seem arbitrary and unrelated to any theory of causation among the variables). This can be compensated in part by developing the tree for one set of data, then cross-validating it on another. Regression trees can be difficult to interpret because small changes in cutting points can have large impacts on branching in the tree. Branching is also affected by data density and sparseness, with more branching and smaller bins in data regions where data points are dense. In general, regression trees are more useful where the purpose is creating decision rules than when the purpose is causal interpretation.
GLM nonparametric regression allows the logit of the dependent variable to be a nonlinear function of the independent variables. While GLM techniques like logistic regression are nonlinear in that they employ a transform (for logistic regression, the natural log of the odds of a dependent variable) which is nonlinear, in traditional form the result of that transform (the logit of the dependent variable) is a linear function of the terms on the right-hand side of the equation. GLM nonparametric regression relaxes the linearity assumption to allow nonlinear relations over and beyond those of the link function (logit) transformation. See Fox (2000b: 58-73).
Methodology
Examples of Use of Regression in Public Administration
Copyright 1998, 2008, 2009, 2010, 2011 by G. David Garson.
Last update: 3/16/2011.