Overview
The multiple regression equation takes the form y = b1x1 + b2x2 + ... + bnxn + c. The b's are the regression coefficients, each representing the amount the dependent variable y changes when the corresponding independent variable changes one unit. The c is the constant, the point where the regression line intercepts the y axis, representing the value the dependent y takes when all the independent variables are 0. The standardized versions of the b coefficients are the beta weights, and the ratio of the beta coefficients is the ratio of the relative predictive power of the independent variables. Associated with multiple regression is R2, the squared multiple correlation, which is the percent of variance in the dependent variable explained collectively by all of the independent variables. Multiple regression shares all the assumptions of correlation: linearity of relationships, the same level of relationship throughout the range of the independent variable ("homoscedasticity"), interval or near-interval data, absence of outliers, and data whose range is not truncated. In addition, it is important that the model being tested is correctly specified: the exclusion of important causal variables or the inclusion of extraneous variables can change the beta weights markedly and hence the interpretation of the importance of the independent variables. A variety of alternatives related to OLS regression are treated in separate sections.
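As a minimal sketch of what fitting this equation involves, the b coefficients and constant c can be obtained by solving the normal equations. The data below are hypothetical, generated exactly from y = 2x1 + 3x2 + 1 so the fitted coefficients are known in advance; the function names are my own.

```python
def solve(A, v):
    """Solve the linear system A x = v by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(predictors, y):
    """Return [b1, ..., bn, c] for y = b1*x1 + ... + bn*xn + c via the normal equations."""
    rows = [list(vals) + [1.0] for vals in zip(*predictors)]  # design matrix with constant column
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

# Hypothetical data generated exactly as y = 2*x1 + 3*x2 + 1
x1 = [0, 1, 2, 0, 1]
x2 = [0, 0, 1, 2, 2]
y = [2 * a + 3 * b + 1 for a, b in zip(x1, x2)]
b1, b2, c = ols([x1, x2], y)   # recovers b1 ≈ 2, b2 ≈ 3, c ≈ 1
```

Statistical packages such as SPSS perform this same least-squares fit, along with the standard errors and tests discussed below.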

Note: t-tests are not used for dummy variables, even though SPSS and other statistical packages output them -- see the Frequently Asked Questions section below. Note also that the t-test tests only the unique variance an independent variable accounts for, not the shared variance it may also explain: shared variance, while incorporated in R2, is not reflected in the b coefficient.
One- vs. two-tailed t-tests. Note also that t-tests in SPSS and SAS are two-tailed, meaning they test the hypothesis that the b coefficient is either significantly higher or significantly lower than zero. If our model is such that we can rule out one direction (ex., negative coefficients) and thus should test only whether the b coefficient is greater than zero, we want a one-tailed test. The two-tailed significance level will be twice the one-tailed probability level: if SPSS reports .10 for a two-tailed test, for instance, then the one-tailed equivalent significance level is .05.
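The conversion above can be sketched in a few lines. This is a hypothetical helper of my own naming; note that halving applies only when the coefficient's observed sign agrees with the hypothesized direction.

```python
def one_tailed_p(p_two_tailed, b, expected_sign=1):
    """Convert a two-tailed p-value to one-tailed, given a directional hypothesis."""
    if (b > 0) == (expected_sign > 0):
        return p_two_tailed / 2      # observed direction agrees with the hypothesis
    return 1 - p_two_tailed / 2      # observed direction contradicts the hypothesis

p = one_tailed_p(0.10, b=1.4)        # 0.05, as in the example in the text
```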
For large samples, SEE approximates the standard error of a predicted value. SEE is the standard deviation of the residuals. In a good model, SEE will be markedly less than the standard deviation of the dependent variable. In a good model, the mean of the dependent variable will be greater than 1.96 times SEE.
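The computation behind SEE can be sketched as follows; function names are my own, and the rule of thumb is simply that SEE should be markedly smaller than the standard deviation of the dependent variable.

```python
import math

def see(residuals, k):
    """Standard error of estimate: sqrt(SSE / (n - k - 1)), where k = number of predictors."""
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    return math.sqrt(sse / (n - k - 1))

def sample_sd(values):
    """Sample standard deviation, for comparison against SEE."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

# In a good model, see(residuals, k) is markedly smaller than sample_sd(y).
```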

In SPSS, the F test appears in the ANOVA table, shown above for the example of number of auto accidents predicted from gender and age. Note that the F test is too lenient for the stepwise method of estimating regression coefficients and an adjustment to F is recommended (see Tabachnick and Fidell, 2001: 143 and Table C.5). In SPSS, select Analyze, Regression, Linear; click Statistics; make sure Model fit is checked to get the ANOVA table and the F test. Here the model is significant at the .001 level, which is the same as shown in the Model Summary table.
Generally, the beta method and the model comparison method will show the same IVs to be most important, but it easily can happen that an IV will have a beta approaching zero but still have an appreciable effect on R2 when it is dropped from the model, because its joint effects are appreciable even if its unique effect is not. The beta method is related to partial correlation, which is relative to the variability of the dependent variable after partialling out from the dependent variable the common variance associated with the control IVs (the IVs other than the one being considered in partial correlation). The model comparison method is related to part correlation, which is relative to the total variability of the dependent variable.

Mathematically, R2 = (1 - (SSE/SST)), where SSE = error sum of squares = SUM((Yi - EstYi)squared), where Yi is the actual value of Y for the ith case and EstYi is the regression prediction for the ith case; and where SST = total sum of squares = SUM((Yi - MeanY)squared). Sums of squares are shown in the ANOVA table in SPSS output, where the example computes R2 = (1-(1140.1/1172.358)) = 0.028. The "residual sum of squares" in SPSS output is SSE and reflects regression error. Thus R-square is 1 minus regression error as a percent of total error and will be 0 when regression error is as large as it would be if you simply guessed the mean for all cases of Y. Put another way, the regression sum of squares/total sum of squares = R-square, where the regression sum of squares = total sum of squares - residual sum of squares. In SPSS, Analyze, Regression, Linear; click the Statistics button; make sure Model fit is checked to get R2.
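The arithmetic above can be verified directly. The function is a generic sketch; the second computation simply reproduces the sums of squares quoted from the SPSS ANOVA table.

```python
def r_squared(y, y_hat):
    """R2 = 1 - SSE/SST, computed from actual and predicted values."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

# Reproducing the figures quoted from the SPSS ANOVA table:
r2 = 1 - 1140.1 / 1172.358   # ≈ 0.0275, reported as 0.028
```

Note that predicting the mean for every case gives R2 = 0, and perfect prediction gives R2 = 1, as the text states.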

The Model Summary table in SPSS output, shown above, gives R, R2, adjusted R2, the standard error of estimate (SEE), R2 and F change and the corresponding significance level, and the Durbin-Watson statistic. In the example above, number of accidents is predicted from age and gender. This output shows age and gender together explain only 2.4% of the variance in number of accidents for this sample. R2 is close to adjusted R2 because there are only two independent variables (adjusted R2 is discussed below). R2 change is the same as R2 because the variables were entered at the same time (not stepwise or in blocks), so there is only one regression model to report, and R2 change is the change from the intercept-only model, which is also what R2 is. R2 change is discussed below. Since there is only one model, "Sig F Change" is the overall significance of the model, which for one model is also the significance of adding sex and age to the model in addition to the intercept. The Durbin-Watson statistic is a test to see if the assumption of independent observations is met, which is the same as testing to see if autocorrelation is present. As a rule of thumb, a Durbin-Watson statistic in the range of 1.5 to 2.5 means the researcher may reject the notion that data are autocorrelated (serially dependent) and instead may assume independence of observations, as is the case here. The Durbin-Watson test is discussed below.
Below, SPSS prints out R2 change and its corresponding significance level. For Block/Model 1, R2 change is .05 and is significant at better than the .001 level. This is the change from the intercept-only model to the model with years of education as a predictor. Because there is only one independent (number of grandparents born abroad) added in Block 2, the significance of the R2 change is the same as the significance of the b coefficient for that variable, namely a non-significant .397 in this example. While for this simple example use of the R2 change method is not necessary, for more complex tests, such as testing sets of independents or testing dummy variables as discussed below, the R2 change method is standard.
F-incremental = [(R2with - R2without)/m] / [(1 - R2with)/df]
where m = number of IVs in the new block which is added; and df = N - k - 1 (where N is sample size and k is the number of independent variables in the "with" model). F is read with m and df degrees of freedom to obtain a p (probability) value. Note the without model is nested within the with model. In SPSS, Analyze, Regression, Linear; click the Statistics button; make sure R squared change is checked to get "Sig F Change".
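The formula above translates directly into code. The values plugged in below are hypothetical, chosen only to illustrate the arithmetic.

```python
def f_incremental(r2_with, r2_without, m, n, k):
    """F for the R2 change when a block of m predictors is added.
    n = sample size; k = number of predictors in the larger ('with') model."""
    df = n - k - 1
    return ((r2_with - r2_without) / m) / ((1 - r2_with) / df)

# Hypothetical: adding one predictor raises R2 from .40 to .50 with n = 103, k = 2
F = f_incremental(r2_with=0.5, r2_without=0.4, m=1, n=103, k=2)   # F ≈ 20, read with 1 and 100 df
```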
The beta weights for the equation in the final step of stepwise regression do not partition R2 into increments associated with each independent because beta weights are affected by which variables are in the equation. The beta weights estimate the relative predictive power of each independent, controlling for all other independent variables in the equation for a given model. The R2 increments estimate the predictive power an independent brings to the analysis when it is added to the regression model, as compared to a model without that variable. Beta weights compare independents in one model, whereas R2 increments compare independents in two or more models.
This means that assessing a variable's importance using R2 increments is very different from assessing its importance using beta weights. The magnitude of a variable's beta weight reflects its relative explanatory importance controlling for other independents in the equation. The magnitude of a variable's R2 increment reflects its additional explanatory importance given that common variance it shares with other independents entered in earlier steps has been absorbed by these variables. For causal assessments, beta weights are better (though see the discussion of corresponding regressions for causal analysis). For purposes of sheer prediction, R2 increments are better.
The removal of outliers from the data set under analysis can at times dramatically affect the performance of a regression model. Outliers should be removed if there is reason to believe that other variables not in the model explain why the outlier cases are unusual -- that is, outliers may well be cases which need a separate model. Alternatively, outliers may suggest that additional explanatory variables need to be brought into the model (that is, the model needs respecification). Another alternative is to use robust regression, whose algorithm gives less weight to outliers but does not discard them.

In the figure above, for the example of predicting auto accidents from sex and age, the Casewise Diagnostics table shows two outliers: cases 166 and 244.
There will be a t value for each studentized deleted residual, with df = n - k - 1, where k is the number of independent variables. When t exceeds the critical value for a given alpha level (ex., .05) then the case is considered an outlier. In a plot of deleted studentized residuals versus ordinary residuals, one may draw lines at plus and minus two standard units to highlight cases outside the range where 95% of the cases normally lie. Points substantially off the straight line are potential leverage problems.


In the partial regression plot above, for the example of sex and age predicting car accidents, sex is being used first to predict accidents, then to predict age. Since sex does not predict age at all and predicts only a very small percentage of accidents, the pattern of residuals in the partial regression plot forms a random cloud. This case, where the dots do not form a line, does not indicate lack of linearity of age with accidents but rather correlation approaching zero.
When tolerance is close to 0 there is high multicollinearity of that variable with other independents and the b and beta coefficients will be unstable. The more the multicollinearity, the lower the tolerance and the larger the standard error of the regression coefficients. Tolerance is part of the denominator in the formula for calculating the confidence limits on the b (partial regression) coefficient.
| Rj | Tolerance | VIF | Impact on SEb |
|---|---|---|---|
| 0 | 1 | 1 | 1.0 |
| .4 | .84 | 1.19 | 1.09 |
| .6 | .64 | 1.56 | 1.25 |
| .75 | .44 | 2.25 | 1.5 |
| .8 | .36 | 2.78 | 1.67 |
| .87 | .25 | 4.0 | 2.0 |
| .9 | .19 | 5.26 | 2.29 |
Standard error is doubled when VIF is 4.0 and tolerance is .25, corresponding to Rj = .87. Therefore VIF >= 4 is an arbitrary but common cut-off criterion for deciding when a given independent variable displays "too much" multicollinearity: values above 4 suggest a multicollinearity problem. Some researchers use the more lenient cutoff of 5.0 or even 10.0 to signal when multicollinearity is a problem. The researcher may wish to drop the variable with the highest VIF if multicollinearity is indicated and theory warrants.
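The table above follows from three simple relationships: tolerance = 1 - Rj², VIF = 1/tolerance, and the standard-error inflation factor = sqrt(VIF). A sketch (function name my own):

```python
import math

def collinearity_stats(rj):
    """Tolerance, VIF, and SE inflation factor for a predictor whose multiple
    correlation with the other predictors is rj."""
    tolerance = 1 - rj ** 2
    vif = 1 / tolerance
    return tolerance, vif, math.sqrt(vif)

# Reproduces the rows of the table above, e.g. Rj = .6:
tol, vif, se_mult = collinearity_stats(0.6)   # ≈ (0.64, 1.56, 1.25)
```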

The figure above is output for the example of predicting accidents from gender and age. This further confirms that this example has no collinearity problem since no condition index approaches 30, making it unnecessary to examine variance proportions.
Panel data regression models may be in one of three types: fixed, between, or random effects.
Note: Adding variables to the model will always improve R2 at least a little for the current data, but it risks misspecification and does not necessarily improve R2 for other datasets examined later on. That is, it can overfit the regression model to noise in the current dataset and actually reduce the reliability of the model.
Sometimes specification is phrased as the assumption that "independent variables are measured without error." Error attributable to omitting causally important variables means that, to the extent that these unmeasured variables are correlated with the measured variables which are in the model, the b coefficients will be biased. If the correlation is positive, the b coefficients will be too high; if negative, too low. That is, when a causally important variable is added to the model, the b coefficients will all change, assuming that variable is correlated with existing measured variables in the model (usually the case).
The population error term, which is the difference between the actual values of the dependent and those estimated by the population regression equation, should be uncorrelated with each of the independent variables. Since the population regression line is not known for sample data, the assumption must be assessed by theory. One common type of correlated error occurs due to selection bias with regard to membership in the independent variable "group" (representing membership in a treatment vs. a comparison group): measured factors such as gender, race, education, etc., may cause differential selection into the two groups and also can be correlated with the dependent variable.
In practical terms, the researcher must be confident that (1) the variables not included in the equation are indeed not causes of Y and are not correlated with the variables which are included; and (2) that the dependent is not also a cause of one or more of the independents. Either circumstance would violate the assumption of uncorrelated error. When there is correlated error, conventional computation of standard deviations, t-tests, and significance are biased and yield invalid results. Coefficients may be over- or under-estimated.
Note that residual error -- the difference between observed values and those estimated by the sample regression equation -- will always be uncorrelated with the independents and therefore the lack of correlation of the residuals with the independents is not a valid test of this assumption.
That is, the problem with multicollinearity is that it increases the standard errors of the b coefficients and corresponding beta weights. Beta weights are used to assess the importance of the predictors. If the beta weights have larger standard errors, it is harder to distinguish among them. This poses the greatest problem when referring to multicollinearity among the independents, and most multicollinearity warnings in textbooks are referring to this type of multicollinearity. If there are two sets of variables on the predictor side of the equation, (1) independents of interest and (2) control variables, and if there is no multicollinearity within set (1) but there is multicollinearity between the two sets, one can still use beta weights to distinguish within set (1) but the control variables may reduce the effect sizes of the set (1) variables (except in the case of suppression, when the effect size may actually increase).
In regression, as a rule of thumb, nonlinearity is generally not a problem when the standard deviation of the dependent is more than the standard deviation of the residuals. Linearity is further discussed in the section on data assumptions. Note also that regression smoothing techniques and nonparametric regression exist to fit smoothed curves in a nonlinear manner.

An alternative for the same purpose is the normal probability plot, with the observed cumulative probabilities of occurrence of the standardized residuals on the Y axis and of expected normal probabilities of occurrence on the X axis, such that a 45-degree line will appear when observed conforms to normally expected. For the same example, the P-P plot below shows the same moderate departure from normality.

The F test is relatively robust in the face of small to medium violations of the normality assumption. By the central limit theorem, even when error is not normally distributed, when sample size is large the sampling distribution of the b coefficient will still be approximately normal. Therefore violations of this assumption usually have little or no impact on substantive conclusions for large samples, but when sample size is small, tests of normality are important. That is, for small samples where error is not normally distributed, t-tests of regression coefficients are not accurate.
Histograms and P-P plots may be selected under the Plots button in the SPSS regression dialog. Alternatively, in SPSS, select Graphs, Histogram; specify sre_1 as the variable (this is the studentized residual, previously saved with the Save button in the regression dialog). One can also test the residuals for normality using a Q-Q plot: in SPSS, select Graphs, Q-Q; specify the studentized residual (sre_1) in the Variables list; click OK. Dots should approximate a 45 degree line when residuals are normally distributed.
Note that regression is relatively robust regarding homoscedasticity and small to moderate violations of homoscedasticity have only minor impact on regression estimates (Fox, 2005: 516). When homoscedasticity is a problem, weighted least squares regression and quantile regression are commonly recommended. WLS regression causes cases with smaller residuals to be weighted more in calculating the b coefficients, diminishing the impact of heteroscedasticity. Square root, log, and reciprocal transformations of the dependent may also reduce or eliminate lack of homoscedasticity but should only be undertaken if such transforms are sensibly interpretable and do not upset the relationship of the dependent to the independents. Finally, robust estimation of standard errors is an increasingly popular option for dealing with heteroscedasticity as such estimators are less sensitive to error variance misspecification. Robust estimators are available in SPSS when regression (or other models) is accomplished through the generalized linear models module, discussed in a separate section of Statnotes.
When there is lack of homoscedasticity, the regression model will work better for some ranges of the dependent variable than for others (ex., better for high values than for low values). If homoscedasticity is violated, separate models may be required for the different ranges. Also, when the homoscedasticity assumption is violated "conventionally computed confidence intervals and conventional t-tests for OLS estimators can no longer be justified" (Berry, 1993: 81). While lack of homoscedasticity does not bias the b coefficient estimates, computed standard errors will be too high if error variance is positively correlated with the independents and too low if negatively correlated.
Nonconstant error variance can indicate the need to respecify the model to include omitted independent variables. Nonconstant error variance can be observed by requesting simple residual plots, as in the illustrations below, where "Training" as independent is used to predict "Score" as dependent:
As illustrated in the admittedly extreme case comparison below, in which Training causes Score, altering two independent variable values to be outliers can change a strong R-square to a null one. The influence of these two outliers, cases 2 and 3 in the illustration below, is measured by dfFit and dfBeta, discussed below.
Outliers may be flagged by use of residuals, influence statistics, and distance measures.
A rule of thumb is that cases with leverage under .2 are not a problem, but if a case has leverage over .5, the case has undue leverage and should be examined for the possibility of measurement error or the need to model such cases separately. The minimum, maximum, and mean leverage are displayed by SPSS in the "Residuals Statistics" table when "Casewise diagnostics" is checked under the Statistics button in the Regression dialog. Also, select Analyze, Regression, Linear; click Save; check Leverage to add these values to your data set as an additional column.
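For the one-predictor case, the leverage (hat value) of each case has a closed form, and the rule of thumb above can be applied directly. The data and function name below are hypothetical.

```python
def leverage(x):
    """Hat values for a single-predictor regression:
    h_i = 1/n + (x_i - mean)**2 / sum((x_j - mean)**2)."""
    n = len(x)
    mean = sum(x) / n
    ssx = sum((v - mean) ** 2 for v in x)
    return [1 / n + (v - mean) ** 2 / ssx for v in x]

h = leverage([1, 2, 3, 4, 100])
flagged = [i for i, hv in enumerate(h) if hv > 0.5]   # the case with x = 100 has undue leverage
```

A useful check: the hat values always sum to the number of estimated parameters (here 2, for slope and intercept).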
Graphical method with data labels: Influential cases with high leverage can be spotted graphically. Save lev_1 in the SPSS Save procedure above, then select Graphs, Scatter/Dot; select Simple Scatter; click Define; make lev_1 the Y axis and caseid the X axis; be sure to make an appropriate variable (like Name) the "Label cases by" variable; OK. Then double-click on the plot to bring up the Chart Editor; select Elements, Data Label Mode; click on cases high on the Y axis.
As a rule of thumb, the maximum Mahalanobis distance should not exceed the critical chi-squared value with degrees of freedom equal to number of predictors and alpha =.001, or else outliers may be a problem in the data. The minimum, maximum, and mean Mahalanobis distances are displayed by SPSS in the "Residuals Statistics" table when "Casewise diagnostics" is checked under the Statistics button in the Regression dialog.
Fox (1991: 34) suggests as a cut-off for detecting influential cases, values of D greater than 4/(N - k - 1), where N is sample size and k is the number of independents. Others suggest D > 1 as the criterion to constitute a strong indication of an outlier problem, with D > 4/n the criterion to indicate a possible problem. In SPSS, the minimum, maximum, and mean Cook's D are displayed in the "Residuals Statistics" table when "Casewise diagnostics" is checked under the Statistics button in the Regression dialog. Also, select Analyze, Regression, Linear; click Save; check Cook's to add these values to your data set as an additional column.
One can also spot outliers graphically using Cook's distance, which highlights very (unduly) influential cases. In SPSS, save Cook's distance (coo_1) using the Save button in the Regression dialog. Then select Graphs, Scatter/Dot, Simple Scatter; click Define; let coo_1 be the Y axis and case number be the X axis; click OK. If the graph shows any points far off the line, you can label them by case number. Double-click in the chart to bring up the Chart Editor, then select Elements, Data Label Mode, then click on the outlying dot(s) to make the label(s) appear.
Alternatively, the computed d value can be compared to tabled critical values for various significance cutoffs (ex., .05). For a given level of significance such as .05, there is an upper and a lower d value limit. If the computed Durbin-Watson d value for a given series is more than the upper limit, the null hypothesis of no autocorrelation is not rejected and it is assumed that errors are serially uncorrelated. If the computed d value is less than the lower limit, the null hypothesis is rejected and it is assumed that errors are serially correlated. If the computed value is in between the two limits, the result is inconclusive. In SPSS, one can obtain the Durbin-Watson coefficient for a set of residuals by opening the syntax window and running the command FIT RES_1, assuming the residual variable is named RES_1.
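The Durbin-Watson statistic itself is simple to compute from the residual series; the residuals below are hypothetical.

```python
def durbin_watson(residuals):
    """d = sum((e_t - e_{t-1})**2) / sum(e_t**2). Values near 2 suggest no autocorrelation;
    near 0, positive autocorrelation; near 4, negative autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Runs of same-signed residuals (positive autocorrelation) pull d well below 2:
d = durbin_watson([1.0, 1.2, 0.9, -1.1, -0.8, -1.0])
```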
For a graphical test of serial independence, a plot of studentized residuals on the Y axis against the sequence of cases (the caseid variable) on the X axis should show no pattern, indicating independence of errors. In SPSS, select Graphs, Scatter/Dot, Simple Scatter; specify sre_1 (the studentized residual, previously saved with the Save button in the regression dialog) as the Y axis and caseid as the X Axis; OK; double-click the graph to bring up the Chart Editor; select Options, Y Axis Reference Line; click Properties, specify 0 for the Y axis position; click Apply; Close.
In the illustration below, bank data are used to see if credit card debt predicts age. The plot of the ID variable on the X axis against studentized residuals on the Y axis shows no pattern, indicating data are not autocorrelated.
When autocorrelation is present, one may choose to use generalized least-squares (GLS) estimation rather than the usual ordinary least-squares (OLS). In iteration 0 of GLS, the estimated OLS residuals are used to estimate the error covariance matrix. Then in iteration 1, GLS estimation minimizes the sum of squares of the residuals weighted by the inverse of the sample covariance matrix.
As an independent: The regression model makes no distributional assumptions about the independents, which may be discrete variables as long as other regression assumptions are met. The discreteness of ordinal variables is thus not a problem, but do ordinal variables approach intervalness? Ordinal variables must be interpreted with great care when there are known large violations of intervalness, such as where it is known that rankings obscure large gaps between, say, the top three ranks and all the others. In most cases, however, methodologists simply use a rule-of-thumb that there must be a certain minimum number of classes in the ordinal independent (Achen, 1991, argues for at least 5; Berry (1993: 47) states five or fewer is "clearly inappropriate"; others have insisted on 7 or more). However, it must be noted that use of 5-point Likert scales in regression is extremely common in the literature.
As a dependent: Ordinal dependents are more problematic because their discreteness violates the regression assumptions of normal distribution of error with constant variance. A conservative method is to test to see if there are significant differences in the regression equation when computed separately for each value class of the ordinal dependent. If the independents seem to operate equally across each of the ordinal levels of the dependent, then use of an ordinal dependent is considered acceptable. The more liberal and much more common approach is to allow use of ordinal dependents as long as the number of response categories is not very small (at least 5 or 7, see above) and the responses are not highly concentrated in a very small number of response categories.
Three considerations govern which category to leave out. Since the b coefficients for dummy variables will reflect changes in the dependent with respect to the reference group (which is the left-out group), it is best if the reference group is clearly defined. Thus leaving out the "Other" or "Miscellaneous" category is not a good idea since the reference comparisons will be unclear, though leaving out "North" in the example above would be acceptable since the reference is well defined. Second, the left-out reference group should not be one with only a small number of cases, as that will not lead to stable reference comparisons. Third, some researchers prefer to leave out a "middle" category when transforming ordinal categories into dummy variables, feeling that reference comparisons with median groups are better than comparisons with extremes.
Regression coefficients should be assessed for the entire set of dummy variables for an original variable like "Region" (as opposed to separate t-tests for b coefficients as is done for interval variables). For a regression model in which all the independents are dummies for one original ordinal or nominal variable, the test is the F test for R-squared. Otherwise the appropriate test is the F test for the difference of R-squareds for the model with the set of dummies and the model without the set.
F = [(R22 - R12)/(k2 - k1)] / [(1 - R22)/(n - k2 - 1)], where R22 and k2 are the R2 and number of predictors for the model including the dummy set, and R12 and k1 are those for the model without it.
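A worked numeric example of this test, using hypothetical values for a set of three dummies:

```python
# Hypothetical: model with three dummies added (k2 = 5) vs. without (k1 = 2), n = 106
r2_full, r2_reduced = 0.30, 0.25
k2, k1, n = 5, 2, 106
F = ((r2_full - r2_reduced) / (k2 - k1)) / ((1 - r2_full) / (n - k2 - 1))
# F ≈ 2.38, read against the F distribution with 3 and 100 degrees of freedom
```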
There are three methods of coding dummy variables. Coding greatly affects the magnitude and meaning of the b and beta coefficients, but not their significance. Coding does not affect the R-squared for the model or the significance of R-squared, as long as all dummy variables save the reference category are included in the model.
In general, the b coefficients are the distances from the dummy values to the reference value, controlling for other variables in the equation, and the distance from the reference category to the other dummy variables will be the same in a model in which the reference (omitted) categories are switched. Another implication is that the distance from one included dummy value to another included value (ex., from East to West in the example in which North is the omitted reference category) is simply the difference in their b coefficients. Thus if the b coefficient for East is 2.1 and the b coefficient for West is 1.6, then we may say that the effect of East is .5 units more (2.1 - 1.6 = .5) than the West effect, where the effect is still gauged in terms of unit increases in the dependent variable compared to being in the North. For "Region," assuming "North" is the reference category and education level is the dependent, a b of -1.5 for the dummy "South" means that the expected education level for the South is 1.5 years less than the average of "North" respondents.
Some textbooks say the b coefficient for a dummy variable is the difference in means between the two values of the dummy (0,1) variable. This is true only if the variable is a dichotomy. In general, the b coefficient for a given dummy variable is the difference in means between the given dummy variable and omitted reference dummy variable. For dichotomies, there will be only one given dummy variable and the other value will be the omitted reference category and so it is a special case in which the b coefficient is the difference in means between the two values of the dummy variable.
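The dichotomy case can be verified directly: fitting a one-predictor regression on a 0/1 dummy recovers the difference in group means as b and the reference-group mean as the intercept. The education scores below are hypothetical.

```python
def simple_ols(x, y):
    """Slope and intercept for one-predictor OLS."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return b, my - b * mx

# Hypothetical education scores: first three cases are the reference group (dummy = 0)
x = [0, 0, 0, 1, 1, 1]
y = [12, 14, 13, 10, 11, 12]
b, c = simple_ols(x, y)
# b = mean(group 1) - mean(group 0) = 11 - 13 = -2; c = mean(group 0) = 13
```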
In an experimental context, the omitted reference group would ordinarily be the control group.
For this example, a regression coefficient (b) of -1.5 for the South effect variable means that the intercept is reduced by 1.5, meaning that the expected education level for the South is 1.5 years less than the unweighted mean of the expected values for all subgroups. That is, where binary coding interprets b for the dummy category (South) relative to the reference group (the left-out category), effects coding interprets it relative to the entire set of groups. A positive b coefficient for any included group (other than the -1 group, West) means it scored higher on the response variable than the grand mean for all subgroups, or if negative, then lower. A significant b coefficient for any included group means that group is significantly different on the response variable from the grand mean. Under effect coding there is no comparison between the group coded -1 and the grand mean.
To compare the first cluster with the second, the cluster of interest (managers and white-collar) would thus be coded +.5 each (1 divided by the 2 categories in the cluster), and the categories of the reference cluster -.33 each (-1 divided by the 3 categories). Contrast code(s) will sum to zero across all categories. To contrast managers vs. white-collar only, code managers as the category of interest (+1), white-collar as the reference category (-1), and all others as the third cluster (0). The group contrast is the b coefficient times (nint + nref)/(nint * nref), where n is the number of categories in the cluster of interest (int) or the reference cluster (ref).
A significant b coefficient means the variables or clusters of variables being contrasted are significantly different on the response variable. Under contrast coding, the b coefficients do not have a clear interpretation in terms of group means on the response variable.
In SPSS, categorical regression is invoked from the menus by selecting Analyze, Regression, Optimal Scaling; then specifying the dependent variable and independent variable(s). Optionally, one may change the scaling level for each variable. Scaling level choices are nominal, ordinal, or numeric (interval), plus spline nominal and spline ordinal (the spline choices create a smoother but less well-fitting curve).
CATREG output includes frequencies, regression coefficients, the ANOVA table, the iteration history, category quantifications, correlations between untransformed predictors, correlations between transformed predictors, residual plots, and transformation plots. Selecting the Coefficients option gives three tables: a Coefficients table that includes betas, standard error of the betas, t values, and significance; a Coefficients-Optimal Scaling table with the standard error of the betas taking the optimal scaling degrees of freedom into account; and a table with the zero-order, part, and partial correlation, Pratt’s relative importance measure for the transformed predictors, and the tolerance before and after transformation.
CATREG assumes category indicators are positive integers. There is a Discretize button in the categorical regression dialog box to convert fractional-value variables and string variables into positive integers. There may be only one dependent variable and up to 200 predictors (in SPSS 12), and the number of valid cases must exceed the number of predictor variables plus one. Note that CATREG is equivalent to categorical canonical correlation analysis with optimal scaling (OVERALS) with two sets, one of which (the dependent or response set) contains only one variable. Scaling all variables at the numerical level corresponds to standard multiple regression analysis.
Warning: Optimal scaling recodes values on the fly to maximize goodness of fit for the given data. As with any atheoretical, post-hoc data mining procedure, there is a danger of overfitting the model to the given data. Therefore, it is particularly appropriate to employ cross-validation, developing the model for a training dataset and then assessing its generalizability by running the model on a separate validation dataset.
See Fuller (1987), who for a not untypical data set estimated attenuation coefficients of .98 for gender, .88 for education level, and .58 for poverty status. That is, attenuation is a non-trivial problem which can lead to serious underestimation of regression coefficients. The variance of the residuals is the estimate of error variance, assuming all relevant variables are in the equation and all irrelevant variables are omitted.
The method of causal inference through corresponding regressions was subsequently set out by Chambers (1991). Consider a bivariate regression of y on x where it is uncertain whether the causal direction might be the reverse. In corresponding regressions, y is regressed on x, and the absolute values of the deviations (predicted minus actual values of y) are noted as a measure of the extremity of prediction errors. Next the absolute deviations of the x values from the mean of x are taken to give a measure of the extremity of the predictor values. The two columns of deviations are correlated, giving the deviation correlation for y, labeled rde(y). The deviation correlation will be negative, since when predictor values are extreme, errors should be smaller: high values of the predictor lead to high values of the dependent, and low values to low values. The regression is then repeated for the regression of x on y, giving rde(x).
When the real independent variable serves as the predictor, the deviation correlation should be stronger (more negative) than when the real dependent serves as predictor. This is because mid-range predictor values (as measured by low extremity of predictor values) should be associated with mid-range dependent values (as measured by the extremity of errors) only when the true independent is used as the predictor of the true dependent. Chambers' D is rde(y) - rde(x). When the true independent is x and the true dependent is y, D will be negative. That is, only if x is the true independent and y is the true dependent will rde(y) be more negative than rde(x), making D negative after subtraction. If it is not, Chambers recommends assuming no correlation of the two variables (1991: 12).
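The mechanics of rde(y), rde(x), and Chambers' D can be sketched as follows (the data are invented for illustration; whether D actually comes out negative depends on the causal structure of the real data):

```python
import statistics

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) *
           sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def rde(pred, dep):
    """Deviation correlation: correlate the extremity of the predictor
    values, |pred - mean(pred)|, with the absolute errors from the
    bivariate OLS regression of dep on pred."""
    mp, md = statistics.mean(pred), statistics.mean(dep)
    slope = (sum((p - mp) * (d - md) for p, d in zip(pred, dep)) /
             sum((p - mp) ** 2 for p in pred))
    intercept = md - slope * mp
    abs_err = [abs((intercept + slope * p) - d) for p, d in zip(pred, dep)]
    extremity = [abs(p - mp) for p in pred]
    return pearson(extremity, abs_err)

# Illustrative data only: x is constructed here as the cause of y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.2, 15.9]
# rde(y) uses x as predictor; rde(x) uses y as predictor.
D = rde(x, y) - rde(y, x)
print(round(D, 3))
```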
Assumptions of corresponding regressions
Note that corresponding regressions is a controversial method not yet widely accepted and applied in the social science literature.
Note that ANOVA is not interchangeable with regression for two reasons: (1) ANOVA cannot handle continuous variables, as it is a grouped procedure. While continuous variables can be coded into categories, this loses information and attenuates correlation; and (2) ANOVA normally requires approximately equal n's in each group formed by the intersection of the independent variables. Equality of group sizes is equivalent to orthogonality among the independent variables. Regression allows correlation among the IVs (up to a point short of multicollinearity) and thus is more suitable to non-experimental data. Methods exist in ANOVA to adjust for unequal n's, but all are problematic.
F = [(R2_2 - R2_1) / (k_2 - k_1)] / [(1 - R2_2) / (n - k_2 - 1)]
Where
R2_2 = R-square for the second model (ex., one with interactions or with an added independent)
R2_1 = R-square for the first, restricted model (ex., without interactions or without an added independent)
n = total sample size
k_2 = number of predictors in the second model
k_1 = number of predictors in the first, restricted model
F has (k_2 - k_1) and (n - k_2 - 1) degrees of freedom and tests the null hypothesis that the population R2 increment between the two models is zero.
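A quick numeric check of this F formula (the R-square values, predictor counts, and sample size below are hypothetical):

```python
def incremental_f(r2_full, r2_restricted, k2, k1, n):
    """F test for the R-square increment between two nested models."""
    num = (r2_full - r2_restricted) / (k2 - k1)
    den = (1 - r2_full) / (n - k2 - 1)
    return num / den, (k2 - k1, n - k2 - 1)

# Hypothetical case: adding 2 predictors raises R-square from .40 to .45
# in a sample of n = 100.
f, df = incremental_f(0.45, 0.40, k2=5, k1=3, n=100)
print(round(f, 3), df)  # 4.273 (2, 94)
```

The resulting F would then be compared to the critical value of F with 2 and 94 degrees of freedom.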
Instead Iverson proposes a "relative effects model" in which the individual-level measure would be, for this example, the individual's ability score minus the group (team) mean, and the group-level measure would be the group team mean minus the overall mean on ability for all teams. This transformation, which must be warranted by the theory of the model being assessed, usually eliminates or greatly reduces the multicollinearity problem. In the relative effects model, one would then regress performance on the relative individual ability measures, employing a separate regression for each team. The constant is the value of performance when the individual ability is the same as the team mean (not zero, as in the absolute effects model). If the b coefficients vary from team to team, this indicates a group effect.
To investigate the group effect using the single-equation method, one regresses performance on the relative individual, group, and interaction variables, generating coefficients corresponding to the individual, group, and interaction effects. (Iverson also describes a separate-equations method which generates the same estimates, but the single-equation method usually has smaller standard errors.) The standardized coefficients (beta weights) in this regression allow comparison of the relative importance of the individual, group, and interaction effects. This comparison does not suffer from multicollinearity because the relative effects transformations leave the variables with little or no correlation in most cases. Iverson (pp. 64-66) also describes an alternative of partitioning the sums of squares to assess individual vs. group vs. interaction effects.
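The relative effects transformation itself is simple centering, sketched below with hypothetical team ability scores:

```python
from statistics import mean

# Hypothetical ability scores for three teams.
teams = {"A": [3.0, 4.0, 5.0], "B": [6.0, 7.0, 8.0], "C": [1.0, 2.0, 3.0]}

grand_mean = mean(s for scores in teams.values() for s in scores)
team_means = {t: mean(s) for t, s in teams.items()}

# Relative individual measure: individual score minus team mean.
relative_individual = {t: [x - team_means[t] for x in s]
                       for t, s in teams.items()}
# Relative group measure: team mean minus the grand mean.
relative_group = {t: team_means[t] - grand_mean for t in teams}

print(relative_individual["A"])       # [-1.0, 0.0, 1.0]
print(round(relative_group["B"], 2))  # 2.67

# The two measures are exactly uncorrelated across individuals, which is
# why the transformation removes the multicollinearity problem.
cross = sum((x - team_means[t]) * relative_group[t]
            for t, s in teams.items() for x in s)
print(cross)  # 0.0
```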
The models Iverson discusses can be done in SPSS or SAS, but one must compute the relative individual, group, and interaction variables manually. This can become tedious or nearly impossible in large models. Consequently, various packages for contextual analysis have been created, including GENMOD (Population Studies Center, University of Michigan), ML3 (Multilevel Models Project, Institute of Education, University of London), and the most popular, HLM (see Bryk et al., 1988). Iverson briefly mentions these packages but provides no discussion of the steps involved in their use.
The researcher can set the level of exponentiation (including 1 = the linear case), but cubic polynomial fitting is typical. Thus, for simple one-independent variable models, let x0 be the value of x at the bin focal point, and let xi be the value of x at any of i other points within the bin. In the cubic case, the polynomial regression equation would be yi = b1(xi - x0) + b2(xi - x0)^2 + b3(xi - x0)^3 + c.
The span, s, for the bandwidth can also be set by the researcher. One method is simply visual trial and error using various values of s, seeking the smallest s which still generates a smooth curve.
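A much-simplified sketch of the idea (using a local mean within the span's nearest neighbors rather than a fitted local cubic, and with invented data) is:

```python
def local_mean_smooth(xs, ys, span):
    """Simplified local smoother: at each focal point, average the y values
    of the nearest span-fraction of the data points. Full loess would
    instead fit a weighted cubic polynomial within each neighborhood."""
    n = len(xs)
    k = max(2, int(span * n))  # neighborhood size from the span
    fitted = []
    for x0 in xs:
        nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        fitted.append(sum(ys[i] for i in nearest) / k)
    return fitted

xs = [i / 10 for i in range(21)]  # 0.0, 0.1, ..., 2.0
ys = [x * x for x in xs]          # a smooth nonlinear relation
smooth = local_mean_smooth(xs, ys, span=0.2)
print(len(smooth) == len(xs))     # True
```

Decreasing the span makes the curve follow the data (including its noise) more closely; increasing it gives a smoother but potentially worse-fitting curve, which is the trade-off behind the visual trial-and-error method just described.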
A purpose of using quantile regression is to analyze these plots to better understand how outliers or other violations of OLS assumptions affect the distribution of y estimates at different quantile values. Quantile regression is preferred when heteroscedasticity is present. For heteroscedastic data, quantile regression will generate different beta weights for different quantiles. The single OLS beta weight for a given independent variable may be revealed to be a poor average of the beta weights at different quantiles. These beta weights may reveal that the effect, or even the direction of effect, of a given independent variable varies by quantile, giving a more complex and sophisticated understanding of its importance.
Starting with SPSS 17, the R-language SPSS extension QUANTREG implemented quantile regression. See documentation at http://en.wikibooks.org/wiki/Statistics:Numerical_Methods/Quantile_Regression. Documentation states one must install R 2.7 or higher, the R plug-in, the Python plug-in, the R quantreg package, and the SPSSINC QUANTREG extension package. To install an R package, start R and use the Packages>Install Packages menu item.
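The objective that quantile regression minimizes, the "check" (or pinball) loss, is easy to state. The toy grid search below is only a sketch of that objective for one predictor; production implementations such as the R quantreg package solve a linear program instead. All data and grid values here are invented:

```python
def check_loss(u, tau):
    """Check (pinball) loss: tau*u for u >= 0, (tau - 1)*u for u < 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def fit_quantile_line(xs, ys, tau, slopes, intercepts):
    """Pick the (slope, intercept) pair on a grid minimizing total check loss."""
    best = None
    for b in slopes:
        for c in intercepts:
            loss = sum(check_loss(y - (b * x + c), tau)
                       for x, y in zip(xs, ys))
            if best is None or loss < best[0]:
                best = (loss, b, c)
    return best[1], best[2]

# Heteroscedastic toy data: for each x, one point at y = 2x and one at y = 0,
# so the spread of y grows with x.
xs = [float(i) for i in range(1, 11)] * 2
ys = [2.0 * i for i in range(1, 11)] + [0.0] * 10
b_lo, _ = fit_quantile_line(xs, ys, 0.1, [0.0, 0.5, 1.0, 1.5, 2.0], [0.0])
b_hi, _ = fit_quantile_line(xs, ys, 0.9, [0.0, 0.5, 1.0, 1.5, 2.0], [0.0])
print(b_lo, b_hi)  # 0.0 2.0
```

The 10th-percentile slope (0) and the 90th-percentile slope (2) differ sharply, exactly the situation in which the single OLS slope (about 1 here) would be a misleading average.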
For handling nonlinear relationships in a regression context, nonparametric regression is now considered preferable to simply adding polynomial terms to the regression equation (as is done, for instance, in SPSS through Analyze, Regression, Nonlinear menu choice). Nonparametric regression methods allow the data to influence the shape (curve) of a regression line. Note that this means that nonparametric regression is usually an atheoretical method, not involving positing the model in advance, but instead deriving it from the data. Consequently, fitting a curve to noise in the data is a critical concern in nonparametric regression. Nonparametric regression is treated in Fan and Gijbels (1996) and Fox (2000b), who covers local polynomial multiple regression, additive regression models, projection-pursuit regression, regression trees, and GLM nonparametric regression.
Local polynomial multiple regression makes the dependent variable a single nonlinear function of the independent variables. Local regression fits a regression surface not for all the data points as in traditional regression, but for the data points in a "neighborhood." Researchers determine the "smoothing parameter," which is a specified percentage of the sample size, and neighborhoods are the points within the corresponding radius. In the loess method, weighted least squares is used to fit the regression surface for each neighborhood, with data points in a neighborhood weighted in a smooth decreasing inverse function of their distance from the center of the neighborhood. As an alternative to this nearest-neighbor smoothing, one may define bands instead of neighborhood spans, with bandwidths being segments of the range of the independent variable(s). The fitting of surfaces to neighborhoods may be done at a sample of points in predictor space, or at all points. Regardless, the surfaces are then blended together to form the curved line or curved surface characteristic of nonparametric regression. SAS implements local regression starting in Version 8 in its proc loess procedure. As of version 10, SPSS does not directly implement nonparametric regression, though its website does provide a Java applet demo. See Fox (2000b: 8-26).
Problems of local regression. Fox (2000b: 20) refers to "the curse of dimensionality" in local regression, noting that as the number of predictor variables increases, the number of data points in the local neighborhood of a focal point tends to decline rapidly. This means that to obtain a given percentage of data points, the smoothing parameter radius must become less and less local. Other problems of local regression include (1) its post-hoc atheoretical approach to defining the regression curve; (2) the fact that dynamic inference from the b coefficients is no longer possible due to nonlinearity, requiring graphical inference instead; and (3) graphical display becomes difficult to comprehend when more than three independent variables are in the model (Fox, 2000b: 26, recommends coplots as the best display alternative).
Additive regression models allow the dependent variable to be the additive sum of nonlinear functions which are different for each of the independent variables. This means that the dependent variable equals the sum of a series of two-dimensional partial regressions. For the dependent y and each independent x, one can predict adjusted y as a local regression function of x. The adjustment has to control y for other independents in the equation. An iterative method called backfitting simultaneously solves the nonlinear functions for each independent (x) term, and the dependent is the additive sum of these terms. See Fox (2000b: 27-37).
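A bare-bones sketch of backfitting for two predictors, using a nearest-neighbor mean smoother for each term (all data invented; real additive-model software uses better smoothers and convergence checks):

```python
import random

def knn_smooth(x, r, k=3):
    """Nearest-neighbor mean smoother for one partial regression."""
    n = len(x)
    out = []
    for x0 in x:
        nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        out.append(sum(r[i] for i in nearest) / k)
    return out

def backfit(x1, x2, y, iters=10):
    """Backfitting: alternately smooth each term against the partial
    residuals left by the other, fitting y as mean(y) + f1(x1) + f2(x2)."""
    n = len(y)
    ybar = sum(y) / n
    f1 = [0.0] * n
    f2 = [0.0] * n
    for _ in range(iters):
        f1 = knn_smooth(x1, [y[i] - ybar - f2[i] for i in range(n)])
        m1 = sum(f1) / n
        f1 = [v - m1 for v in f1]  # keep each term centered
        f2 = knn_smooth(x2, [y[i] - ybar - f1[i] for i in range(n)])
        m2 = sum(f2) / n
        f2 = [v - m2 for v in f2]
    return [ybar + f1[i] + f2[i] for i in range(n)]

# Invented additive data: y = x1^2 + 3*x2.
random.seed(0)
x1 = [i / 3 for i in range(30)]
x2 = [float(v) for v in random.sample(range(30), 30)]
y = [x1[i] ** 2 + 3 * x2[i] for i in range(30)]
fitted = backfit(x1, x2, y)
rss = sum((y[i] - fitted[i]) ** 2 for i in range(30))
tss = sum((v - sum(y) / 30) ** 2 for v in y)
print(rss < tss)  # True
```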
Note it is also possible to have a semi-parametric regression model in which some of the independent variables have nonparametric functions as described above, while others have conventional regression coefficients. In particular, a semi-parametric model would be appropriate if dummy variable terms were present: dummy variables would be entered as linear terms. Additive models have the same problems of interpretation as local regression.
Projection-pursuit regression first reduces attribute space by creating latent variables which are regression functions of the raw independent variables, then makes the dependent the additive sum of nonlinear functions which are different for each of these latent variables. A purpose of projection-pursuit regression is that by reducing the number of variables in local regression and by making the dependent an additive function of a series of bivariate partial regressions, the "curse of dimensionality" problem mentioned above is mitigated. The price paid, however, is that, as Fox notes, "arbitrary linear combinations of predictors do not usually correspond to substantively meaningful variables" and difficulty in interpreting the resulting nonparametric regression is multiplied. At least with additive regression models, for instance, one can interpret the partial regression coefficient signs as indications of the direction of effect of individual predictor variables. See Fox (2000b: 37-47).
Regression trees employ successive binary divisions of predictor attribute space, making the dependent variable a function of a binning and averaging process. Also called the AID (automatic interaction detection) method, regression trees are classification trees for continuous data. There are several different algorithms for creating regression trees, but they all involve successive partitioning of cases into smaller and smaller bins based on one or more independent variables. A stopping criterion for the partitioning might be when the bins have 10 cases or fewer. For instance, one branch might be: if income < 32564 then if education < 14.2 then job satisfaction = 88.9, where 88.9 is the mean for the cases in that bin. Cutting points for branching the tree are set to minimize classification error as reflected in the residual sum of squares. As algorithms may produce an over-complex tree attuned to noise in the data, the researcher may "prune" the tree, trading off some increase in error to obtain a less complex tree. SPSS supports regression trees in its AnswerTree product. See Fox (2000b: 47-58).
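The core step, choosing a cutting point to minimize the residual sum of squares, can be sketched for a single predictor (the income and satisfaction figures are invented):

```python
def best_split(x, y):
    """Return the cut on x minimizing the residual sum of squares around
    the two resulting bins' means (the criterion described above)."""
    def rss(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    best = None
    for cut in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        total = rss(left) + rss(right)
        if best is None or total < best[0]:
            best = (total, cut)
    return best[1]

# Hypothetical income values predicting job satisfaction, with an
# obvious break between 25000 and 40000.
income = [10000, 15000, 20000, 25000, 40000, 45000, 50000, 55000]
satisf = [60.0, 62.0, 61.0, 63.0, 88.0, 90.0, 89.0, 87.0]
print(best_split(income, satisf))  # 40000
```

A full tree algorithm would apply this search recursively within each bin, over all predictors, until a stopping criterion (such as the minimum bin size mentioned above) is met.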
Problems of regression trees. Use of automated tree algorithms commonly results in overfitting of trees (too much complexity, such that the branching rules seem arbitrary and unrelated to any theory of causation among the variables). This can be compensated in part by developing the tree for one set of data, then cross-validating it on another. Regression trees can be difficult to interpret because small changes in cutting points can have large impacts on branching in the tree. Branching is also affected by data density and sparseness, with more branching and smaller bins in data regions where data points are dense. In general, regression trees are more useful where the purpose is creating decision rules than when the purpose is causal interpretation.
GLM nonparametric regression allows the logit of the dependent variable to be a nonlinear function of the independent variables. While GLM techniques like logistic regression are nonlinear in that they employ a transform (for logistic regression, the natural log of the odds of a dependent variable) which is nonlinear, in traditional form the result of that transform (the logit of the dependent variable) is a linear function of the terms on the right-hand side of the equation. GLM nonparametric regression relaxes the linearity assumption to allow nonlinear relations over and beyond those of the link function (logit) transformation. See Fox (2000b: 58-73).
Methodology
Examples of Use of Regression in Public Administration
Copyright 1998, 2008, 2009, 2010, 2011 by G. David Garson.
Last update: 3/16/2011.