Regression diagnostics for categorical variables

Some people are uneasy about computing correlations between a dichotomous variable and a continuous variable in a regression, for example as input for multicollinearity diagnostics.

When we have a dichotomous variable (or dummy variable) in a simple regression, its correlation with the outcome measure is termed a point-biserial correlation. Rosenthal (1994) shows that this correlation is related both to the F and t statistics and to the difference in group means expressed in units of the pooled group standard deviation.

In particular, for the former two,

$$r_{pb} = \sqrt{\frac{t^2}{t^2 + df}}$$

$$F(1, df) = \frac{df \, r_{pb}^2}{1 - r_{pb}^2}$$

where df is the residual degrees of freedom.
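The two relations above can be checked numerically. The following is a minimal sketch in Python using only the standard library; the data and the helper names (`point_biserial`, `pooled_t`) are made up for illustration and are not taken from Rosenthal (1994). It computes the point-biserial correlation directly as a Pearson correlation with a 0/1 group code, then recovers it from the pooled-variance t statistic.

```python
import math

# Hypothetical scores for two groups coded 0 and 1 (illustrative data only).
group0 = [4.0, 5.0, 6.0, 5.5, 4.5]
group1 = [6.0, 7.5, 7.0, 8.0, 6.5]


def mean(xs):
    return sum(xs) / len(xs)


def point_biserial(g0, g1):
    """Pearson correlation between a 0/1 group code and the outcome."""
    x = [0.0] * len(g0) + [1.0] * len(g1)
    y = list(g0) + list(g1)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pooled_t(g0, g1):
    """Two-sample t statistic with pooled variance, plus its residual df."""
    n0, n1 = len(g0), len(g1)
    m0, m1 = mean(g0), mean(g1)
    ss0 = sum((v - m0) ** 2 for v in g0)
    ss1 = sum((v - m1) ** 2 for v in g1)
    df = n0 + n1 - 2
    pooled_var = (ss0 + ss1) / df
    t = (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))
    return t, df


r_pb = point_biserial(group0, group1)
t, df = pooled_t(group0, group1)

# r(pb) recovered from t (the square root gives the magnitude only).
r_from_t = math.sqrt(t ** 2 / (t ** 2 + df))
# F(1, df) recovered from r(pb); with one numerator df this equals t squared.
f_from_r = df * r_pb ** 2 / (1 - r_pb ** 2)

print("r(pb) direct:", r_pb, " from t:", r_from_t)
print("F from r(pb):", f_from_r, " t squared:", t ** 2)
```

Because the formula involves a square root, it returns the magnitude of the correlation; the sign follows the direction of the group difference.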

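For the more general case of a categorical predictor representing k groups, say, Rsq, the squared semi-partial correlation of the categorical predictor with the outcome, is related to the F value by

F(k-1,df) = [df(Residual)/(k-1)] [Rsq /(1-Rsq)]

This form applies when the categorical predictor is the only predictor in the model. When other predictors are also present, the denominator 1-Rsq is replaced by 1-Rsq(all predictors).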
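As a quick check of this relation in the simplest setting, where group membership is the only predictor, the sketch below compares the F computed from Rsq with the usual one-way ANOVA F ratio. It is written in plain Python, and the three groups of scores are invented for illustration.

```python
# Hypothetical outcome scores for k = 3 groups (illustrative data only).
groups = [
    [4.0, 5.0, 6.0, 5.5],
    [6.0, 7.5, 7.0, 8.0],
    [5.0, 6.5, 6.0, 7.0],
]

k = len(groups)
all_scores = [v for g in groups for v in g]
n = len(all_scores)
grand_mean = sum(all_scores) / n
group_means = [sum(g) / len(g) for g in groups]

# Sums of squares for a one-way layout.
ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
ss_total = sum((v - grand_mean) ** 2 for v in all_scores)

df_resid = n - k

# With group as the only predictor, Rsq is the proportion of variance between groups.
rsq = ss_between / ss_total

# F from Rsq, following F(k-1,df) = [df(Residual)/(k-1)] [Rsq/(1-Rsq)].
f_from_rsq = (df_resid / (k - 1)) * rsq / (1 - rsq)

# F from the usual ratio of mean squares, for comparison.
f_anova = (ss_between / (k - 1)) / (ss_within / df_resid)

print("Rsq:", rsq)
print("F from Rsq:", f_from_rsq, " F from mean squares:", f_anova)
```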

Semi-partial R-squared for group, Rsq(group), is defined as

Rsq(group) = Rsq(all predictors) - Rsq(removing group)
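The definition above amounts to fitting the regression twice, with and without the dummy variables for group, and differencing the two R-squared values. The sketch below does this in plain Python with a small hand-written least-squares solver; the data, the covariate `age`, and the helpers `solve` and `r_squared` are all hypothetical and serve only to illustrate the arithmetic. The final F uses the standard incremental form, with 1-Rsq(all predictors) in the denominator.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(columns, y):
    """R-squared of a least-squares fit of y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [float(col[i]) for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    y_mean = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical data: outcome, one continuous covariate, and a 3-level group factor.
y = [4.0, 5.0, 6.0, 5.5, 6.0, 7.5, 7.0, 8.0, 5.0, 6.5, 6.0, 7.0]
age = [20, 25, 30, 28, 22, 27, 24, 31, 21, 29, 26, 33]
group = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

# Dummy-code the k = 3 groups with group 0 as the reference level.
k = 3
d1 = [1 if g == 1 else 0 for g in group]
d2 = [1 if g == 2 else 0 for g in group]

rsq_all = r_squared([age, d1, d2], y)
rsq_without_group = r_squared([age], y)
rsq_group = rsq_all - rsq_without_group

# Incremental F test for group, on (k-1, df_resid) degrees of freedom.
df_resid = len(y) - 4  # intercept, age, and two dummies
f_group = (df_resid / (k - 1)) * rsq_group / (1 - rsq_all)

print("Rsq(all predictors):", rsq_all)
print("Rsq(removing group):", rsq_without_group)
print("Rsq(group):", rsq_group, " F:", f_group)
```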

Semi-partial R-squareds and F ratios are routinely used as indicators of predictive strength in simple and multiple regressions. Cohen and Cohen (1983), for instance, illustrate semi-partial correlations in a four-predictor multiple regression involving sex.

As an alternative to the above, the stepAIC function in R (in the MASS package) can be used to select the best-fitting model by comparing model Akaike Information Criteria (AICs), as described by Venables and Ripley (2002).
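To give a feel for the kind of comparison involved, the sketch below computes AIC for just two candidate models, one with an intercept only and one with a k-group categorical predictor, and keeps the one with the smaller value. It is not the stepwise search described by Venables and Ripley (2002), only a single comparison; it is written in plain Python, the data are invented, and the helper `gaussian_aic` is a hypothetical name. The AIC is taken up to an additive constant that is the same for every model fitted to the same data, so only differences are meaningful.

```python
import math

# Hypothetical outcome scores for k = 3 groups (illustrative data only).
groups = [
    [4.0, 5.0, 6.0, 5.5],
    [6.0, 7.5, 7.0, 8.0],
    [5.0, 6.5, 6.0, 7.0],
]

k = len(groups)
all_scores = [v for g in groups for v in g]
n = len(all_scores)
grand_mean = sum(all_scores) / n

# Residual sum of squares for the intercept-only model.
rss_null = sum((v - grand_mean) ** 2 for v in all_scores)
# Residual sum of squares when each group is fitted by its own mean.
rss_group = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)


def gaussian_aic(rss, n_obs, n_coef):
    """AIC of a least-squares fit, up to a constant shared by all models on the same data."""
    return n_obs * math.log(rss / n_obs) + 2 * n_coef


aic_null = gaussian_aic(rss_null, n, 1)   # intercept only
aic_group = gaussian_aic(rss_group, n, k)  # intercept plus k-1 dummies

# The model with the smaller AIC is preferred.
best = "group" if aic_group < aic_null else "intercept only"
print("AIC intercept only:", aic_null)
print("AIC with group:", aic_group)
print("preferred model:", best)
```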


Cohen, J. and Cohen, P. (1983) Applied multiple regression/correlation analysis for the behavioral sciences. Second edition. Lawrence Erlbaum: London.

Rosenthal, R. (1994) Parametric measures of effect size. In H. Cooper and L. V. Hedges (Eds) The handbook of research synthesis. New York: Russell Sage Foundation.

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. New York: Springer.