## Inflated standard errors in logistic regression

Problems interpreting logistic regression estimates can be caused by having too good a fit!

Albert and Anderson (1984) observed that when you have perfect or near perfect prediction the logistic regression estimates and standard errors are undefined. To see why, suppose we are interested in comparing patients with controls on a pass/fail criterion. All the patients fail and all the controls pass. Now, the regression estimates in logistic regression represent (log) odds ratios, for example the ratio of the odds of passing rather than failing in the controls to the same odds in the patients, as given in the equation below.

OR = (A*B)/(C*D) where A=number of controls who pass, B=number of patients who fail, C=number of controls who fail and D=number of patients who pass.

The odds ratio (and by implication the regression estimate) is undefined in this example because no patients pass and no controls fail, so the denominator C*D is zero. The OR equation shows that the odds ratio is also undefined if only one of these holds, that is, if either no controls fail or no patients pass. Let's call this scenario a case of *perfect prediction*. For another illustration see here.
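As a minimal sketch of the point above, the OR equation can be written as a small Python function. The function name `odds_ratio` and the cell counts are made up for illustration; returning `None` for an undefined odds ratio is simply a convention chosen here.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (A*B)/(C*D) for a 2x2 table.

    a: controls who pass, b: patients who fail,
    c: controls who fail, d: patients who pass.
    Returns None when the ratio is undefined (zero denominator).
    """
    denominator = c * d
    if denominator == 0:
        return None  # no controls fail and/or no patients pass
    return (a * b) / denominator


# Illustrative counts (hypothetical data, 20 people per group).
print(odds_ratio(18, 17, 2, 3))   # some overlap between groups: defined
print(odds_ratio(20, 20, 0, 0))   # perfect prediction: undefined
print(odds_ratio(20, 18, 0, 2))   # no controls fail: also undefined
```

A single empty cell in the denominator is enough to make the ratio undefined, which is why near perfect prediction causes the same difficulty as perfect prediction.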

In particular, the *Wald* chi-square statistic reported by SPSS, which is the square of the ratio of the regression estimate to its standard error, should not be used in these perfect fit cases. The standard error becomes so inflated that the statistic grossly underestimates the effect of the predictor variables.
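The behaviour of the Wald statistic can be sketched for the patients versus controls example. The code below assumes a single binary predictor, for which the standard error of the log odds ratio is the usual large-sample formula sqrt(1/A + 1/B + 1/C + 1/D). The fractional off-diagonal counts are purely an illustrative device for approaching perfect prediction, not real data.

```python
import math


def wald_chi_square(a, b, c, d):
    """Wald chi-square for the log odds ratio of a 2x2 table.

    a: controls who pass, b: patients who fail,
    c: controls who fail, d: patients who pass.
    All four cells must be greater than zero.
    """
    log_or = math.log((a * b) / (c * d))
    # Large-sample standard error of the log odds ratio.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (log_or / se) ** 2


# Shrink the off-diagonal cells towards zero (20 people per group).
for eps in (1.0, 0.1, 0.01):
    wald = wald_chi_square(20 - eps, 20 - eps, eps, eps)
    print(f"off-diagonal count {eps}: Wald chi-square = {wald:.2f}")
```

As the groups become more cleanly separated the log odds ratio grows, but its standard error grows faster, so the Wald chi-square falls rather than rises.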

Instead, twice the difference in log likelihoods should be used to assess the influence of a predictor variable (Rindskopf, 2002). Collett (1991) and Field (2013) also recommend the likelihood ratio chi-square over the Wald chi-square, particularly when the data are sparse, because the likelihood ratio statistic, unlike the Wald statistic, is still well approximated by a chi-square distribution. The likelihood ratio statistic is obtained as follows: fit the model with and without the predictor(s) of interest and compare the term called –2 log Likelihood in the model summary box. The difference between the two values is distributed as chi-square on p degrees of freedom, where p is the number of variables dropped from the model. The p-value can be obtained using functions under transform:compute and can also be obtained in SPSS by fitting predictors in *blocks*. An example of this approach using R is here.
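The differencing step can be sketched in Python using only the standard library. The two –2 log likelihood values below are hypothetical, and `chi2_sf` is a helper written for this illustration (it handles whole-number degrees of freedom only), not an SPSS function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    z = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))   # exact result for df = 1
    else:
        a, q = 1.0, math.exp(-z)              # exact result for df = 2
    # Step up two degrees of freedom at a time using the standard
    # recurrence for the regularised upper incomplete gamma function.
    for _ in range((df - 1) // 2):
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(neg2ll_without, neg2ll_with, p):
    """Chi-square and p-value for dropping p predictors from the model."""
    chi_sq = neg2ll_without - neg2ll_with
    return chi_sq, chi2_sf(chi_sq, p)


# Hypothetical -2 log likelihoods read off two model summary boxes.
chi_sq, p_value = likelihood_ratio_test(55.45, 40.12, 1)
print(f"chi-square = {chi_sq:.2f} on 1 df, p = {p_value:.5f}")
```

The model without the predictor always has the larger –2 log likelihood, so the difference is never negative.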

The chi-square obtained from differencing log likelihoods is more reliable because, unlike the Wald statistic, it does not depend on the regression estimates and their standard errors, which are not estimable under perfect prediction because they are unbounded. Instead it uses fitted probabilities, which are always bounded between zero and 1, to measure changes in model fit due to adding and removing predictors. In particular, when we have perfect prediction these probabilities tend to zero and 1. For example, in our earlier scenario, the probability of a pass for a control is 1 and the probability of failure for a control is zero. This procedure does not, however, measure the association (odds ratio) between group (controls or patients) and pass rate.
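A small sketch shows why the log likelihood stays usable when the estimates do not. It assumes the earlier scenario with 20 controls who all pass and 20 patients who all fail (hypothetical numbers). It follows one convenient path through the parameter space, with intercept -b and group coefficient 2b, so that every person's observed outcome has fitted probability 1/(1 + exp(-b)).

```python
import math


def log_likelihood(b, n_controls=20, n_patients=20):
    """Log likelihood under perfect prediction.

    The model is logit P(pass) = -b + 2*b*group, with group = 1 for
    controls and 0 for patients. Each observed outcome then has
    probability sigmoid(b).
    """
    log_p_correct = -math.log1p(math.exp(-b))  # log of sigmoid(b)
    return (n_controls + n_patients) * log_p_correct


for b in (1, 5, 10, 20):
    p_pass_control = 1 / (1 + math.exp(-b))
    print(
        f"group coefficient {2 * b:>2}: "
        f"P(pass | control) = {p_pass_control:.6f}, "
        f"-2 log likelihood = {-2 * log_likelihood(b):.6f}"
    )
```

The coefficient can be increased without limit and the fit keeps improving, so no finite maximum likelihood estimate exists. The fitted probabilities and the –2 log likelihood nevertheless converge, to 1 and 0 respectively, and this is the quantity the likelihood ratio test works with.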

It is also possible to output an *exact* p-value (Mehta and Patel, 1995) for a test of a model predictor in logistic regression. This procedure produces an estimated odds ratio even when one cannot be estimated by more traditional likelihood methods because of the occurrence of zero frequencies. A procedure for producing exact odds ratios and exact p-values is available using the LOGISTIC procedure in SAS. Rindskopf (2002), however, suggests these exact odds ratios do not always give good predictions.

References

Albert A. and Anderson J.A. (1984). On the existence of maximum likelihood estimates in logistic regression models. *Biometrika* **71**, 1-10.

Collett D. (1991). Modelling binary data. Chapman and Hall:London.

Field A. (2013). Discovering statistics using IBM SPSS Statistics. Fourth Edition. Sage:London.

Hosmer D.W. and Lemeshow S. (2000). Applied logistic regression. 2nd Edition. Wiley:New York pp. 135-142.

Mehta C.R. and Patel N.R. (1995). Exact logistic regression: theory and examples. *Statistics in Medicine* **14**, 2143-2160.

Rindskopf D. (2002). Infinite parameter estimates in logistic regression: opportunities, not problems. *Journal of Educational and Behavioral Statistics* **27(2)** 147-161.