FAQ/missing - CBU statistics Wiki

Revision 35 as of 2015-01-08 09:57:55


How do I handle missing data in SPSS?

Missing values are problematic in multivariate analyses because cases with any incomplete information are automatically dropped, which reduces the number of cases available. One simple approach to this problem is to 'fill in' the missing values using variable means (or medians). This is acceptable if only a few values are missing: Tabachnick and Fidell (2007, p. 63) suggest around 5% of a sample, although Peng et al. (2006) suggest mean imputation is permissible provided no more than a more liberal 10-20% of the data are missing. You can also gauge the variability in the results by replacing missing values with subject minima and maxima to see how sensitive the results are to the choice of imputed values. These choices are examples of single imputation, in which just one variable is used to 'fill in' its own missing values. Single imputation is methodologically appropriate here because, with small amounts of missing data, it performs almost as well as more sophisticated imputation techniques (Peyre et al., 2011; Shrive et al., 2006; and here).
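
As a minimal sketch of the sensitivity check just described, the syntax below fills the missing values of one item first with that item's observed minimum and then with its maximum, so that an analysis can be rerun on both versions. It assumes the minima and maxima are taken per variable across subjects, which is one reading of the text. The variable names aq1, aq1_min, aq1_max, aq1_lo and aq1_hi are placeholders.

```spss
* Sketch only: aq1 and the new variable names are placeholders.
* Add the observed minimum and maximum of aq1 to every case.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=
  /aq1_min=MIN(aq1)
  /aq1_max=MAX(aq1).

* Copy aq1 and fill its missing values with the minimum.
COMPUTE aq1_lo = aq1.
IF MISSING(aq1) aq1_lo = aq1_min.

* Copy aq1 and fill its missing values with the maximum.
COMPUTE aq1_hi = aq1.
IF MISSING(aq1) aq1_hi = aq1_max.
EXECUTE.
```

If the conclusions are the same whether aq1_lo or aq1_hi is used, the results are not very sensitive to how the missing values are filled in.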

The syntax further below illustrates how to use macros to replace missing values with variable means in SPSS. It assumes the values are missing completely at random (MCAR), so the missing values are unlikely to differ from those that were recorded. Other approaches (for an overview see here) assume the data are missing at random (MAR), i.e. the reason for the missingness is associated with some of the other observed variables.

The missing data can be examined using group comparisons, such as non-parametric Mann-Whitney U tests, which compare the subjects with missing values to those with complete data to check whether the missing data mechanism is related to other variables in the data set (Tabachnick and Fidell, 2007), i.e. whether the data are missing at random (MAR). If there is no relationship between the data values and the 'missingness' group, one might be inclined to treat the missing values as missing completely at random (MCAR). Tabachnick and Fidell (2007) point out that if less than 5% of values are missing completely at random, almost any procedure for handling missing values yields similar results. Everitt and Dunn (1991) recommend a complete case analysis when there are few missing values and the data are missing completely at random.
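
A minimal sketch of such a check in SPSS is given below. It creates a 'missingness' indicator for one item and compares the two resulting groups on another variable with a Mann-Whitney U test. The names aq1, age and miss_aq1 are placeholders chosen for illustration.

```spss
* Sketch only: aq1, age and miss_aq1 are placeholder names.
* miss_aq1 is 1 when aq1 is missing and 0 otherwise.
COMPUTE miss_aq1 = MISSING(aq1).
EXECUTE.

* Compare age between cases with and without a missing aq1.
NPAR TESTS
  /M-W= age BY miss_aq1(0 1).
```

A significant difference would suggest that the missingness is related to age, which argues against treating the data as MCAR.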

More complex approaches (namely the EM algorithm, multiple imputation and mixed random effects models) are needed if the missingness is related to the observed variables, i.e. the data are missing at random (see here). These approaches have gained popularity, and the EM algorithm and mixed effects models are now available in most statistical packages. These include SPSS (see here); MPLUS, the confirmatory factor analysis software, which handles missing values using maximum likelihood analysis based upon covariance matrices with regression and factor analysis models; and SAS (see here). The SAS reference also details the concepts underlying the more involved multiple imputation, which combines results from the same analysis run on several data sets in which the missing values are replaced by different estimates. SPSS can perform analyses on the different data sets but (as of version 22) does not compute estimates pooled across them.
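
For a single parameter, the pooling step in multiple imputation is commonly done with Rubin's rules. These are not named on this page and are given here only as a hedged reminder of the standard formulas. Suppose there are m imputed data sets, with \hat{Q}_j the estimate from data set j and U_j its squared standard error:

```latex
\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j
  \qquad \text{(pooled estimate)}

\bar{U} = \frac{1}{m}\sum_{j=1}^{m} U_j
  \qquad \text{(within-imputation variance)}

B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{Q}_j - \bar{Q}\right)^2
  \qquad \text{(between-imputation variance)}

T = \bar{U} + \left(1 + \frac{1}{m}\right) B
  \qquad \text{(total variance of } \bar{Q}\text{)}
```

The pooled standard error is the square root of T. The between-imputation term B is what reflects the extra uncertainty due to the missing values.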

Various methods of pooling the estimates obtained from multiply imputed samples have been suggested. Raghunathan and Dong (2011) give a simple approach for combining mean squares from analyses of variance, and Van Ginkel and Kroonenberg (2014) present SPSS macros for doing the same thing. Alternatively, Dufouil, Brayne and Clayton (2004) handle attrition due to loss to follow-up by weighting cases by the inverse of their probability of staying in the study, to account for possible bias from differential dropout between the groups being compared. This approach is not generally available but can be implemented in STATA.
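
The inverse probability weighting idea can be written compactly. What follows is a generic sketch, not the specific estimator used by Dufouil et al. Here \hat{p}_i is an estimated probability that subject i stays in the study (for example from a logistic regression on baseline variables, which is an assumption made for illustration), and the sums run over the subjects who stayed:

```latex
w_i = \frac{1}{\hat{p}_i},
\qquad
\bar{y}_w = \frac{\sum_i w_i \, y_i}{\sum_i w_i}
```

For example, a subject with \hat{p}_i = 0.5 receives weight w_i = 2 and so stands in for one similar subject who dropped out.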

Below are two SPSS macros: the first identifies complete cases and the second performs one of the simplest imputations, replacing missing values with variable means. Suppose we have 50 variables, named aq1 to aq50, in consecutive columns. The first macro flags the complete cases (ind=1) so that the incomplete cases can be filtered out. For larger proportions of missing data (say > 10%) multiple imputation or the EM algorithm is suggested instead.

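* Start by flagging every case as complete (ind=1).
compute ind=1.
exe.

* Macro: set ind to 0 for any case with a missing value in the listed items.
define !inmiss ( !pos !tokens(1)
                          / !pos !tokens(1)) .
!do !i=!1 !to !2.
if missing(!concat(aq,!i)) ind=0.
!doend.
!enddefine.

!inmiss 1 50.
exe.

* Keep only the complete cases (ind=1) in subsequent analyses.
USE ALL.
COMPUTE filter_$=(ind=1).
VARIABLE LABEL filter_$ 'ind=1 (FILTER)'.
VALUE LABELS filter_$  0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE .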
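
The same complete-case indicator can also be obtained without a macro. The sketch below uses the NMISS function, which counts the missing values among its arguments, and assumes that aq1 to aq50 are adjacent in the data file so that the TO keyword picks up exactly these 50 items.

```spss
* Sketch only: assumes aq1 to aq50 are adjacent in the file.
* ind is 1 for cases with no missing items and 0 otherwise.
COMPUTE ind = (NMISS(aq1 TO aq50) = 0).
EXECUTE.
```

The macro version has the advantage that it does not depend on the order of the variables in the file.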

The macro below replaces the missing values in aq1 to aq50 with the variable mean and stores the results in new variables aq1a to aq50a. Rounding of the imputed values is a separate, optional step described further below.

define !inmiss ( !pos !tokens(1)
                          / !pos !tokens(1)) .
!do !i=!1 !to !2.
rmv /!concat(aq,!i,a)=smean(!concat(aq,!i)).
!doend.
!enddefine.

!inmiss 1 50.
exe.
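
As a quick check that the imputation did what was intended, the sketch below counts the missing values per case before and after imputation. The names nmiss_before and nmiss_after are placeholders, and the syntax assumes that aq1 to aq50 and aq1a to aq50a are each adjacent in the data file.

```spss
* Sketch only: nmiss_before and nmiss_after are placeholder names.
* Count missing items per case in the original and imputed variables.
COUNT nmiss_before = aq1 TO aq50 (MISSING).
COUNT nmiss_after = aq1a TO aq50a (MISSING).
EXECUTE.

* Summarise the two counts across cases.
DESCRIPTIVES VARIABLES=nmiss_before nmiss_after
  /STATISTICS=MEAN MIN MAX.
```

After mean imputation the maximum of nmiss_after would be expected to be 0, unless an item has no valid values at all.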

Because the items are dichotomous, and hence can only take two values, we could consider rounding the imputed means to the nearest integer so that they take values that can actually occur. For the 50 variables aq1a to aq50a, the syntax below rounds the imputed values and places the results in variables y1 to y50.

do repeat r=aq1a to aq50a /y = y1 to y50.
compute y=rnd(r).
end repeat.
exe.

References

Dufouil, C., Brayne, C. & Clayton, D. (2004). Analysis of longitudinal studies with death and drop-out: a case study. Statistics in Medicine, 23, 2215-2226.

Everitt, B. S. & Dunn, G. (1991). Applied multivariate data analysis. London: Edward Arnold.

Peng, C. Y. J., Harwell, M., Liou, S. M. & Ehman, L. H. (2006). Advances in missing data methods and implications for educational research. In S. Sawilowsky (Ed.), Real data analysis (pp. 31-78). Greenwich, CT: Information Age.

Peyre, H., Leplège, A. & Coste, J. (2011). Missing data methods for dealing with missing items in quality of life questionnaires. A comparison by simulation of personal mean score, full information maximum likelihood, multiple imputation, and hot deck techniques applied to the SF-36 in the French 2003 decennial health survey. Quality of Life Research, 20, 287-300.

Raghunathan, T. E. & Dong, Q. (2011). Analysis of variance from multiply imputed data sets. Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, Michigan. (Unpublished research report - see here.)

Shrive, F. M., Stuart, H., Quan, H. & Ghali, W. A. (2006). Dealing with missing data in a multiquestion depression scale: a comparison of imputation methods. BMC Medical Research Methodology, 6, 57.

Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (Fifth ed.). Boston: Pearson Education, Inc.

Van Ginkel, J. R. & Kroonenberg, P. M. (2014). Analysis of variance of multiply imputed data. Multivariate Behavioral Research, 49, 78-91. doi: 10.1080/00273171.2013.855890. SPSS macros for combining analysis of variance results are available from Joost Van Ginkel's homepage, located here.