FAQ/missing

How do I handle missing data in SPSS?

Missing values are problematic in multivariate analyses because cases with any incomplete information are automatically dropped, which reduces the number of cases available. One simple approach to this problem is to 'fill in' the missing values using variable means (or medians). This is acceptable if you only have a few missing values, say no more than 5% of a sample (Tabachnick and Fidell, 2007, p.63 and also here), although Peng et al. (2006) suggest mean imputation is permissible under the more liberal condition that no more than 10-20% of the data is missing. You can also assess how sensitive the results are to the choice of imputed values by replacing missing values with subject minima and maxima. These choices are examples of single imputation, in which each missing value is 'filled in' once using only the variable it belongs to. Single imputation is methodologically defensible when little data is missing because it then performs almost as well as more sophisticated imputation techniques (Peyre et al., 2011; Shrive et al., 2006 and here).
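The sensitivity check described above can be sketched outside SPSS in a few lines of Python. This is only an illustration under stated assumptions: the scores are made up, None marks a missing value, and the minimum and maximum are taken over the observed values of the variable, which is one possible reading of 'minima and maxima'.

```python
from statistics import mean, median

def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if v is None else v for v in values]

# Hypothetical scores for one variable; None marks a missing value.
scores = [4, 7, None, 5, 9, 6, None, 8, 5, 6]
observed = [v for v in scores if v is not None]

# Candidate single-imputation fill-in values (assumption: extremes are
# taken over the observed values of this variable).
fills = {
    "mean": mean(observed),
    "median": median(observed),
    "minimum": min(observed),
    "maximum": max(observed),
}

# Re-compute the summary of interest (here the sample mean) under each choice.
results = {name: mean(impute(scores, fill)) for name, fill in fills.items()}

for name, value in results.items():
    print(f"{name:8s} fill -> sample mean {value:.2f}")
```

If the summary changes little between the minimum and maximum fills, the conclusions are unlikely to depend much on how the missing values are filled in.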

The macros given further below illustrate how to replace missing values with variable means in SPSS. They assume the values are missing completely at random, so that the missing values are not likely to differ from those that were recorded. Shrive, Stuart, Quan and Ghali (2006) report simulations suggesting that within-subject item means can be used to impute missing data. There are other approaches (for an overview see here) which assume data is missing at random, i.e. the reason for the missingness is associated with some of the other observed variables.

Missing data can be examined using group analyses, such as non-parametric Mann-Whitney U tests, which compare the subjects with missing values to those with complete data to check whether the missing data mechanism is related to other variables in the data set (Tabachnick and Fidell, 2007), i.e. whether the data are missing at random (MAR). If there is no relationship between data values and the 'missingness' group one might be inclined to treat the missing values as missing completely at random (MCAR). Tabachnick and Fidell (2007) point out that if less than 5% of values are missing completely at random, almost any procedure for handling missing values yields similar results. Everitt and Dunn (1991) recommend conducting a complete case analysis where there are few missing values and the data are missing completely at random. Pigott (2001, p.362) agrees, saying that when a data set has only a few missing observations the assumption of MCAR data is more likely to apply, implying that there is a greater chance of the complete cases representing the population.
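As a rough sketch of the group comparison described above, the Python below splits a second variable (age) by whether a score is missing and compares the two groups with a Mann-Whitney U test. The data are hypothetical, and the p-value uses the large-sample normal approximation without tie or continuity corrections, so a statistics package should be preferred for real analyses with small samples.

```python
import math

def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U for group x, two-sided p) using the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    return u1, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical (age, score) records; None marks a missing score.
records = [(23, 5), (31, None), (27, 6), (45, None), (29, 4),
           (52, None), (24, 7), (38, 5), (48, None), (26, 6)]
age_missing = [age for age, score in records if score is None]
age_observed = [age for age, score in records if score is not None]

u, p = mann_whitney_u(age_missing, age_observed)
print(f"U = {u}, approximate two-sided p = {p:.3f}")
```

A small p-value here would suggest that missingness on the score is related to age, which argues against treating the data as MCAR.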

More complex approaches (namely the EM algorithm, multiple imputation and mixed random effect models) are needed if the missingness is related to the observed variables, i.e. the data are missing at random (see here). These approaches have gained popularity, and the EM algorithm and mixed effect models are now available in most statistical packages. These include SPSS (see here); MPLUS, the confirmatory factor analysis software, which uses maximum likelihood analysis based upon covariance matrices, with regression and factor analysis models, to handle missing values; and SAS (see here). The SAS link also details the concepts underlying the more involved multiple imputation, which combines results from the same analysis run on several data sets in which the missing values are replaced by different estimates. SPSS can perform analyses on the different data sets but (as of version 22) does not compute estimates pooled across them. Shin, Davison and Long (2017) suggest maximum likelihood approaches, such as those used in the EM algorithm and random effect models, are less biased than multiple imputation in handling missing data. See also here for a discussion of the options.

Nan Laird mentions that the EM algorithm, based upon summary measures from incomplete data, may be used to estimate mixed model parameters (Laird and Hirschland, 2021).

Various methods of pooling the estimates obtained from multiple imputation samples have been suggested. Raghunathan and Dong (2011) give a simple approach for combining mean squares from analyses of variance, and Van Ginkel and Kroonenberg (2014) present SPSS macros for doing the same thing. Thom Baguley illustrates here how to perform multiple imputation in R. A pdf downloaded from Thom's website is given here. Alternatively, where data are lost to follow-up, Dufouil, Brayne and Clayton (2004) weight cases by the inverse of their probability of staying in the study, to account for possible bias from differential dropout between the groups being compared. This approach is not generally available but can be implemented in STATA.
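For a single parameter estimate (for example a regression coefficient), the usual way of pooling across imputed data sets is Rubin's rules. The Python sketch below illustrates those rules with made-up numbers; it is not the mean-square method of Raghunathan and Dong (2011) nor the Van Ginkel and Kroonenberg macros, and the function name is hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets using Rubin's rules.

    estimates: the parameter estimate from each imputed data set.
    variances: the squared standard error of that estimate in each data set.
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = mean(estimates)        # average of the m estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from m = 5 imputed data sets.
estimates = [2.0, 2.4, 2.2, 1.8, 2.1]
variances = [0.25, 0.30, 0.28, 0.26, 0.27]

pooled, total = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled:.3f}, pooled SE = {total ** 0.5:.3f}")
```

The between-imputation term is what separates multiple from single imputation: it carries the extra uncertainty due to not knowing the missing values.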

Two macros are given further below for performing one of the simplest imputations, replacing missing values with variable means, in SPSS. Suppose we have 50 variables labelled aq1 to aq50 in consecutive columns. The first macro identifies the complete cases. Schafer & Graham (2002) propose a role for person mean substitution (averaging the available items) if multiple imputation is not feasible, and show that using complete cases introduces no bias if the missingness is not due to the values of any other variables (Missing Completely at Random). For larger proportions of missing data (say > 10%) multiple imputation and the EM algorithm are suggested. It should be noted, however, that the EM approach only performs single imputation. Chakraborty and Gu (2009) find that random effect mixed models perform well relative to procedures using multiple imputation.
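Person mean substitution, as mentioned above, fills a respondent's missing items with the mean of the items that respondent did answer. A minimal Python sketch follows, assuming None marks a missing item; the function name and the min_answered threshold are hypothetical and do not come from Schafer & Graham (2002).

```python
def person_mean_impute(items, min_answered=1):
    """Replace a respondent's missing items with the mean of their answered items.

    items: one respondent's item scores, with None for a missing item.
    min_answered: hypothetical threshold; if fewer items were answered,
    the respondent is returned unchanged.
    """
    answered = [v for v in items if v is not None]
    if len(answered) < min_answered:
        return list(items)
    person_mean = sum(answered) / len(answered)
    return [person_mean if v is None else v for v in items]

# Hypothetical dichotomous item responses for three respondents.
respondents = [
    [1, 0, None, 1],
    [None, None, 0, 1],
    [1, 1, 1, 1],
]

imputed = [person_mean_impute(row) for row in respondents]
for row in imputed:
    print(row)
```

Unlike variable-mean imputation, each respondent gets their own fill-in value, so the imputed items reflect that person's general level of responding.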

[A nice summary from Jeremy Miles is below]

The missing data procedures are of two forms: full information maximum likelihood (FIML) or multiple imputation (MI). Multilevel models are inherently FIML too. Jeremy mentions he has written a recent paper about missing data procedures in the Journal of Criminal Psychology - See Miles and Hunt (2015).

Don't keep only complete cases, that's a really bad idea.

If a variable is a predictor only, FIML doesn't really help, and you need to go down the MI route (the MI route is about 100 times easier than it used to be).

Sometimes people use inverse probability weighting, where you weight people so that the later waves match the earlier waves on the variables. That's probably a pain, and unless you have a representative sample to start with (hey! We're psychologists! we never have representative samples!) not worth the effort.

There is a fairly nice book by Paul Allison (2003) entitled "Missing Data", it's one of the Sage little green books.

Jakobsen et al. (2017) suggest using mixed models when interactions are of interest, using complete cases if no more than 5% of the data is missing, and reporting only the results for observed data if large amounts of data are missing (e.g. 40% or more). Scheffer (2002) suggests complete cases can be used if no more than 6% of the data is missing, single imputation if no more than 10% of the data is missing, and more complex procedures such as multiple imputation if between 10% and 25% of the data is missing.
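The thresholds attributed to Scheffer (2002) above can be written as a simple lookup. The Python sketch below encodes only the cut-offs as summarised in this FAQ; the function name and the wording of the returned labels are hypothetical, and nothing is implied for more than 25% missing.

```python
def scheffer_suggestion(pct_missing):
    """Map a percentage of missing data to the approach summarised above."""
    if not 0 <= pct_missing <= 100:
        raise ValueError("pct_missing must be between 0 and 100")
    if pct_missing <= 6:
        return "complete cases"
    if pct_missing <= 10:
        return "single imputation"
    if pct_missing <= 25:
        return "multiple imputation or similar"
    return "outside the range covered by the rule"

for pct in (3, 8, 20, 40):
    print(f"{pct}% missing -> {scheffer_suggestion(pct)}")
```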

* Flag complete cases: ind stays at 1 unless one of aq1 to aq50 is missing.
compute ind=1.
exe.

define !inmiss ( !pos !tokens(1)
                          / !pos !tokens(1)) .
!do !i=!1 !to !2.
if missing(!concat(aq,!i)) ind=ind*0.
!doend.
!enddefine.

!inmiss 1 50.
exe.
* Filter so that only complete cases (ind=1) are used in later analyses.
USE ALL.
COMPUTE filter_$=(ind=1).
VARIABLE LABEL filter_$ 'ind=1 (FILTER)'.
VALUE LABELS filter_$  0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE .

The macro below replaces the missing values with the variable mean, stores the results in new variables aq1a to aq50a and rounds them to the nearest whole number.

define !inmiss ( !pos !tokens(1)
                          / !pos !tokens(1)) .
!do !i=!1 !to !2.
rmv /!concat(aq,!i,a)=smean(!concat(aq,!i)).
compute !concat(aq,!i,a) = rnd(!concat(aq,!i,a)).
!doend.
!enddefine.

!inmiss 1 50.
exe.

As the items are dichotomous, and hence can only take two values, we could consider rounding the imputed means so that they take values that can actually occur. For 50 variables called aq1a to aq50a the syntax below rounds their imputed values to the nearest whole number and places the results in variables y1 to y50.

do repeat r=aq1a to aq50a /y = y1 to y50.
compute y=rnd(r).
end repeat.
exe.

The optimal number of multiple imputations to use was examined by Bodner (2008), who relied on simulations, and White et al. (2011), who analytically derived an approximation to the Monte Carlo error of the p-value. Despite their different approaches, both sources agreed on the following simplified rule of thumb: the number of imputations should be similar to the percentage of cases that are incomplete.
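The rule of thumb above can be turned into a short calculation. The Python sketch below counts the cases with at least one missing value and rounds the percentage up to a whole number of imputations; the data, the function name and the rounding up are assumptions made for illustration.

```python
import math

def suggested_imputations(rows):
    """Number of imputations taken as the percentage of incomplete cases, rounded up."""
    incomplete = sum(1 for row in rows if any(v is None for v in row))
    pct_incomplete = 100 * incomplete / len(rows)
    return math.ceil(pct_incomplete)

# Hypothetical data set: 20 cases with 3 items each, 6 of them incomplete.
complete_rows = [[1, 0, 1] for _ in range(14)]
incomplete_rows = [[1, None, 0] for _ in range(6)]
data = complete_rows + incomplete_rows

n_imputations = suggested_imputations(data)
print(f"{n_imputations} imputations suggested")
```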

References

Allison, P. D. (2003). Missing Data. Sage:London.

Bodner, T. E. (2008). What improves with increased missing data imputations? Structural Equation Modeling: A Multidisciplinary Journal 15 651-675.

Chakraborty, H. & Gu, H. (2009). A mixed model approach for intent-to-treat analysis in longitudinal clinical trials with missing values (Report No. MR-0009-0903). Research Triangle Park, NC: RTI Press. DOI: 10.3768/rtipress.2009.mr.0009.0903

Dufouil, C., Brayne, C. & Clayton, D. (2004). Analysis of longitudinal studies with death and drop-out: a case study. Statistics in Medicine, 23, 2215-2226.

Enders, C. K. (2010) Applied missing data analysis. Guilford Press: New York. Features macros for handling missing data.

Everitt, B. S. & Dunn, G. (1991). Applied multivariate data analysis. London:Edward Arnold.

Jakobsen, J. C., Gluud, C., Wetterslev, J. and Winkel, P. (2017). When and how should multiple imputation be used for handling missing data in randomised clinical trials - a practical guide with flowcharts. BMC Medical Research Methodology 17:162.

Laird N and Hirschland E (2021) From the Apollo programme to the EM algorithm and beyond. Significance 18(4) 34-37.

Miles, J. N. V. and Hunt, P. (2015). A practical introduction to methods for analyzing longitudinal data in the presence of missing data using a marijuana price survey. Journal of Criminal Psychology 5(2), 137-148

Peng, C. Y. J., Harwell, M., Liou, S. M. & Ehman, L. H. (2006). Advances in missing data methods and implications for educational research. In Real data analysis (ed S. Sawilowsky) 31-78. Information Age: Greenwich, CT.

Pigott, T. D. (2001). A review of methods for missing data. Educational Research and Evaluation 7(4) 353-383.

Peyre, H., Leplège, A. & Coste, J. (2011). Missing data methods for dealing with missing items in quality of life questionnaires. A comparison by simulation of personal mean score, full information maximum likelihood, multiple imputation, and hot deck techniques applied to the SF-36 in the French 2003 decennial health survey. Quality of Life Research 20 287-300.

Raghunathan, T. E. and Dong, Q. (2011). Analysis of Variance from Multiply Imputed Data sets, Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, Michigan. (Unpublished research report - see here.).

Schafer, J. L. and Graham, J. W. (2002). Missing Data: Our View of the State of the Art. Psychological Methods 7(2) 147-177.

Scheffer, J. (2002). Dealing with missing data. Res. Lett. Math. Sci. 3 153-160.

Shin, T., Davison, M.L. and Long, J.D. (2017). Maximum likelihood versus multiple imputation for missing data in small longitudinal samples with nonnormality. Psychological Methods 22(3) 426-449.

Shrive, F. M., Stuart, H., Quan, H. & Ghali, W. A. (2006). Dealing with missing data in a multiquestion depression scale: a comparison of imputation methods. BMC Medical Research Methodology 6 57.

Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (Fifth ed.). Boston: Pearson Education, Inc.

Van Ginkel, J.R. & Kroonenberg, P.M. (2014). Analysis of variance of multiply imputed data. Multivariate Behavioral Research, 49, 78-91. SPSS macros for combining analyses of variance results are available from Joost Van Ginkel's homepage located here. doi: 10.1080/00273171.2013.855890

White, I. R., Royston P. and Wood A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine 30 377-399.