FAQ/missing - CBU statistics Wiki

Revision 24 as of 2013-08-28 12:43:47

How do I handle missing data in SPSS?

Missing values are problematic in multivariate analyses because any case with incomplete information is automatically dropped, which reduces the number of cases available. One simple approach to this problem is to 'fill in' the missing values using variable means (or medians), which is acceptable if only a few values are missing (say 5 to 8% of a sample). This is an example of single imputation, in which each missing value is replaced by one estimate based only on the variable it belongs to. With small amounts of missing data, single imputation is methodologically appropriate because it performs almost as well as more sophisticated imputation techniques (Peyre et al., 2011; Shrive et al., 2006).

The syntax below illustrates how to use macros to replace missing values with variable means in SPSS. It assumes that values are missing completely at random, so that the missing values are unlikely to differ systematically from those that were recorded. Other approaches can also be used (for an overview see here).

There are, however, more complex approaches to handling missing values (namely the EM algorithm, multiple imputation and mixed random effect models), which are detailed here. These approaches have gained popularity and are now available in most statistical packages, including SPSS (see here) and SAS (see here); the latter also details the concepts underlying multiple imputation. In studies with attrition due to loss to follow-up, Dufouil, Brayne and Clayton (2004) weight cases by the inverse of their probability of staying in the study, to account for possible bias from differential drop-out between the groups being compared. This approach is not generally available but can be implemented in Stata.
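As a rough illustration of the multiple imputation route in SPSS, the sketch below assumes that the Missing Values add-on module is installed and that the items are the 50 variables aq1 to aq50 used in the macros further down this page. The dataset name imputedData and the choice of five imputations are illustrative assumptions, not recommendations.

```spss
* Sketch only: requires the SPSS Missing Values add-on module.
* imputedData is a hypothetical dataset name chosen for this example.
DATASET DECLARE imputedData.
MULTIPLE IMPUTATION aq1 TO aq50
  /IMPUTE METHOD=AUTO NIMPUTATIONS=5
  /OUTFILE IMPUTATIONS=imputedData.
```

The imputed data sets should end up stacked in imputedData, distinguished by the variable Imputation_, so that analyses run on that dataset can be pooled across imputations where the procedure supports pooling.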

The missing data can be examined using group analyses, such as non-parametric Mann-Whitney U tests, which compare the subjects with missing values to those with complete data. This checks whether the missing data mechanism is related to other variables in the data set (Tabachnick and Fidell, 2007), i.e. whether the data are missing at random (MAR). If there is no relationship between the data values and the 'missingness' group, one might be inclined to treat the missing values as missing completely at random (MCAR). Tabachnick and Fidell (2007) point out that if less than 5% of values are missing completely at random, almost any procedure for handling missing values yields similar results. Everitt and Dunn (1991) recommend conducting a complete case analysis when there are few missing values and the data are missing completely at random.
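The group comparison described above could be set up along the following lines. This is a minimal sketch, assuming the items are aq1 to aq50 in consecutive columns; anymiss is a new indicator variable introduced here, and age is a hypothetical, fully observed variable standing in for whatever other variables the data set contains.

```spss
* Flag cases with at least one missing item (1) versus complete cases (0).
COMPUTE anymiss = (NMISS(aq1 TO aq50) > 0).
EXECUTE.

* Compare the two groups on another variable; age is hypothetical.
NPAR TESTS
  /M-W= age BY anymiss(0 1).
```

A non-significant result on each of the variables tested would be consistent with treating the data as MCAR, although it does not prove it.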

Below are two SPSS macros: the first identifies complete cases and the second performs one of the simplest imputations, replacing missing values with variable means. Suppose we have 50 variables in consecutive columns named aq1 to aq50. The macro immediately below flags the complete cases and filters out the rest. For larger proportions of missing data (say > 10%) multiple imputation and the EM algorithm are suggested.

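* Start by assuming that every case is complete.
compute ind=1.
exe.

* The macro below sets ind to 0 for any case with a missing value on the chosen items.
define !inmiss ( !pos !tokens(1)
                          / !pos !tokens(1)) .
!do !i=!1 !to !2.
if missing(!concat(aq,!i)) ind=ind*0.
!doend.
!enddefine.

!inmiss 1 50.
exe.

* Keep only the complete cases (ind=1) in subsequent analyses.
USE ALL.
COMPUTE filter_$=(ind=1).
VARIABLE LABEL filter_$ 'ind=1 (FILTER)'.
VALUE LABELS filter_$  0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE .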
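Where the items sit in consecutive columns, the same complete-case indicator can probably be built without a macro. The sketch below uses the NMISS function, which counts the missing values across a list of variables for each case. It is offered as an alternative, not as the approach used above.

```spss
* ind is 1 when none of aq1 to aq50 is missing and 0 otherwise.
COMPUTE ind = (NMISS(aq1 TO aq50) = 0).
EXECUTE.

* The frequency table shows what proportion of cases is incomplete.
FREQUENCIES VARIABLES=ind.
```

The resulting ind variable can then be used in the FILTER commands exactly as before.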

The macro below replaces the missing values with the variable mean, storing the results for aq1 to aq50 in new variables aq1a to aq50a.

* Any rounding of the imputed values is left to a separate step.
define !inmiss ( !pos !tokens(1)
                          / !pos !tokens(1)) .
!do !i=!1 !to !2.
rmv /!concat(aq,!i,a)=smean(!concat(aq,!i)).
!doend.
!enddefine.

!inmiss 1 50.
exe.

Because the items are dichotomous and can take only two values, we could consider rounding the imputed means to the nearest whole number so that they take values that can actually occur. For 50 variables called aq1a to aq50a, the syntax below rounds their imputed values and places the results in variables y1 to y50.

do repeat r=aq1a to aq50a /y = y1 to y50.
compute y=rnd(r).
end repeat.
exe.

References

Dufouil, C., Brayne, C. & Clayton, D. (2004). Analysis of longitudinal studies with death and drop-out: a case study. Statistics in Medicine, 23, 2215-2226.

Everitt, B. S. & Dunn, G. (1991). Applied multivariate data analysis. London: Edward Arnold.

Peyre, H., Leplège, A. & Coste, J. (2011). Missing data methods for dealing with missing items in quality of life questionnaires. A comparison by simulation of personal mean score, full information maximum likelihood, multiple imputation, and hot deck techniques applied to the SF-36 in the French 2003 decennial health survey. Quality of Life Research, 20, 287-300.

Shrive, F. M., Stuart, H., Quan, H. & Ghali, W. A. (2006). Dealing with missing data in a multiquestion depression scale: a comparison of imputation methods. BMC Medical Research Methodology, 6, 57.

Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (Fifth ed.). Boston: Pearson Education, Inc.