# The EM algorithm and mixed (random effects) model approaches to missing values

Multivariate procedures usually use only complete cases, with an accompanying loss of power. There are two ways to address this: estimating the missing values from the existing data (e.g. using the variable means) or using random effects models.

Howell gives a comprehensive and accessible overview and illustration of all these techniques here. It is well worth reading to get a feel for the issues involved and how they can be addressed. In particular he mentions *Multiple Imputation*, which averages regression estimates and obtains a pooled standard error across between 3 and 5 randomly imputed versions of the data. This procedure is available in **SPSS 17 and above**, where existing SPSS procedures can be used to obtain estimates pooled over the five data sets generated. You can see worked examples using *multiple imputation in SPSS 17* here.

Another form of imputation (random/stochastic multiple regression imputation) fills in missing values using estimates from a series of multiple regressions, each with a random error term, and is recommended when up to 10% of the data are missing (Scheffer, 2002). This can be carried out using this spreadsheet. The technique is also available in SPSS through the REGRESSION subcommand of MVA with the ADDTYPE=NORMAL option (as in the syntax below), which saves the 'filled-in' data to a user-specified data file (tester.sav) containing Y and X with estimated values replacing the missing values.

MVA VARIABLES=Y X /REGRESSION(TOLERANCE=0.001 FLIMIT=4.0 ADDTYPE=NORMAL OUTFILE='C:\tester.sav').
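
The idea behind this kind of imputation can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (a single fully observed predictor X, missing values only on Y, and normally distributed errors); it is not the SPSS implementation, and the function name and data below are hypothetical.

```python
import random
import statistics


def stochastic_regression_impute(x, y, seed=None):
    """Replace missing y values (None) with a regression prediction plus random noise."""
    rng = random.Random(seed)

    # Fit y = intercept + slope * x by least squares, using complete cases only.
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mean_x = statistics.fmean([xi for xi, _ in complete])
    mean_y = statistics.fmean([yi for _, yi in complete])
    sxx = sum((xi - mean_x) ** 2 for xi, _ in complete)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in complete)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual standard deviation (n - 2 degrees of freedom).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in complete)
    resid_sd = (sse / (len(complete) - 2)) ** 0.5

    # Each missing y gets its predicted value plus a random normal error term.
    filled = []
    for xi, yi in zip(x, y):
        if yi is None:
            yi = intercept + slope * xi + rng.gauss(0.0, resid_sd)
        filled.append(yi)
    return filled


# Hypothetical data: None marks a missing value on Y.
x_hyp = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_hyp = [2.1, 3.9, None, 8.2, 9.8, None]
filled = stochastic_regression_impute(x_hyp, y_hyp, seed=1)
```

Because of the random error term, a different seed gives a different filled-in data set, which is the behaviour the text describes.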

Howell, in particular, suggests that a better way to estimate missing values on a variable is to use a more sophisticated approach than variable means, namely the EM algorithm. The EM algorithm uses the variable means and covariances to estimate the missing values and is available under Analyze > Missing Value Analysis from version 13 of SPSS, using PROC MI and PROC MIANALYZE in SAS, or in stand-alone freeware (NORM) which can be downloaded from here. **CBSU users: Don't use SPSS 13 to do this as it appears only to estimate missing values for a subset of the variables! (SPSS 16 works OK!)**

The EM option in SPSS can also be carried out using Graham and Hofer's (1993) EMCOV23, which can be downloaded from here. The EM algorithm produces a 'filled in' (or imputed) data set for each specified variable, in which the original missing values are replaced by values estimated from the original data. The analysis can then be carried out using this filled-in data set (see some examples here). Note that each missing data estimate, in addition to using parameter estimates based on the original data, also adds in a random error term to account for sampling variability, so we get different estimates of the missing values each time we perform the estimation. This approach assumes that the missingness on a variable is related only to the values of other variables present in the data set; such values are said to be *missing at random*.
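
To give a feel for how EM uses means and covariances, here is a minimal Python sketch for the simplest possible case: two variables, X fully observed and Y partly missing, both assumed to be jointly normal. It is not the algorithm as implemented in SPSS, NORM or EMCOV; the function name and data are hypothetical, and the fill-in step shown uses the conditional mean only, without the random error term mentioned above.

```python
import statistics


def em_bivariate(x, y, n_iter=200, tol=1e-10):
    """EM estimates of the mean and (co)variances of y when some y values are None."""
    n = len(x)
    observed = [yi for yi in y if yi is not None]

    # X is complete, so its mean and variance are estimated once.
    mu_x = statistics.fmean(x)
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n

    # Starting values taken from the observed Y values.
    mu_y = statistics.fmean(observed)
    s_yy = sum((yi - mu_y) ** 2 for yi in observed) / len(observed)
    s_xy = 0.0

    for _ in range(n_iter):
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy

        # E-step: expected y and y squared for each case given current estimates.
        ey, ey2 = [], []
        for xi, yi in zip(x, y):
            if yi is None:
                m = mu_y + beta * (xi - mu_x)
                ey.append(m)
                ey2.append(m * m + resid_var)
            else:
                ey.append(yi)
                ey2.append(yi * yi)

        # M-step: update the mean, covariance and variance from the expectations.
        new_mu_y = sum(ey) / n
        new_s_xy = sum(xi * e for xi, e in zip(x, ey)) / n - mu_x * new_mu_y
        new_s_yy = sum(ey2) / n - new_mu_y ** 2

        change = abs(new_mu_y - mu_y) + abs(new_s_xy - s_xy) + abs(new_s_yy - s_yy)
        mu_y, s_xy, s_yy = new_mu_y, new_s_xy, new_s_yy
        if change < tol:
            break

    # Fill in each missing y with its conditional mean given x.
    beta = s_xy / s_xx
    filled = [mu_y + beta * (xi - mu_x) if yi is None else yi for xi, yi in zip(x, y)]
    params = {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_xy": s_xy, "s_yy": s_yy}
    return params, filled


# Hypothetical data: None marks a missing value on Y.
x_hyp = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_hyp = [2.0, 4.1, None, 8.3, 9.9, None]
params, filled = em_bivariate(x_hyp, y_hyp)
```

In this special case the converged slope `s_xy / s_xx` matches the ordinary regression slope from the complete cases, which gives a simple check on the sketch.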

To account for sampling variability, Howell points out that multiple imputations are required. In practice this means that multiple 'filled in' data sets (typically 3 to 5) should be analysed to assess the consistency of the results across missing value estimates. Howell illustrates using the downloadable NORM software to obtain an overall result for multiple regression coefficients, pooling over 5 imputed data sets, and suggests that a similar pooled approach can be used for other estimates provided they have standard errors. However, he prefers using random effects models for missing values in analysis of variance but is not sure how to combine results from these. Part of the problem is that effects which do not have a single degree of freedom will be represented by more than one model estimate.
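
Pooling a single estimate over several imputed data sets is usually done with the combining rules attributed to Rubin: the pooled estimate is the mean of the separate estimates, and the pooled variance is the average squared standard error plus (1 + 1/m) times the between-imputation variance of the estimates. The Python sketch below illustrates those rules; the function name and the numbers are hypothetical and are not taken from Howell's example.

```python
import math
import statistics


def pool_estimates(estimates, std_errors):
    """Combine one estimate and its standard error across m imputed data sets."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)

    # Within-imputation variance: the average of the squared standard errors.
    within = statistics.fmean([se ** 2 for se in std_errors])

    # Between-imputation variance: sample variance of the m estimates.
    between = statistics.variance(estimates)

    # Total variance adds a correction for using a finite number of imputations.
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical regression coefficient and standard error from 3 imputed data sets.
pooled_b, pooled_se = pool_estimates([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The pooled standard error is larger than any of the individual standard errors whenever the estimates differ across imputations, reflecting the extra uncertainty due to the missing values.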

He does note in a further example that the F tests on each of three imputed data sets from a repeated measures analysis of variance are very similar. PROC MIANALYZE in SAS also combines results from multiple imputations. There is no such facility for combining results in SPSS (up to version 16 at least), but Van Ginkel (2008) illustrates ways of combining regression results in SPSS 19.0 (see the SPSS syntax files and example data in the files called mi.zip, mi2.zip, mi-mul.zip and mi-mul2.zip, downloadable from here).

Van Ginkel and Van der Ark (2005) have SPSS syntax and example data for use with multiple imputation in questionnaire designs (labelled as tw.zip, tw-fl.zip and tw-ss.zip), which can be downloaded from the above website or from here.

Random effects models, unlike the standard 'fixed effects' analysis of variance, use all cases irrespective of whether they contain missing values and therefore have a unique solution. These are available in most statistical packages, such as SPSS (MIXED), SAS (MIXED) and R (lme, in the nlme package). They are particularly useful for analysis of variance where it is wished to generalise results from the factors considered.

In the unusual situation where missingness is due to the impossibility of an event occurring (e.g. asking a person about their sibling's occupation when they have no siblings or are not 'in touch' with them), a simpler dummy variable adjustment procedure (Cohen and Cohen, 2003) may suffice (Allison, 2001). This procedure simply uses the variable mean to fill in the missing value and then includes an indicator variable as a covariate in the analysis, taking the value 0 except where a missing value occurs, where it takes the value 1.
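
The two steps of the dummy adjustment can be sketched as follows in Python. The function name and data are hypothetical; the filled-in variable and the indicator would then both be entered as covariates in whatever analysis is being run.

```python
import statistics


def dummy_adjust(values):
    """Mean-fill missing values (None) and return a 0/1 missingness indicator."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)

    # Missing values are replaced by the mean of the observed values.
    filled = [mean if v is None else v for v in values]

    # The indicator is 1 where the original value was missing and 0 otherwise.
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator


# Hypothetical scores: None marks a value that could not exist for that person.
scores = [2.0, None, 4.0, None, 6.0]
filled, indicator = dummy_adjust(scores)
```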

References

Allison P (2001) Missing Data. (Volume 136 in the Series: Quantitative applications in the social sciences). Sage:London.

Cohen, J and Cohen, P (2003) Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Lawrence Erlbaum:London.

Graham, JW (2009) Missing Data Analysis: Making It Work in the Real World *Annual Review of Psychology* **60** pp. 549-576. A paper found to be very useful for explaining practical issues and implementation associated with missing values. This paper is also available for downloading from here.

Graham JW and Hofer SM (1993) EMCOV.EXE Users Guide. Department of Biobehavioral Health, Pennsylvania State University; University Park, PA. unpublished manuscript.

Howell, D.C. (2008) The analysis of missing data. In Outhwaite, W. & Turner, S. Handbook of Social Science Methodology. London: Sage.

Little, R.J.A. & Rubin, D.B. (1987) Statistical analysis with missing data. New York, Wiley. This is a very comprehensive account of missing value analysis and is the 'bible' of missing value texts.

Schafer, J.L. and Olsen, M.K. (1998) Multiple imputation for multivariate missing-data problems: A data analyst’s perspective. *Multivariate Behavioral Research*, **33** 545–571.

Scheffer, J. (2002) Dealing with missing data. *Res. Lett. Inf. Math. Sci*, **3** 153-160. A pdf is here.

Van Ginkel, J.R. & Van der Ark, L.A. (2005) SPSS syntax for missing value imputation in test and questionnaire data. *Applied Psychological Measurement*, **29** 152-153.

Van Ginkel, J.R. (2008). SPSS Syntax for Applying Rules for Combining Univariate Estimates in Multiple Imputation [computer software]. Retrieved: February 5, 2010, http://www.uvt.nl/mto/software2.html