Handling missing data

Copied from a webpage here by Karen Grace-Martin (founder of The Analysis Factor), which introduces the options for handling missing data. She runs webinars and courses, uses SPSS and R, and has articles on her site on which you can post comments. Karen Grace-Martin has also co-authored an introductory text on simple analyses with SPSS (Sweet and Grace-Martin, 2012).

If the link above is broken the text is reproduced below:

EM Imputation and Missing Data: Is Mean Imputation Really so Terrible?

I’m sure I don’t need to explain to you all the problems that occur as a result of missing data. Anyone who has dealt with missing data—that means everyone who has ever worked with real data—knows about the loss of power and sample size, and the potential bias in your data that comes with listwise deletion.

Listwise deletion is the default method for dealing with missing data in most statistical software packages. It simply means excluding from the analysis any cases with data missing on any variables involved in the analysis.
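As a rough illustration of the definition above, listwise deletion can be sketched in a few lines of Python. The toy records, the variable names and the helper listwise_delete are all hypothetical and are not taken from any statistics package; None stands in for a missing value.

```python
# Hypothetical toy data: None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "score": 7.1},
    {"age": 29, "income": None, "score": 6.4},
    {"age": None, "income": 41000, "score": 5.9},
    {"age": 45, "income": 67000, "score": 8.0},
]


def listwise_delete(rows, variables):
    """Keep only the cases with no missing value on any analysis variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


# Only cases that are complete on every variable in the analysis survive.
complete = listwise_delete(rows, ["age", "income", "score"])
print(len(rows), "cases before,", len(complete), "cases after")
```

Note that the set of dropped cases depends on which variables are in the analysis: an analysis using only score would keep all four cases here.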

A very simple, and in many ways appealing, method devised to overcome these problems is mean imputation. Once again, I’m sure you’ve heard of it–just plug in the mean for that variable for all the missing values. The nice part is the mean isn’t affected, and you don’t lose that case from the analysis. And it’s so easy! SPSS even has a little button to click to just impute all those means.

But there are new problems. True, the mean doesn’t change, but the relationships with other variables do. And that’s usually what you’re interested in, right? Well, now they’re biased. And while the sample size remains at its full value, the standard error of that variable will be vastly underestimated–and this underestimation gets bigger the more missing data there are. Too-small standard errors lead to too-small p-values, so now you’re reporting results that should not be there.
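The effect described above can be checked on simulated data. The sketch below is only an illustration under assumed values (500 cases, a linear relationship between two made-up variables x and y, 30% of y removed at random); the pearson helper is written out by hand so that nothing beyond the Python standard library is needed.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(0)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.8 * xi + random.gauss(0, 0.6) for xi in x]

# Remove 30% of the y values completely at random (an assumed rate).
missing = set(random.sample(range(n), k=150))
observed_x = [xi for i, xi in enumerate(x) if i not in missing]
observed_y = [yi for i, yi in enumerate(y) if i not in missing]

# Mean imputation: every missing y is replaced by the observed mean of y.
mean_y = statistics.fmean(observed_y)
imputed_y = [mean_y if i in missing else yi for i, yi in enumerate(y)]

sd_observed = statistics.stdev(observed_y)
sd_imputed = statistics.stdev(imputed_y)
r_observed = pearson(observed_x, observed_y)
r_imputed = pearson(x, imputed_y)

print("mean of y:", mean_y, "->", statistics.fmean(imputed_y))
print("sd of y:  ", sd_observed, "->", sd_imputed)
print("r(x, y):  ", r_observed, "->", r_imputed)
```

The mean is unchanged, but the imputed values add no spread, so the standard deviation (and with it the standard error) shrinks, and the correlation with x is pulled toward zero.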

There are other options. Multiple Imputation and Maximum Likelihood both solve these problems. But Multiple Imputation is not available in all the major stats packages, and it is very labor-intensive to do well. Maximum Likelihood isn’t hard or labor-intensive, but it requires structural equation modeling software, such as AMOS or Mplus.

The good news is there are other imputation techniques that are still quite simple, and don’t cause bias in some situations. And sometimes (although rarely) it really is okay to use mean imputation. When?

If your rate of missing data is very, very small, it honestly doesn’t matter what technique you use. I’m talking very, very, very small (2-3%).

There is another, better method for imputing single values, however, that is only slightly more difficult than mean imputation. It uses the E-M Algorithm, which stands for Expectation-Maximization. It is an iterative procedure that uses other variables to impute a value (Expectation), then checks whether that is the most likely value (Maximization). If not, it re-imputes a more likely value. This goes on until it reaches the most likely value.
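To make the iteration more concrete, here is a minimal sketch of EM for the simplest possible case: two variables assumed to be bivariate normal, where x is fully observed and y has some missing values. This is an assumption-laden teaching example, not the routine used by SPSS, SAS, Stata or any R package, and the function name em_impute and the simulated data are hypothetical. The Expectation step fills in the expected contribution of each missing y given x and the current estimates; the Maximization step re-estimates the mean, variance and covariance from those filled-in totals.

```python
import random


def em_impute(x, y, tol=1e-10, max_iter=1000):
    """EM for a bivariate normal: x is complete, y uses None for missing."""
    n = len(x)
    obs = [i for i in range(n) if y[i] is not None]
    mis = [i for i in range(n) if y[i] is None]

    # x is complete, so its mean and variance never need updating.
    mu_x = sum(x) / n
    s_xx = sum((v - mu_x) ** 2 for v in x) / n

    # Starting values: observed-case moments for y, no x-y relationship yet.
    mu_y = sum(y[i] for i in obs) / len(obs)
    s_yy = sum((y[i] - mu_y) ** 2 for i in obs) / len(obs)
    s_xy = 0.0

    # Contributions of the observed cases are the same in every iteration.
    obs_y = sum(y[i] for i in obs)
    obs_yy = sum(y[i] ** 2 for i in obs)
    obs_xy = sum(x[i] * y[i] for i in obs)

    for _ in range(max_iter):
        beta = s_xy / s_xx                # current slope of y on x
        resid_var = s_yy - beta * s_xy    # current residual variance of y given x

        # Expectation: expected y, y squared and x*y for each missing case.
        sum_y, sum_yy, sum_xy = obs_y, obs_yy, obs_xy
        for i in mis:
            ey = mu_y + beta * (x[i] - mu_x)
            sum_y += ey
            sum_yy += ey ** 2 + resid_var
            sum_xy += x[i] * ey

        # Maximization: re-estimate the parameters from the expected totals.
        new_mu_y = sum_y / n
        new_s_yy = sum_yy / n - new_mu_y ** 2
        new_s_xy = sum_xy / n - mu_x * new_mu_y

        change = max(abs(new_mu_y - mu_y), abs(new_s_yy - s_yy), abs(new_s_xy - s_xy))
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break

    # Single imputation: fill each missing y with its conditional expectation.
    beta = s_xy / s_xx
    filled = [y[i] if y[i] is not None else mu_y + beta * (x[i] - mu_x) for i in range(n)]
    return filled, {"mu_y": mu_y, "s_yy": s_yy, "s_xy": s_xy, "beta": beta}


# Simulated example: 300 cases, 75 of the y values removed at random.
random.seed(1)
x = [random.gauss(0, 1) for _ in range(300)]
y_full = [2.0 + 0.8 * xi + random.gauss(0, 0.5) for xi in x]
drop = set(random.sample(range(300), k=75))
y = [None if i in drop else yi for i, yi in enumerate(y_full)]

imputed, params = em_impute(x, y)
print(params)
```

Because the imputed values are drawn from the relationship with x rather than from a single constant, the covariance between x and y is carried into the filled-in data, which is the property that mean imputation lacks.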

EM imputations are better than mean imputations because they preserve the relationship with other variables, which is vital if you go on to use something like Factor Analysis or Regression. They still underestimate standard error, however, so once again, this approach is only reasonable if the percentage of missing data is very small (under 5%) and if the standard error of individual items is not vital (as when combining individual items into an index).

The heavy hitters like Multiple Imputation and Maximum Likelihood are still superior methods of dealing with missing data and are in most situations the only viable approach. But you need to fit the right tool to the size of the problem. It may be true that backhoes are better at digging holes than trowels, but trowels are just right for digging small holes. It’s better to use a small tool like EM when it fits than to ignore the problem altogether.

EM Imputation is available in SAS, Stata, R, and the SPSS Missing Values Analysis module.

Karen goes on to mention when listwise deletion is appropriate.

When Listwise Deletion works for Missing Data

You may have never heard of listwise deletion for missing data, but you’ve probably used it.

Listwise deletion means that any individual in a data set is deleted from an analysis if they’re missing data on any variable in the analysis. It’s the default in most software packages.

Although the simplicity of it is a major advantage, it causes big problems in many missing data situations. But not always. If you happen to have one of the uncommon missing data situations in which listwise deletion doesn’t cause problems, it’s a reasonable solution.

You hear a lot about its problems because most data sets don’t fit two conditions that must hold for listwise deletion to work well.

So let’s talk about those two conditions and what the problems are when they’re not met.

When Listwise Deletion Works

1. The Data are Missing Completely at Random

When the incomplete cases that are dropped differ from the complete cases still in the sample, then the carefully selected random sample is no longer reflective of the entire population. You’ve now got a biased sample and biased results. That’s not good.

You can’t trust those results to be reflective of the population.

But sometimes the cases with missing data are no different than the complete cases—they are a purely random subset of the data. This is called Missing Completely at Random (MCAR).

If this holds, there won’t be any bias in analyses based on complete cases.
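A small simulation can illustrate the difference between the two situations. The numbers below (a made-up variable with mean 50 and standard deviation 10, a 30% random loss, and a 60% loss among values above 55) are assumptions chosen only for the example.

```python
import random

random.seed(42)
population = [random.gauss(50, 10) for _ in range(5000)]

# MCAR: every case has the same 30% chance of being missing.
mcar_kept = [v for v in population if random.random() > 0.3]

# Not MCAR: values above 55 go missing 60% of the time, the rest never do.
nonrandom_kept = [v for v in population if not (v > 55 and random.random() < 0.6)]


def mean(values):
    return sum(values) / len(values)


full_mean = mean(population)
mcar_bias = mean(mcar_kept) - full_mean
nonrandom_bias = mean(nonrandom_kept) - full_mean

print("bias of complete-case mean under MCAR:      ", mcar_bias)
print("bias of complete-case mean when not MCAR:   ", nonrandom_bias)
```

Under MCAR the complete cases are just a smaller random sample, so their mean stays close to the full-sample mean. When high values are more likely to be missing, the complete-case mean is pulled downward no matter how large the sample is.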

2. You have sufficient power anyway, even though you lost part of your data set

Dropping more than a few cases from a data set can have dramatic consequences for sample size. Since statistical power is directly tied to sample size, losing one results in losing the other.

But listwise deletion doesn’t always drop so many cases that power is adversely affected. If the percentage of missing data is very small or you had an overly large sample to begin with, you may still have adequate power to detect meaningful effects.

There is one caveat here though. It’s possible to have only a small percentage of observations missing overall, yet still lose a large part of the sample to listwise deletion. This is the situation that’s most problematic for listwise deletion.

This happens when an analysis includes many variables, and each is missing for a few unique cases. Say you have a data set with 200 observations and use 10 variables in a regression model. If each variable is missing on the same 10 cases, you end up with 190 complete cases, 5% missing. Not bad.

But if you have a different 10 cases missing on each variable, you will lose 100 cases (10 cases by 10 variables). With only 5% missing data, you end up with 100 complete cases, 50% missing. Not so good.
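The arithmetic in this example can be reproduced directly. The sketch below encodes the two scenarios as sets of case indices; the helper name complete_cases is hypothetical.

```python
n_cases, n_vars = 200, 10


def complete_cases(missing_by_var, n_cases):
    """Cases left after dropping anyone missing on at least one variable."""
    dropped = set().union(*missing_by_var)
    return n_cases - len(dropped)


# Scenario 1: the same 10 cases are missing on every variable.
same = [set(range(10)) for _ in range(n_vars)]

# Scenario 2: a different 10 cases are missing on each variable.
different = [set(range(v * 10, v * 10 + 10)) for v in range(n_vars)]

# Share of individual data values that are missing (identical in both scenarios).
cell_rate = sum(len(m) for m in different) / (n_cases * n_vars)

print(complete_cases(same, n_cases))       # 190 complete cases
print(complete_cases(different, n_cases))  # 100 complete cases
print(cell_rate)                           # 0.05, i.e. 5% of values missing
```

Both scenarios have 100 missing values out of 2,000, so the overall missing rate is 5% in each; what differs is how those values are spread across cases.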

How to Tell if Listwise Deletion is Reasonable

Before you just assume that listwise deletion is an adequate approach, it is important to establish that these two conditions are met.

Spend some time doing missing data diagnosis to understand patterns and randomness of missingness. Like testing assumptions in linear models, there isn’t one definitive test to tell you if assumptions are met for listwise deletion. It’s more an exercise in gathering evidence that assumptions aren’t clearly violated.
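A first pass at this kind of diagnosis might look like the sketch below. The toy records and variable names are hypothetical, and the comparison of means is only one informal piece of evidence, not a formal test of randomness.

```python
from collections import Counter
from statistics import fmean

# Hypothetical toy data: None marks a missing value.
rows = [
    {"age": 25, "income": 30000, "score": 6.0},
    {"age": 31, "income": 42000, "score": 6.5},
    {"age": 38, "income": None, "score": 7.0},
    {"age": 44, "income": 58000, "score": None},
    {"age": 52, "income": None, "score": 7.5},
    {"age": 59, "income": None, "score": 8.0},
    {"age": 27, "income": 35000, "score": 5.5},
    {"age": 63, "income": None, "score": 8.5},
]
variables = ["age", "income", "score"]

# 1. How much is missing on each variable?
missing_rate = {v: sum(r[v] is None for r in rows) / len(rows) for v in variables}

# 2. Which combinations of missing variables occur, and how often?
#    Each pattern is a tuple of True (missing) / False (observed) per variable.
patterns = Counter(tuple(r[v] is None for v in variables) for r in rows)


# 3. Do cases missing on one variable differ on another, observed variable?
def mean_by_missingness(rows, target, indicator):
    """Mean of target for cases observed vs. missing on indicator."""
    obs = [r[target] for r in rows if r[indicator] is not None and r[target] is not None]
    mis = [r[target] for r in rows if r[indicator] is None and r[target] is not None]
    return (fmean(obs) if obs else None, fmean(mis) if mis else None)


age_when_income_observed, age_when_income_missing = mean_by_missingness(rows, "age", "income")

print(missing_rate)
print(patterns)
print(age_when_income_observed, age_when_income_missing)
```

In this made-up data the cases with missing income are noticeably older than the cases with observed income, which would count as evidence against the data being Missing Completely at Random.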

And if one or the other of these conditions is clearly violated, there are now other good ways to deal with missing data, including maximum likelihood and multiple imputation approaches.

Reference

Sweet, SA and Grace-Martin, K (2012) Data Analysis with SPSS: A First Course in Applied Statistics. Fourth Edition. Pearson: London.

MRC CBU Wiki
