An introduction to modern missing data analyses

https://doi.org/10.1016/j.jsp.2009.10.001

Abstract

A great deal of recent methodological research has focused on two modern missing data analysis methods: maximum likelihood and multiple imputation. These approaches have advantages over traditional techniques (e.g., deletion and mean imputation) because they require less stringent assumptions and mitigate the pitfalls of those techniques. This article explains the theoretical underpinnings of missing data analyses, gives an overview of traditional missing data techniques, and provides accessible descriptions of maximum likelihood and multiple imputation. In particular, this article focuses on maximum likelihood estimation and presents two analysis examples from the Longitudinal Study of American Youth data. One of these examples includes a description of the use of auxiliary variables. Finally, the paper illustrates ways that researchers can use intentional, or planned, missing data to enhance their research designs.

Introduction

Missing data are ubiquitous in quantitative research studies, and school psychology research is certainly not immune to the problem. Because of its pervasive nature, some methodologists have described missing data as “one of the most important statistical and design problems in research” (methodologist William Shadish, quoted in Azar, 2002, p. 70). Despite the important nature of the problem, substantive researchers routinely employ old standby techniques that have been criticized in the methodological literature. For example, excluding cases with missing data is a strategy that is firmly entrenched in statistical software packages and is exceedingly common in disciplines such as psychology and education (Peugh & Enders, 2004). This practice is at odds with a report by the American Psychological Association Task Force on Statistical Inference (Wilkinson & American Psychological Association Task Force on Statistical Inference, 1999, p. 598) that stated that deletion methods “are among the worst methods available for practical applications.”

It is not a surprise that substantive researchers routinely employ missing data handling techniques that methodologists criticize. For one, software packages make these approaches very convenient to implement. The fact that software programs offer outdated procedures as default options is problematic because the presence of such routines implicitly sends the wrong message to applied researchers. In some sense, the technical nature of the missing data literature is a substantial barrier to the widespread adoption of sophisticated missing data handling options. While many of the flawed missing data techniques (e.g., excluding cases, replacing missing values with the mean) are easy to understand, newer approaches are considerably more difficult to grasp. The primary purpose of this article is to give a user-friendly introduction to these modern missing data methods.

A great deal of recent methodological research has focused on two “state of the art” missing data methods (Schafer & Graham, 2002): maximum likelihood and multiple imputation. Accordingly, the majority of this paper is devoted to these techniques. Quoted in the American Psychological Association's Monitor on Psychology, Stephen G. West, former Editor of Psychological Methods, stated that, “Routine implementation of these new methods of addressing missing data will be one of the major changes in research over the next decade” (Azar, 2002). Although applications of maximum likelihood and multiple imputation are appearing with greater frequency in published research articles, a substantial gap still exists between the procedures that the methodological literature recommends and those that are actually used in applied research studies (Bodner, 2006; Peugh & Enders, 2004; Wood et al., 2004). Consequently, the overarching purpose of this manuscript is to provide an overview of maximum likelihood estimation and multiple imputation, with the hope that researchers in the field of school psychology will employ these methods in their own research. More specifically, this paper will explain the theoretical underpinnings of missing data analyses, give an overview of traditional missing data techniques, and provide accessible descriptions of maximum likelihood and multiple imputation. In particular, we focus on maximum likelihood estimation and present two analysis examples from the Longitudinal Study of American Youth (LSAY). Finally, the paper illustrates ways that researchers can use intentional missing data to enhance their research designs.

Section snippets

Theoretical background: Rubin's missing data mechanisms

Before we can begin discussing different missing data handling options, it is important to have a solid understanding of so-called “missing data mechanisms”. Rubin (1976) and colleagues (Little & Rubin, 2002) came up with the classification system that is in use today: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). These mechanisms describe relationships between measured variables and the probability of missing data. While these terms have a
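The three mechanisms can be made concrete with a small simulation. The following is a minimal sketch, not taken from the article: the variable names, the correlation of 0.6, and the 30% missingness rate are all illustrative assumptions. It generates two correlated variables and builds one missingness indicator for y under each mechanism: unrelated to the data (MCAR), dependent only on the observed x (MAR), and dependent on y itself (MNAR).

```python
import numpy as np

def simulate_mechanisms(n=10_000, rho=0.6, seed=0):
    """Simulate bivariate data and one missingness mask for y per mechanism.

    All settings (n, rho, 30% missingness) are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

    masks = {
        # MCAR: missingness is unrelated to x and to y.
        "MCAR": rng.random(n) < 0.3,
        # MAR: missingness on y depends only on the fully observed x.
        "MAR": x < np.quantile(x, 0.3),
        # MNAR: missingness on y depends on the unseen values of y itself.
        "MNAR": y < np.quantile(y, 0.3),
    }
    return x, y, masks

x, y, masks = simulate_mechanisms()

# Mean of y among the cases that remain observed under each mechanism.
observed_means = {name: float(y[~mask].mean()) for name, mask in masks.items()}
for name, mean in observed_means.items():
    print(f"{name}: mean of observed y = {mean:.3f}")
```

In this sketch the mean of the observed y values stays near the full-sample mean only under MCAR. Under MAR and MNAR the observed cases are a selected subsample, which is why the choice of missing data handling method matters.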

An overview of traditional missing data techniques

Traditionally, researchers have employed a wide variety of techniques to deal with missing values. The most common of these techniques include deletion and single imputation approaches (Peugh & Enders, 2004). The goal of this section is to provide an overview of some of these common traditional missing data techniques and to illustrate the shortcomings of these procedures. To illustrate the bias that can result from the use of traditional missing data methods, we use the artificial math
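The shortcomings of deletion and mean imputation can be sketched with simulated data. The example below does not use the article's artificial data set, which is not shown in this excerpt. It assumes two standardized variables with a correlation of about 0.6 and deletes y for the 40% of cases with the lowest x values, a MAR pattern chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical complete data: x is fully observed, y will lose values.
x = rng.standard_normal(n)
y = 0.6 * x + 0.8 * rng.standard_normal(n)

# MAR missingness: y is missing for the 40% of cases with the lowest x.
missing = x < np.quantile(x, 0.4)
y_obs = np.where(missing, np.nan, y)

# Listwise deletion: analyze only the cases with y observed.
complete = ~np.isnan(y_obs)
deletion_mean = y_obs[complete].mean()

# Mean imputation: replace every missing y with the observed mean.
y_imputed = np.where(complete, y_obs, deletion_mean)
imputed_sd = y_imputed.std(ddof=1)
imputed_corr = np.corrcoef(x, y_imputed)[0, 1]

print(f"true mean of y:        {y.mean():.3f}")
print(f"deletion mean of y:    {deletion_mean:.3f}")
print(f"true SD of y:          {y.std(ddof=1):.3f}")
print(f"mean-imputed SD of y:  {imputed_sd:.3f}")
print(f"true correlation:      {np.corrcoef(x, y)[0, 1]:.3f}")
print(f"mean-imputed corr.:    {imputed_corr:.3f}")
```

Under these assumptions, deletion overstates the mean of y because low scorers are dropped, and mean imputation additionally shrinks the standard deviation and attenuates the correlation because every imputed case sits at a single value.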

Modern missing data techniques

Maximum likelihood estimation and multiple imputation are considered “state of the art” missing data techniques (Schafer & Graham, 2002) and are widely recommended in the methodological literature (e.g., Schafer & Olsen, 1998; Allison, 2002; Enders, 2006). These approaches are superior to traditional missing data techniques because they produce unbiased estimates with both MCAR and MAR data. Furthermore, maximum likelihood and multiple imputation tend to be more powerful than traditional data
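For multiple imputation, the analysis is run once per imputed data set and the results are then combined. This excerpt does not show the combining step, so the sketch below uses the standard pooling rules usually attributed to Rubin: the pooled estimate is the average of the per-imputation estimates, and the total variance adds the average within-imputation variance to an inflated between-imputation variance. The function name and the five estimates and standard errors are hypothetical values chosen for illustration.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets.

    estimates: the m point estimates, one per imputed data set.
    variances: the m squared standard errors, one per imputed data set.
    Returns the pooled estimate and its pooled standard error.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size

    pooled = estimates.mean()          # average point estimate
    within = variances.mean()          # average sampling variance
    between = estimates.var(ddof=1)    # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, np.sqrt(total)

# Hypothetical regression slopes and standard errors from m = 5 imputations.
estimates = [0.52, 0.48, 0.55, 0.50, 0.45]
std_errors = [0.10, 0.11, 0.10, 0.12, 0.10]

pooled_estimate, pooled_se = pool_rubin(estimates, np.square(std_errors))
print(f"pooled estimate = {pooled_estimate:.3f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error is larger than the average of the individual standard errors whenever the estimates differ across imputations, which is how the procedure carries the uncertainty due to missing data into the final inference.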

Analysis example 1

The previous bivariate analysis example used a fictitious and somewhat unrealistic data set. This section describes analyses of a real-world data set that uses maximum likelihood to estimate correlations, means, standard deviations, and a multiple regression model. Here, we have chosen to limit our analytic example to maximum likelihood estimation because this procedure is often easier to implement than multiple imputation. Every commercially available structural equation modeling software
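A rough illustration of what maximum likelihood does with incomplete cases is possible in one special case. The sketch below does not use the LSAY data or any structural equation modeling software. It assumes a bivariate normal model in which x is complete and only y has missing values. In that case the maximum likelihood estimate of the mean of y has a closed form: the regression of y on x, fitted to the complete cases, evaluated at the mean of x from all cases. The function name and the simulated data are illustrative assumptions.

```python
import numpy as np

def ml_mean_with_missing_y(x, y):
    """ML estimate of the mean of y when x is complete and y has NaNs.

    Assumes bivariate normality and MAR missingness that depends on x.
    """
    observed = ~np.isnan(y)
    # np.polyfit returns coefficients from highest degree down: slope, intercept.
    slope, intercept = np.polyfit(x[observed], y[observed], deg=1)
    # Every case contributes through the mean of x, not only complete cases.
    return intercept + slope * x.mean()

# Simulated data with a true mean of y equal to 0 (illustrative values).
rng = np.random.default_rng(2)
n = 20_000
x = rng.standard_normal(n)
y = 0.6 * x + 0.8 * rng.standard_normal(n)

# MAR: y is missing for the 40% of cases with the lowest x values.
y_incomplete = np.where(x < np.quantile(x, 0.4), np.nan, y)

complete_case_mean = np.nanmean(y_incomplete)
ml_mean = ml_mean_with_missing_y(x, y_incomplete)
print(f"complete-case mean = {complete_case_mean:.3f}")
print(f"ML mean            = {ml_mean:.3f}")
```

In this sketch the complete-case mean is biased upward while the maximum likelihood estimate lands close to the true value, because the cases with missing y still inform the estimate through their x scores. General-purpose software handles arbitrary missing data patterns iteratively rather than through this closed form.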

Using auxiliary variables to fine-tune a maximum likelihood analysis

Recall from the earlier discussion of missing data mechanisms that MAR is not a characteristic of the entire data set, but is a situation that depends on the variables included in the analysis. Since maximum likelihood and multiple imputation require the MAR assumption, adding so-called auxiliary variables to an analysis can help fine-tune the missing data handling procedure, either by reducing bias or by increasing power. Auxiliary variables are additional variables not required to answer the

Analysis example 2

To demonstrate the use of auxiliary variables in a maximum likelihood analysis, we will continue with the previous LSAY example. In this particular example, we chose three auxiliary variables: 9th grade math scores, a variable that quantifies the amount of math and science resources in the home, and mother's level of education. In general, a useful auxiliary variable is a potential cause or correlate of missingness or a correlate of the incomplete variables in the analysis model (Collins et

Using missing data to your advantage: planned missingness designs

Thus far, we have described methods for dealing with unintentional missing data. The development of modern missing data techniques has made planned missing data research designs a possibility. The idea of intentionally creating missing data may feel counterintuitive because missing data is generally thought of as a nuisance and something to avoid. However, the reader may already be familiar with planned missing data designs. For instance, in a classic experimental design, subjects are randomly
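One widely described planned missingness layout is the three-form design, which this excerpt does not spell out. The sketch below assumes the usual version: a questionnaire is split into a common block X and three blocks A, B, and C, and each participant is randomly assigned one of three forms, each of which omits a different one of the three blocks. The block labels, function name, and sample size are illustrative.

```python
import numpy as np

BLOCKS = ("X", "A", "B", "C")

# Each form administers the common block X plus two of the other three blocks.
FORMS = {
    1: ("X", "A", "B"),  # omits C
    2: ("X", "A", "C"),  # omits B
    3: ("X", "B", "C"),  # omits A
}

def assign_three_form(n_participants, seed=0):
    """Randomly assign forms and return a boolean 'administered' matrix.

    Rows are participants and columns follow BLOCKS. False marks a block
    that is missing by design for that participant.
    """
    rng = np.random.default_rng(seed)
    forms = rng.integers(1, 4, size=n_participants)  # values 1, 2, or 3
    administered = np.array(
        [[block in FORMS[int(form)] for block in BLOCKS] for form in forms]
    )
    return forms, administered

forms, administered = assign_three_form(n_participants=300)
print("participants per form:", np.bincount(forms)[1:])
print("share answering each block:", administered.mean(axis=0).round(2))
```

Because the form is assigned at random, the resulting missingness is unrelated to any participant characteristic, so the data are MCAR by construction and can be analyzed with maximum likelihood or multiple imputation.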

Discussion

Historically, researchers have relied on a variety of ad hoc techniques to deal with missing data. The most common of these ad hoc techniques include deletion methods or techniques that attempt to fill in each missing value with a single substitute. Some of the traditional missing data techniques require strict assumptions regarding the reason why data are missing (i.e., the MCAR mechanism) and only work in a limited set of circumstances. Others (e.g., mean substitution) never work well. As a

References (31)

  • L.S. Aiken et al. Multiple regression: Testing and interpreting interactions (1991)
  • P.D. Allison. Missing data (2002)
  • B. Azar. Finding a solution for missing data. Monitor on Psychology (2002)
  • T.E. Bodner. Missing data: Prevalence and reporting practices. Psychological Reports (2006)
  • K.A. Bollen. Structural equations with latent variables (1989)
  • L.M. Collins et al. A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods (2001)
  • C.K. Enders. A primer on the use of modern missing-data methods in psychosomatic medicine research. Psychosomatic Medicine (2006)
  • C.K. Enders. Applied missing data analysis (2010)
  • C. Enders et al. Modern alternatives for dealing with missing data in special education research
  • J.W. Graham. Adding missing-data relevant variables to FIML-based structural equation models. Structural Equation Modeling: A Multidisciplinary Journal (2003)
  • J.W. Graham et al. Analysis with missing data in prevention research
  • J.W. Graham et al. Maximizing the usefulness of data obtained with planned missing value patterns: An application of maximum likelihood procedures. Multivariate Behavioral Research (1996)
  • J.W. Graham et al. How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science (2007)
  • J.W. Graham et al. Planned missing data designs in the analysis of change
  • J.W. Graham et al. Planned missing data designs in psychological research. Psychological Methods (2006)