Original Article
Quality criteria were proposed for measurement properties of health status questionnaires

https://doi.org/10.1016/j.jclinepi.2006.03.012

Abstract

Objectives

Recently, an increasing number of systematic reviews have been published in which the measurement properties of health status questionnaires are compared. For a meaningful comparison, quality criteria for measurement properties are needed. Our aim was to develop quality criteria for design, methods, and outcomes of studies on the development and evaluation of health status questionnaires.

Study Design and Setting

Quality criteria for content validity, internal consistency, criterion validity, construct validity, reproducibility, longitudinal validity, responsiveness, floor and ceiling effects, and interpretability were derived from existing guidelines and consensus within our research group.

Results

For each measurement property a criterion was defined for a positive, negative, or indeterminate rating, depending on the design, methods, and outcomes of the validation study.

Conclusion

Our criteria make a substantial contribution toward defining explicit quality criteria for measurement properties of health status questionnaires. Our criteria can be used in systematic reviews of health status questionnaires, to detect shortcomings and gaps in knowledge of measurement properties, and to design validation studies. The future challenge will be to refine and complete the criteria and to reach broad consensus, especially on quality criteria for good measurement properties.

Introduction

The number of available health status questionnaires has increased dramatically over the past decades. Consequently, choosing which questionnaire to use is becoming a major difficulty. Recently, a large number of systematic reviews have been published of the questionnaires available for measuring a specific concept in a specific population, for example [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]. These systematic reviews typically compare the content and measurement properties of the available questionnaires. In analogy to systematic reviews of clinical trials, criteria are needed to determine the methodological quality of studies on the development and evaluation of health status questionnaires. In addition, criteria for good measurement properties are needed to justify which questionnaire is best.

Several articles offer criteria for the evaluation of questionnaires. Probably the best-known and most comprehensive criteria are those from the Scientific Advisory Committee (SAC) of the Medical Outcomes Trust [12]. The SAC defined eight attributes of instrument properties that warrant consideration in evaluation: (1) conceptual and measurement model, (2) validity, (3) reliability, (4) responsiveness, (5) interpretability, (6) respondent and administrative burden, (7) alternative forms, and (8) cultural and language adaptations (translations). Within each of these attributes, specific criteria were defined by which instruments should be reviewed. Similar criteria have been defined, e.g., by Bombardier and Tugwell [13], Andresen [14], and McDowell and Jenkinson [15]. What these criteria often lack, however, is an explicit statement of what constitutes good measurement properties. For example, for the assessment of validity it is often recommended that hypotheses about expected results should be tested, but no criteria have been defined for how many hypotheses should be confirmed to justify that a questionnaire has good validity. No criteria have been defined for what constitutes good agreement (acceptable measurement error), good responsiveness, or good interpretability, and no criteria have been defined for the required sample size of studies assessing measurement properties.

As suggested by the SAC [12], we took on the challenge to further discuss and refine the available quality criteria for studies on the development and evaluation of health status questionnaires, including explicit criteria for the following measurement properties: (1) content validity, (2) internal consistency, (3) criterion validity, (4) construct validity, (5) reproducibility, (6) responsiveness, (7) floor and ceiling effects, and (8) interpretability. We used our criteria in two systematic reviews comparing the measurement properties of questionnaires for shoulder disability [1] and for visual functioning [4], and revised them based on our experiences in these reviews. Our criteria can also be used to detect shortcomings and gaps in knowledge of measurement properties, and to design validation studies.

In this article we define our quality criteria for measurement properties, discuss the difficult and sometimes arbitrary choices we made, and indicate future challenges. We emphasize that, just like the criteria offered by the SAC and others, our criteria are open to further discussion and refinement. Our aim is to contribute to the development of explicit quality criteria for the design, methods, and outcomes of studies on the development and evaluation of health status questionnaires.

Section snippets

Content validity

Content validity examines the extent to which the concepts of interest are comprehensively represented by the items in the questionnaire [16]. To be able to rate the quality of a questionnaire, authors should provide a clear description of the following aspects regarding the development of a questionnaire:

  • Measurement aim of the questionnaire, i.e., discriminative, evaluative, or predictive [17]. The measurement aim is important, because different items may be valid for different aims. For

Internal consistency

Internal consistency is a measure of the extent to which items in a questionnaire (sub)scale are correlated (homogeneous), thus measuring the same concept. Internal consistency is an important measurement property for questionnaires that intend to measure a single underlying concept (construct) by using multiple items. In contrast, for questionnaires in which the items are merely different aspects of a complex clinical phenomenon that do not have to be correlated, such as in the Apgar Scale [20]
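As an illustration of how the degree of correlation among the items of a (sub)scale can be summarized, the sketch below computes Cronbach's alpha. The choice of this statistic is an assumption made for the example (the excerpt above does not name a specific statistic), and the function name `cronbach_alpha` and the input layout (one row per respondent, one column per item) are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Illustrative sketch only; assumes complete data (no missing items).
    """
    n = len(item_scores)
    if n < 2:
        raise ValueError("need at least two respondents")
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    if any(len(row) != k for row in item_scores):
        raise ValueError("every respondent must answer the same number of items")

    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Sample variance of the total (sum) score across respondents.
    total_var = variance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")

    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up responses of five respondents to a three-item subscale.
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 4],
]
alpha = cronbach_alpha(responses)
```

Higher values indicate more strongly correlated items. As the text notes, such a statistic is only meaningful for scales that intend to measure a single underlying construct.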

Criterion validity

Criterion validity refers to the extent to which scores on a particular instrument relate to a gold standard. We give a positive rating for criterion validity if convincing arguments are presented that the standard used really is “gold” and if the correlation with the gold standard is at least 0.70.
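The rule for a positive rating can be sketched as a small check. This is a minimal illustration, not the authors' implementation: the text does not say which correlation coefficient is meant, so Pearson's r is assumed here, and the function names `pearson_r` and `is_positive_criterion_validity` are hypothetical. Whether the gold standard is convincingly justified is a reviewer's judgment and is passed in as a flag.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length (at least 2 values)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)


def is_positive_criterion_validity(r, gold_standard_justified, threshold=0.70):
    """Positive rating: justified gold standard AND correlation >= threshold."""
    return gold_standard_justified and r >= threshold


# Made-up scores on the questionnaire and on the gold standard.
questionnaire = [10, 20, 30, 40, 50]
gold_standard = [12, 18, 33, 41, 48]
r = pearson_r(questionnaire, gold_standard)
positive = is_positive_criterion_validity(r, gold_standard_justified=True)
```

Both conditions must hold: a high correlation alone does not yield a positive rating if the argument that the comparison measure is a true gold standard is unconvincing.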

Construct validity

Construct validity refers to the extent to which scores on a particular instrument relate to other measures in a manner that is consistent with theoretically derived hypotheses concerning the concepts that are being measured [17], [19]. Construct validity should be assessed by testing predefined hypotheses (e.g., about expected correlations between measures or expected differences in scores between “known” groups). These hypotheses need to be as specific as possible. Without specific

Reproducibility

Reproducibility concerns the degree to which repeated measurements in stable persons (test–retest) provide similar answers. We believe that it is important to make a distinction between reliability and agreement [29], [30]. Agreement concerns the absolute measurement error, i.e., how close the scores on repeated measures are, expressed in the unit of the measurement scale at issue. Small measurement error is required for evaluative purposes in which one wants to distinguish clinically important

Responsiveness

Responsiveness has been defined as the ability of a questionnaire to detect clinically important changes over time, even if these changes are small [37]. A large number of definitions and methods were proposed for assessing responsiveness [38]. We consider responsiveness to be a measure of longitudinal validity. In analogy to construct validity, longitudinal validity should be assessed by testing predefined hypotheses, e.g., about expected correlations between changes in measures, or expected

Floor or ceiling effects

Floor or ceiling effects are considered to be present if more than 15% of respondents achieved the lowest or highest possible score, respectively [41]. If floor or ceiling effects are present, it is likely that extreme items are missing in the lower or upper end of the scale, indicating limited content validity. As a consequence, patients with the lowest or highest possible score cannot be distinguished from each other, and reliability is therefore reduced. Furthermore, the responsiveness is limited
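The 15% rule stated at the start of this section can be checked directly from a list of total scores. The sketch below is illustrative only; the function name `floor_ceiling_effects` and the returned dictionary keys are hypothetical, and the scale limits must be supplied by the user because they depend on the questionnaire.

```python
def floor_ceiling_effects(scores, min_score, max_score, threshold=0.15):
    """Flag floor/ceiling effects: more than `threshold` of respondents
    at the lowest or highest possible score, respectively."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_fraction = sum(1 for s in scores if s == min_score) / n
    ceiling_fraction = sum(1 for s in scores if s == max_score) / n
    return {
        "floor_fraction": floor_fraction,
        "ceiling_fraction": ceiling_fraction,
        # "More than 15%" is read as a strict inequality.
        "floor_effect": floor_fraction > threshold,
        "ceiling_effect": ceiling_fraction > threshold,
    }


# Made-up total scores on a hypothetical 0-100 scale:
# 4 of 20 respondents (20%) score the minimum, 1 of 20 (5%) the maximum.
scores = [0, 0, 0, 0, 100] + [35, 40, 45, 50, 55] * 3
result = floor_ceiling_effects(scores, min_score=0, max_score=100)
```

In this made-up sample a floor effect is flagged but no ceiling effect, since only the share of respondents at the lowest possible score exceeds 15%.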

Interpretability

Interpretability is defined as the degree to which one can assign qualitative meaning to quantitative scores [42]. Investigators should provide information about what (change in) score would be clinically meaningful. Various types of information can aid in interpreting scores on a questionnaire: (1) means and SD of scores of (subgroups of) a reference population (norm values); (2) means and SD of scores of relevant subgroups of patients who are expected to differ in scores (e.g., groups with

Population-specific ratings of measurement properties

A summary of the criteria for measurement properties of health status questionnaires is presented in Table 1. Each property is rated as positive, negative, or indeterminate, depending on the design, methods, and outcomes of the study. Measurement properties differ between populations and settings. Therefore, the evaluation of all measurement properties needs to be conducted in a population and setting that is representative of the population and setting in which the questionnaire is going to

Overview table

In the final comparison of the measurement properties of different questionnaires, one has to consider all ratings together when choosing between questionnaires. We recommend composing a table that provides an overview of all ratings, such as the example given in Table 2. Table 2 presents the results from our systematic review of all questionnaires measuring disability in patients with shoulder complaints (because there is no gold standard for disability, criterion validity

Discussion

We developed quality criteria for the design, methods, and outcomes of studies on the development and evaluation of health status questionnaires. Nine measurement properties were distinguished: content validity, internal consistency, criterion validity, construct validity, reproducibility, longitudinal validity, responsiveness, floor and ceiling effects, and interpretability.

Our criteria are mostly opinion based because there is no empirical evidence in this field to support explicit quality

Future challenges

One might argue that our criteria are not discriminative enough to distinguish between good and very high-quality questionnaires. This would be important when many high-quality questionnaires are available, but in our experience, within the field of health status and health-related quality of life measurement, this is not (yet) the case. Therefore, we believe that our criteria work well to separate the wheat from the chaff. The next step would be to further refine and complete the criteria,

References (54)

  • D.E. Beaton et al.

    Evaluating changes in health status: reliability and responsiveness of five generic health status measures in workers with musculoskeletal disorders

    J Clin Epidemiol

    (1997)
  • G. Stucki et al.

    Relative responsiveness of condition-specific and generic health status measures in degenerative lumbar spinal stenosis

    J Clin Epidemiol

    (1995)
  • J.G. Wright et al.

    A comparison of different indices of responsiveness

    J Clin Epidemiol

    (1997)
  • S.D. Bot et al.

    Clinimetric evaluation of shoulder disability questionnaires: a systematic review of the literature

    Ann Rheum Dis

    (2004)
  • E.C. Jorstad et al.

    Measuring the psychological outcomes of falling: a systematic review

    J Am Geriatr Soc

    (2005)
  • G. Daker-White

    Reliable and valid self-report outcome measures in sexual (dys)function: a systematic review

    Arch Sex Behav

    (2002)
  • M.R. de Boer et al.

    Psychometric properties of vision-related quality of life questionnaires: a systematic review

    Ophthalmic Physiol Opt

    (2004)
  • B. Edwards et al.

    Quality of life instruments for caregivers of patients with cancer

    Cancer Nurs

    (2002)
  • A.M. Garratt et al.

    Patient-assessed health instrument for the knee: a structured review

    Rheumatology

    (2004)
  • P. Hallin et al.

    Spinal cord injury and quality of life measures: a review of instrument psychometric quality

    Spinal Cord

    (2000)
  • K.L. Haywood et al.

    Quality of life in older people: a structured review of generic self-assessed health instruments

    Qual Life Res

    (2005)
  • K.L. Haywood et al.

    Patient-assessed health in ankylosing spondylitis: a structured review

    Rheumatology (Oxford)

    (2005)
  • T.P. Ettema et al.

    A review of quality of life instruments used in dementia

    Qual Life Res

    (2005)
  • Scientific Advisory Committee of the Medical Outcomes Trust

    Assessing health status and quality-of-life instruments: attributes and review criteria

    Qual Life Res

    (2002)
  • C. Bombardier et al.

    Methodological considerations in functional assessment

    J Rheumatol

    (1987)
  • I. McDowell et al.

    Development standards for health measures

    J Health Serv Res Policy

    (1996)
  • G.H. Guyatt et al.

    Measuring health related quality of life

    Ann Intern Med

    (1993)