Abstract
Introduction: Improving design, selection, and implementation of appropriate clinical quality measures can reduce harms and costs of health care and improve the quality and experience of care delivery. These measures have not been evaluated for appropriateness for use in performance measurement in a systematic, reproducible, and widely accepted manner.
Methods: We defined 10 criteria for evaluation of measure appropriateness in 4 domains: patient-centeredness of outcomes, specification of population measured and measure detail, reliable evidence that benefits likely outweigh harms, and independence from significant confounders. We applied these criteria to 24 measures under consideration for statewide use in Massachusetts in public and private incentive-based programs. We appraised each measure as Appropriate or Not Appropriate for such use.
Results: We rated 15 measures as Appropriate (62.5%). Three measures (12.5%) were considered Appropriate only if applied at a system level but not for patient-provider assessment and 6 measures (25%) were rated Not Appropriate. Reasons for designation as “Not Appropriate” included benefits not clearly outweighing harms, lack of preservation of patient autonomy, inappropriate specification of population and measure detail, confounding by locus of control, and confounding by social determinants of health.
Conclusions: Using this consensus-driven, 10-criteria methodology, we were able to evaluate the appropriateness of clinical quality measures. This methodology may improve measure design and inform selection of the most appropriate measures for use in quality measurement, financial incentives, and reporting.
Introduction
Family physicians are routinely evaluated using clinical quality measures. Whereas such measures may inform quality improvement (QI) activities, the stakes are higher when they are used for public reporting or in pay-for-performance (P4P) programs. Despite considerable measure development effort over the past 2 decades, many quality measures remain flawed. There is no universally accepted standard for measure development, evaluation, or implementation, and there is very limited evidence that these measures lead to improved health outcomes.1,2 Implementation of flawed measures, no matter how well intended, may have harmful and unintended consequences, including inappropriate intensification of treatment to reach arbitrary targets, as well as the opportunity costs and waste associated with focusing on measured outcomes at the expense of more important goals. Troubling ethical dilemmas are created when poorly designed measures pit the interests of doctors against those of patients.
Family physicians, burdened by clinical quality measures, often experience pressure to alter their care of patients to optimize performance on measures. Such pressures may be self-imposed, may come from employers or insurers, or may arise in response to public reporting. Family physicians respond to these demands in different ways. Some ignore these measures, often selectively, for example, prioritizing measures based on supporting evidence and the interests of their patients. Others may engage in “gaming” to optimize performance through various manipulations, including patient selection, adjusting diagnostic coding to include or exclude certain patients from a measure, altering close-to-target blood pressure (BP) readings, and attending to idiosyncratic timing (eg, ensuring that an at-target A1c or BP result is the last one of the calendar year for reporting purposes).
Distinguishing appropriate from inappropriate quality measures requires criteria by which to make such judgments. Whereas flawed measures may be acceptable in certain settings (such as early stages of local QI efforts), when the stakes are high (such as in P4P programs or public reporting), measures should satisfy more rigorous criteria. Clinical quality measures have not been evaluated for appropriateness for use in performance measurement in a systematic, transparent, reproducible, and widely accepted manner, with the exception of the American College of Physicians (ACP) review.3 That review described a systematic methodology for evaluating the validity of 86 general medicine measures; 30 (35%) were judged Not Valid and 24 (28%) of Uncertain Validity. Endorsements of measures by the National Quality Forum are influential, but their evaluation process is neither openly available nor reproducible.
Methods
We convened a group of family physicians (with diversity in gender, age, community, and practice setting) to create a reproducible methodology for assessing the appropriateness of clinical quality measures for use in P4P programs.
By consensus, we developed a set of 10 criteria for measure appropriateness4 (Table 1). For this pilot implementation, we classified these criteria into 4 domains: patient-centeredness, specification of outcome and population detail, evidence regarding benefits and harms, and independence from significant confounders.
Table 1. Criteria for Evaluation of Appropriateness of Clinical Quality Measures
At the request of the Massachusetts Medical Society Committee on Quality, we assessed 24 measures under consideration for statewide use in public and private incentive-based programs by the Massachusetts Executive Office of Health and Human Services Quality Alignment Task Force.5 We met 3 times for a total of 8 hours. We rated each measure as Appropriate or Not Appropriate through open dialog until reaching consensus.
Results
We rated 15 measures (62.5%) as Appropriate (Table 2). Three additional measures (12.5%), which required availability and coordination of care among systems or multiple providers, were considered Appropriate only if applied at a system level but not for patient-provider assessment. We rated 6 measures (25%) as Not Appropriate. Reasons for designation as “Not Appropriate” included benefits not clearly outweighing harms, lack of preservation of patient autonomy, inappropriate specification of included population and/or measure detail, confounding by inappropriate locus of control, and confounding by social determinants of health (SDOH).
Table 2. Appropriateness Ratings of 24 Measures
Four of the 6 measures rated Not Appropriate fail multiple criteria (Table 2).
Three measures fail to Preserve Patient Autonomy:
Two measures with specific BP targets provide no opportunity for patients and clinicians to weigh the potential harms of additional BP lowering against the likely small benefit when BP is already near target. The values and preferences of patients are neither elicited nor respected in implementing these measures.
Breast cancer screening involves highly personal decisions. There is evidence of potential benefits, but also significant potential harms (false positives, overdiagnosis, overtreatment) that vary widely in relative importance depending on patient values. Therefore, shared decision making (SDM) is most appropriate.6 Inexplicably, this measure penalizes clinicians who engage in thoughtful collaboration with patients who then decline screening.
Two measures fail on Denominator Specification:
Two BP control measures did not exclude elderly patients (older than 75 or 85 years), for whom intensive efforts to lower BP create significant risk of medication-related adverse events.
Three measures fail on Numerator Specification:
A drug dependence treatment measure requires that treatment be initiated by a clinician other than the one seen at the initial visit, or on a day subsequent to that visit. However, initiation of treatment by a primary care provider (PCP) on the same day can be clinically appropriate (perhaps even ideal).
Two outcome measures define depression remission as a PHQ-9 score < 5. This is inappropriate because PHQ-9 scores of 5 to 9 are not specific for depression.7 Because each of the 9 PHQ-9 items is scored from 0 to 3, somatic symptoms such as fatigue and insomnia alone can produce scores of 5 or greater in the absence of clinical depression.
Three measures are not supported by evidence that Benefits Clearly Outweigh Harms:
Although there is evidence of net benefit for BP lowering in severe hypertension, this is uncertain for patients with mild hypertension and no cardiovascular disease.8,9 The potential harms of medication-related adverse effects may outweigh the benefits of more intensive BP control, especially for older patients.10
No all-cause mortality benefit has been demonstrated for screening mammography, and the breast cancer-specific mortality benefit is extremely small (Number Needed to Screen = 1503 women aged 50 to 59 years).11 Significant harms such as false positives and overdiagnosis are well-described and highly prevalent.12 After engaging in SDM, many women reasonably conclude that the benefits of screening do not exceed the harms.
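For context, the Number Needed to Screen is the reciprocal of the absolute risk reduction (ARR); this is a standard epidemiologic identity rather than a calculation reported in the cited review. The estimate above therefore implies:

\[
\mathrm{NNS} = \frac{1}{\mathrm{ARR}} \quad\Longrightarrow\quad \mathrm{ARR} = \frac{1}{1503} \approx 0.07\%
\]

That is, roughly 7 breast cancer deaths averted per 10,000 women aged 50 to 59 years screened over the follow-up period of the underlying trials.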
One measure fails to be within the clinician's Locus of Control:
A drug dependence measure that requires suitable follow-up care defines a variety of visits as the index visit, including emergency department (ED) visits. Whether the ED arranges appropriate follow-up or referral is beyond a PCP's control. This measure would be appropriate only if applied at a system level, where there is control and influence over all visits and follow-up services.
This measure is also strongly influenced by SDOH, including the ability to afford certain types of care, access to care, access to transportation, and presence of an adequate social support network.
Discussion
Inappropriate quality measures cause harm and promote waste in health care. Among the 24 measures evaluated against our 10 criteria, we identified problems with 9 (38%). These results are similar to those of the ACP analysis, which deemed 35% of measures Not Valid, but there are important differences in our methods and findings. The ACP developed a numeric scoring rubric, rating measures as Valid, Uncertain Validity, or Not Valid. In contrast, our criteria were developed and applied by group consensus regarding key elements of appropriateness, without numeric cutoffs. Distinct from the ACP, our criteria included or emphasized preservation of patient autonomy, assessment of certainty of net benefit, evaluation of resistance to gaming, and limiting potential confounding by SDOH.
The qualitative aspect of our rating process may be considered a limitation, but it also allows flexibility in implementation to meet local priorities. Quantitative approaches may be developed in future iterations of this approach, but care must be taken to ensure that quantification does not result in false precision or inconsistent results. We are confident that our criteria would allow other representative groups of stakeholders to reach conclusions similar to ours, but demonstration of external validity is beyond the scope of this first pilot.
The ACP did not rate 12 of the 24 measures we evaluated (primarily measures relevant to children or mental health). Among the 12 measures rated by both groups, 3 were rated differently. We judged the breast cancer screening measure Not Appropriate (due to absence of certainty of net benefit and failure to preserve patient autonomy), whereas the ACP deemed this measure Valid (the ACP does not include patient autonomy as a key criterion). Two other measures were rated differently because the ACP includes a third rating category (Uncertain Validity), whereas we do not. Results are summarized in Table 2.
Conclusion
Clinical quality measures influence behavior, especially when tied to P4P, but they may induce harms and waste through unintended consequences, especially when poorly designed or implemented. Identifying flawed clinical quality measures using specific criteria can illuminate the nature of their flaws and facilitate replacement or improvement.
Inappropriate quality measures should be retired or improved. More meaningful measures (eg, the Person-Centered Primary Care Measure13) should be developed to promote improved quality and experience of care for patients and clinicians. Family physicians are ideally positioned to influence decisions regarding selection and prioritization of performance measures, which often occur at local and regional levels. This would promote alignment of effort and resources with outcomes that truly matter to patients.
Acknowledgments
We thank the many students from Tufts University School of Medicine and Harvard Medical School who participated in previous analyses using earlier versions of the criteria set.
Notes
This article was externally peer reviewed.
Funding: None.
Conflict of interest: Dr. Alan Ehrlich is a full-time employee at EBSCO, publisher of DynaMed. Dr. Brian Alper was a full-time employee at EBSCO during manuscript drafting and is the owner of Computable Publishing, LLC.
To see this article online, please go to: http://jabfm.org/content/35/2/427.full.
- Received for publication July 14, 2021.
- Revision received October 12, 2021.
- Accepted for publication October 14, 2021.