Abstract
Purpose: The American Board of Family Medicine approved the use of a new blueprint for the Family Medicine Certification Examination, the In-Training Examination, Family Medicine Certification Longitudinal Assessment, and the Continuous Knowledge Self-Assessment. It will go into effect in January 2025. The blueprint defines the content domains for the questions on the examination and the percentage of questions in each domain. This article describes the process used to establish the percentage of questions in each domain.
Methods: A random sample of 2000 clinically active ABFM Diplomates was surveyed about the frequency and risk of patient harm associated with 202 clinical activities. The results were translated into recommended percentages of questions for each content domain.
Results: The survey response rate was 48%, and the demographic data for the responders were representative of ABFM-certified, clinically active Diplomates.
Conclusions: This article demonstrates how the examination content is directly connected to the clinical activities that comprise the scope of family practice in a way that considers both how often the activities are performed and the risk of patient harm if an activity is not performed correctly.
- Certification
- Clinical Skills
- Evaluation Studies
- Family Medicine
- Licenses
- Measures
- Patient Harm
- Program Evaluation
- Psychometrics
- Quantitative Research
- Research
- Statistical Models
- Statistics
- Surveys and Questionnaires
Introduction
In September 2022, the American Board of Family Medicine (ABFM) approved the use of a new blueprint for the Family Medicine Certification Scale (FMC-S) that will go into effect in January 2025. The FMC-S is the common scale used by ABFM to measure physician ability across the entire spectrum of family medicine. It is the basis for the Family Medicine Certification Examination (FMCE), the In-Training Examination (ITE), the Continuous Knowledge Self-Assessment (CKSA), and the Family Medicine Certification Longitudinal Assessment (FMCLA). An examination blueprint is a document that describes the specifications to which each examination is made. In its most basic form, a blueprint has 2 key elements: the content domains into which the questions are categorized and the percentage of questions from each content domain.
This article is part of a broader effort to describe the process of how the new FMC-S blueprint was created. The development of the content domain structure for the new blueprint is described in a separate article,1 but the process to determine the percentage of questions for each domain is described here. Publishing this process and the results is important because it makes public how the examination is tied to the practice of the specialty. Most of the larger medical certification boards do this when creating a new blueprint2–7 and when validating their existing blueprints.8–10 Although the results of a practice analysis are usually empirical in nature, the official endorsement of a blueprint is a policy decision set by the organization’s board of directors.
Methods
The Stelter article1 laid out the following 5 content domains for the blueprint.
- Acute Care & Diagnosis,
- Chronic Care Management,
- Emergent & Urgent Care,
- Preventive Care, and
- Foundations of Care.
Foundations of Care was the only content domain that was not articulated in terms of performed, observable clinical activities. Because it was not operationalized in terms of performed activities, the researchers could not ask the physicians how often they performed them. The weight for Foundations of Care was therefore based on the judgment of the ABFM board of directors without additional empirical support. The ABFM board of directors decided to set the Foundations of Care content domain at 5% of the examination.1 This study describes how the survey results were used to recommend percentages of examination questions for the remaining 4 content domains. For these domains, the baseline percentages were based on how many different clinical activities comprised each domain. The baseline was then adjusted by how often each of these activities was performed by family physicians nationally and by the risk of patient harm associated with each activity.
The process of assigning recommended weights occurred in 2 stages. The first stage was the collection of frequency and risk of harm ratings from a representative sample of clinically active family physicians. The second stage was the estimation of frequency weights and risk of harm weights for each clinical activity and then using them to adjust the baseline weighting. For this reason, the methods and results sections are both subdivided into a data collection section and a content domain weight section.
Methods: Representative Data Sample
Survey
The survey comprised the 202 clinical activities grouped into the clinical content domains specified in a previous study.1 For each activity, respondents were instructed to provide 2 ratings. The first was to rate how often they expect to perform the activity using a 5-point Likert11 scale (Daily, Weekly, Monthly, Once a year, Rarely). The second was to rate the general level of risk to the patient if the condition is misdiagnosed or not managed properly, using a 4-point Likert11 scale (Minimal, Moderate, Considerable, Extreme).
Sampling Frame
A population is the entire group about which we wish to make inferences, but it is often difficult to identify every member of the population. To draw a random sample from a population, a large roster, called a sampling frame, is used to represent the population, and a sample is randomly selected from it.12 To gather information about what clinically active, ABFM-certified family physicians do in practice, the sampling frame was defined as all currently certified ABFM Diplomates who had indicated that they were clinically active and were initially certified before 2021. Diplomates certified in 2021 were excluded to prevent their notion of practice in residency from contaminating their perception of postresidency independent practice. Diplomates who had participated in earlier pilot studies (n = 69) of this survey were also excluded. Finally, anyone who did not have a valid e-mail on file or who had requested that we not send them e-mails was likewise excluded. There were 89,188 family physicians who met these criteria.
ABFM has some basic demographic information on almost all Diplomates. These variables include age, year of initial certification, race, gender, ethnicity, degree (MD/DO), and international medical graduate status. We used these variables to describe the population of interest.
Sample
From this sampling frame, we randomly selected 2000 Diplomates and invited them to take the survey. Before sending the invitations, the composition of the invitation sample was compared with the composition of the sampling frame to verify that the sample was representative of the sampling frame on the known demographic variables.
Survey Administration
Invitations to take the survey were emailed to selected Diplomates on June 2, 2022. The invitation included a link to a video from the ABFM CEO that explained the purpose and importance of the survey. The survey was expected to take 60 to 90 minutes to complete, and Diplomates would be provided 10 Continuous Certification Points toward their certification and an honorarium of $300 if they completed the survey by the deadline. Reminders were sent to invitees who had not yet completed the survey when there was only 1 week remaining and when there were only 2 days remaining. On the day of the deadline, a notification of a 1-week extension was sent to those who had not yet completed the survey. The survey deadline was extended twice to enhance the response rate.
Institutional Review Board
The procedures used in this study were reviewed by senior ABFM executive staff to ensure that ABFM privacy policies were not being violated. In addition, the data were deemed IRB-exempt by the American Academy of Family Physicians Institutional Review Board.
Non-Response Bias
Nonresponse occurs when invited survey participants are unwilling or unable to answer a question or the entire survey. However, it is only a concern if there is a systematic difference in the responses across responders and nonresponders. For example, consider opinion polls on polarizing topics: people with very strong opinions are often more likely to respond than those who are indifferent.
So, do we think that physicians who respond to this survey will rate the frequency and risk of patient harm associated with these clinical activities differently than those who do not respond? We suspect that differences between responders and nonresponders might be related to how much free time they have, but not related to their understanding of risk of harm to patients or how often they perform these clinical activities. Nevertheless, we compared the representativeness of the sample of responders with the sampling frame on the available demographic variables as a test of systematic bias.
Methods: Content Domain Weights
The first step toward creating interval scale weight adjustments for risk of patient harm and for frequency of occurrence is to model the data and conduct quality control checks to identify grossly misfitting responses. The frequency and risk-of-harm ratings from the survey were separately modeled and analyzed using a Rasch rating scale13 to generate a risk of patient harm scale (Index of Harm, IoH) and a frequency of occurrence scale (Frequency Index, FI). With a Rasch rating scale model, the same rating scale structure is imposed across all clinical activities within the scale. Winsteps,14 Rasch measurement calibration software, was used to estimate these item parameters. This process used the ordinal scale ratings to create an interval scale metric15 on which the clinical activities were placed with regard to their risk of patient harm or their frequency of occurrence, respectively. The raters were also placed on this scale based on their baseline tendency to rate IoH and FI higher or lower than average. This is used later for rater quality control.
Using Rasch models to create interval scales such as the IoH and FI from physicians’ ratings is not dependent on the baseline tendency of the raters to rate activities high or low even under fairly extreme conditions.16,17 Rasch models use the raters to answer how frequently each activity is performed (or the risk of harm associated with each activity) relative to all the other activities. This process removes the differences in individual rater’s baseline tendency to rate the activities as high or low. As a result, if the activity weights based on raters who gave overall lower ratings (less frequent) are compared with the activity weights based on raters who gave overall higher ratings (more frequent), the weights are expected to be the same within appropriate stochastic limits. Deviations from this expectation can be detected using fit statistics. We used mean square fit (MNSQ) statistics to estimate the magnitude of deviation and t standardized fit statistics (zSTD) to estimate the probability that the observed deviation would have occurred by chance.
Rater Quality Control
Occasionally, a few raters will not take their responsibility to rate the activities seriously. To protect against this, a rater-exclusion criterion was defined: any rater whose responses grossly misfit the expectations of the model (outfit mean square statistic greater than or equal to 2.75) on the risk-of-harm questions was excluded. Both the frequency and risk-of-harm datasets were rerun excluding these extremely misfitting raters.
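To illustrate how such an exclusion rule operates, the sketch below computes an outfit mean-square statistic for a single rater: the mean of squared standardized residuals between observed ratings and model expectations across all rated activities. The data and variable names are invented for illustration; the study's actual analysis was run in Winsteps.

```python
import numpy as np

def outfit_mnsq(observed, expected, variance):
    """Outfit mean-square for one rater: the mean of squared
    standardized residuals across all rated activities."""
    z_squared = (observed - expected) ** 2 / variance
    return z_squared.mean()

# Toy data for 5 activities: observed ratings, model-expected
# ratings, and model variances (all hypothetical)
obs = np.array([2.0, 3.0, 1.0, 4.0, 2.0])
exp = np.array([2.1, 2.8, 1.3, 3.6, 2.2])
var = np.array([0.8, 0.9, 0.7, 0.6, 0.8])

fit = outfit_mnsq(obs, exp, var)
flagged = fit >= 2.75  # the exclusion threshold used in the study
```

A well-fitting rater has an outfit mean square near 1.0; values at or above 2.75 indicate responses far noisier than the model predicts, which is why that cutoff was used to flag careless responders.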
The frequency and risk-of-harm ratings from the survey were analyzed separately using a Rasch rating scale model13 to generate the FI and IoH for each activity. More specifically, the ratings were modeled as:

ln(Pnik / Pni(k-1)) = Bn − Di − Fk

where,
Pnik is the probability that Rater n, when rating Activity i, would select Rating Category k,
Pni(k-1) is the probability that the rater would select Rating Category k-1,
Bn is the severity of Rater n,
Di is the difficulty of endorsing Activity i, and
Fk is the difficulty of endorsing Rating Category k relative to Rating Category k-1, where the categories are numbered 0, …, m.
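Under this model, the probability of each rating category for a given rater and activity follows directly from the parameters: the numerator for category k is the exponential of the cumulative sum of the adjacent-category logits (Bn − Di − Fj), normalized across categories. The sketch below uses invented parameter values to show the computation; it is an illustration of the rating scale model, not the study's estimation code.

```python
import numpy as np

def category_probs(B_n, D_i, F):
    """Rating-scale model category probabilities for rater n on
    activity i. Categories are numbered 0..m; F[0] contributes the
    same constant to every cumulative sum, so it cancels in the
    normalization (conventionally F[0] = 0)."""
    logits = np.cumsum(B_n - D_i - np.asarray(F))  # log-numerators
    numerators = np.exp(logits)
    return numerators / numerators.sum()

# Illustrative parameters for a 4-category rating scale (0..3)
F = [0.0, -1.0, 0.2, 0.8]  # category thresholds
p = category_probs(B_n=0.5, D_i=-0.3, F=F)
```

The defining property holds by construction: the log-odds of adjacent categories, ln(p[k]/p[k-1]), equals Bn − Di − Fk, matching the model equation above.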
Adjusting the Weights for Each Clinical Activity
The baseline empirically recommended percentage of questions for each of the 4 activity-based Blueprint domains is based on the number of clinical activities within each Blueprint domain, with each activity equally weighted. The survey was administered to a random sample of Diplomates so that each clinical activity could be weighted by both how often it was performed and the risk of patient harm associated with it. The goal was to generate 3 different sets of clinical activity weights in addition to the baseline in which all activities were equally weighted. The 3 sets of clinical activity weights modify the baseline weight to adjust for the frequency with which an activity is performed (Frequency Index, FI), the risk of patient harm (Index of Harm, IoH), and the combination of the FI and IoH.
To create the FI weights for the clinical activities, the frequency data from the completed surveys were Rasch-analyzed to create a scale of relative frequency. This scale has an arbitrary center origin of zero, so a constant was added to all the weights to make them all positive numbers. Next, the constant-adjusted FI weights were normalized by dividing each constant-adjusted FI weight by the sum of all the constant-adjusted FI weights. They were then multiplied by 100 which made the normalized FI weights sum to 100. This was repeated for the IoH weights using the IoH data from the survey. The combination of the FI and IoH weights were simply the average of the 2 weights.
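The shift-and-normalize procedure described above can be sketched as follows. The calibrations are invented for 5 activities (the real study used 202), and the shifting constant is arbitrary; the essential point is that after shifting, each set of weights is rescaled to sum to 100 and the combined weights are the simple average of the two.

```python
import numpy as np

# Hypothetical Rasch calibrations (logits) for 5 clinical activities
fi_cal  = np.array([-1.2, 0.4, 2.1, -0.3, 0.9])   # frequency scale
ioh_cal = np.array([ 0.8, -0.5, 1.4, 0.2, -1.1])  # risk-of-harm scale

def normalize(calibrations):
    """Shift the calibrations so all are positive, then rescale
    so the weights sum to 100."""
    shifted = calibrations - calibrations.min() + 0.1  # arbitrary constant
    return 100 * shifted / shifted.sum()

fi_w  = normalize(fi_cal)
ioh_w = normalize(ioh_cal)
combined_w = (fi_w + ioh_w) / 2  # equal weighting of both adjustments
```

Because each input set sums to 100, the averaged combined weights also sum to 100, so all three weighting schemes stay on the same percentage scale.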
Results: Representative Data Sample
Response Rate
After 28 days, the survey window closed (Figure 1). Among the 2000 invited participants, there were 45 bad e-mail addresses and 4 people who became ineligible to take the survey. Of the remaining 1951 invitees, 937 had completed the survey, for a 48% response rate; however, an additional quality control check using misfit analysis identified 10 raters who seemed to have answered the questions in a careless manner. Excluding these 10 raters reduced the response rate to 47.5% (Table 1).
Response rate by day for the ABFM 2022 Blueprint Survey.
ABFM 2022 Blueprint Survey Response Rate Summary
Identifying Misfitting Raters
The quality control check to identify misfitting raters compared the fit of individual raters’ responses to the risk of patient harm hierarchy. The analysis demonstrated that this hierarchy was very stable across most responders, likely because physicians have a similar sense of an activity’s risk of patient harm regardless of whether it is part of their practice. Responders whose responses deviated drastically from the hierarchy articulated by the rest of the sample were removed. The exclusion criterion removed any extreme outlying responder with a risk-of-harm outfit mean square statistic greater than or equal to 2.75, which removed 10 responders. The responses from these 10 raters appeared to reflect a response process intended to complete the survey without sensibly rating the activities. For example, 1 rater rated 199 of the 202 activities with a 3, and only 3 activities as something else. When the risk of patient harm data were analyzed excluding these 10 misfitting raters, there was a small increase in how well the activities separated the raters and how well the raters separated the activities.
Conducting a similar fit analysis on the frequency ratings was considered but ultimately dismissed. Generally, physicians have a shared sense of the risk of patient harm associated with different activities regardless of whether they perform that activity often or not. The functioning of the risk of harm rating scale supports this. Conversely, an individual physician’s scope of practice can drastically vary. This makes it difficult to differentiate variability in scope of practice from careless ratings. Instead, the same 10 people who were flagged by the risk of harm analysis were also excluded in the frequency ratings analysis. The rationale for this was that if raters did not seriously rate the risk of harm for the activity, then it seems likely that their frequency ratings for the activity (which were on the same line) would also be careless.
Representativeness of the Responding Raters
A critical issue with surveys of this nature is that the raters should be representative of the sampling frame. To determine whether there was a significant difference between the raters who responded and the sampling frame on the variables of Race, Ethnicity, Gender, IMG Status, and Degree Type, we employed χ2 tests of independence. Differences that are not statistically significant support the hypothesis that the survey responders are representative of the sampling frame on those variables. Of these variables, only race and ethnicity showed a statistically significant difference (Table 2). For these variables, the group that did not identify their race and ethnicity was underrepresented, and the white group was overrepresented. The responders in the remaining groups closely matched the sampling frame. The difference in each category between the observed proportion and the expected proportion was always less than 6%. The mean age of the raters was not statistically significantly different from that of the sampling frame, z = 0.42, P = .68. Similarly, the mean time since initial certification was not statistically significantly different from that of the sampling frame, z = 0.03, P = .98. These results support that the 927 raters who responded to the survey were generally representative of the sampling frame on the available demographic variables, which suggests a lack of nonresponse bias, at least on those variables.
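A χ2 test of independence of this kind can be sketched as follows. The counts are invented for illustration (they are not the study's actual tallies), and the 2 × 2 table compares responders against the rest of the sampling frame on a single variable, degree type.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: responders vs. the rest of the sampling
# frame, broken out by degree type (MD / DO)
table = np.array([
    [810, 117],        # responders:     MD, DO
    [76_674, 11_587],  # non-responders: MD, DO
])

chi2, p_value, dof, expected = chi2_contingency(table)
# A non-significant p-value (e.g., p >= .05) is consistent with the
# responders being representative of the frame on this variable
representative = p_value >= 0.05
```

The same test is repeated for each categorical demographic variable; continuous variables such as age are instead compared with a z test, as reported above.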
Representativeness of Survey Responders to the Sampling Frame
Results: Weighting the Content Domains
The data from the 927 completed surveys were analyzed using a Rasch rating scale model13 to create 2 calibrations for each clinical activity, 1 for relative frequency (Frequency Index, FI) and 1 for relative risk of patient harm (Index of Harm, IoH). Creating the FI and IoH weights was accomplished using a 3-step process. First, the clinical activities were calibrated on each scale (more frequently performed for the FI and greater risk of harm for the IoH, respectively). Second, a constant was added to make all the calibrations positive within their respective scale (FI and IoH). Third, the calibrations were normalized so that they summed to 100 (see Appendix A).
Assemble Information for the Board of Directors
The Board of Directors needed clear, concise, and relevant information to make an informed decision about endorsing a new blueprint. Boiling the information down into a single table would facilitate that discussion. For each content domain, that table would need the
- Number of clinical activities and the percentage of that number out of all 202 clinical activities,
- Percentage of the exam that Foundations of Care should comprise,
- Percentage of the exam that each of the other 4 content domains should comprise based on only the frequency-adjusted weights,
- Percentage of the exam that each of the other 4 content domains should comprise based on only the risk-of-harm-adjusted weights, and
- Percentage of the exam that each of the other 4 content domains should comprise based on equal weighting of both the risk-of-harm and frequency adjustments.
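A hypothetical sketch of how such a table could be assembled from per-activity combined weights follows. All numbers and domain labels are illustrative, and the sketch assumes the 4 activity-based domains share the 95% of the examination remaining after the board's fixed 5% allocation to Foundations of Care.

```python
from collections import defaultdict

# Hypothetical per-activity combined weights (summing to 100 across
# all activities) and the domain each activity belongs to
weights = [30.0, 25.0, 20.0, 15.0, 10.0]
domains = ["Acute", "Chronic", "Acute", "Emergent", "Preventive"]

FOUNDATIONS_PCT = 5.0  # policy decision by the board of directors

# Sum the activity weights within each domain
by_domain = defaultdict(float)
for w, d in zip(weights, domains):
    by_domain[d] += w

# Scale each domain's share into the 95% of the exam remaining
# after Foundations of Care is carved out
exam_pct = {d: w * (100 - FOUNDATIONS_PCT) / 100
            for d, w in by_domain.items()}
```

The resulting per-domain percentages, together with Foundations of Care at 5%, sum to 100 and form one column of the board's summary table; repeating the calculation with the FI-only and IoH-only weights fills in the other columns.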
Discussion
Overview
The goal of this manuscript was to document the process used by ABFM to create more understandable content domain categories and then to ensure that the percentage of questions assigned to each content domain made the examination reflect the current scope of family practice across the nation. The blueprint survey was administered to a random sample of Diplomates and the sample of raters who chose to participate closely matched the sampling frame demographics. Ten respondents were excluded from the sample because they appeared to be responding in a manner unrelated to the requested task. Overall, the response rate was 48%.
Blue-Ribbon Panel
The recommended content domain weights were reviewed by a blue-ribbon panel to provide feedback on the reasonableness of the recommended percentages. The panel indicated that the results seemed representative of practice.
Board of Directors Decision
The Board of Directors reviewed the practice analysis results and modified the recommended percentages slightly. In the final, board-approved version of the blueprint, the percentage for each content domain was rounded to the nearest 5% from the empirical recommendation that was based on both the frequency of occurrence and the risk of patient harm (Table 3). Additional item writing and pretesting were needed before the new blueprint could be implemented in 2025, but we could conduct a trial run using the 2024 ITE. This would give ABFM an opportunity to address any unforeseen difficulties that might arise in a lower-stakes environment.
Blueprint Domain Percentages Based on 4 Different Weighting Schemas
Improvements
From 2006 through 2024, an organ system blueprint2 was used to draft the FMC-S examinations. The 2025 blueprint has several advantages over the 2006 blueprint. First, the 2025 blueprint permits ABFM to craft examinations that are representative of the activities that family physicians across the nation are performing. Because it directly connects the examination content to these clinical activities, it also provides ABFM with a framework to evaluate how well the breadth of content is covered over time. Second, it considers the risk of patient harm associated with the clinical activities in addition to how often they are performed. Third, it makes possible future validity studies that address how well the activities are represented over several years of examinations. Keep in mind that not all activities will be present on each examination form, but any of them could be. Therefore, to prepare for the examination, one should consider the full scope represented by the activities roster (Appendix A).
Limitations
The first limitation is that the Foundations of Care content domain does not have any empirical support for the percentage of the examination that it represents beyond the careful judgment of the ABFM board of directors. This content domain is often more knowledge-based than rooted in observable activities, so estimates of how critical these knowledge topics are, or how often they are used, could not be made to weight this domain. The policy decision to set Foundations of Care at 5% was an effort to balance the need for these important topics to be included with the intention to root the examination in empirical evidence.
The second limitation is that the practice of family medicine is always changing. Being in a time of transformative change, the practice of family medicine is very likely to be influenced by technological advances such as artificial intelligence and mechanisms that facilitate telemedicine, as well as by political changes that affect the regulation of medical services, such as telemedicine regulation, medical reimbursement policies, and restrictions on what services can be provided.
The third limitation is that the sampling frame was based on family physicians who were currently certified by ABFM. Because we do not have a data sharing agreement with the American Osteopathic Board of Family Physicians (AOBFP), we were unable to include physicians who were certified exclusively by the AOBFP, but it is nearly certain that some of the 11,704 DOs in the sampling frame were also certified by the AOBFP.
Future Directions
We will need to establish a plan to validate the percentage of questions in each content domain to ensure that it continues to reflect the current practice of family medicine. We will have the ability to create subscores based on different subsets of questions for research purposes. Although the FMC-Scale is a unidimensional scale from a population perspective, we may be able to tease out relative strengths and weaknesses of examinees based on clusters of activities. This kind of information could lead to better suggestions for what Diplomates might want to include in their study plans.
Appendix A
Notes
This article was externally peer reviewed.
Conflict of interest: None.
To see this article online, please go to: http://jabfm.org/content/38/2/330.full.
- Received for publication July 30, 2024.
- Revision received September 9, 2024.
- Accepted for publication September 16, 2024.