Abstract
Background: The potential for machine learning (ML) to enhance the efficiency of medical specialty boards has not been explored. We applied unsupervised ML to identify archetypes among American Board of Family Medicine (ABFM) Diplomates regarding their practice characteristics and motivations for participating in continuing certification, then examined associations between motivation patterns and key recertification outcomes.
Methods: Diplomates responding to the 2017 to 2021 ABFM Family Medicine continuing certification examination surveys selected motivations for choosing to continue certification. We used Chi-squared tests to examine differences in the proportions of Diplomates failing their first recertification examination attempt who endorsed different motivations for maintaining certification. Unsupervised ML techniques were applied to generate clusters of physicians with similar practice characteristics and motivations for recertifying. Controlling for physician demographic variables, we used logistic regression to examine the effect of motivation clusters on recertification examination success and validated the ML clusters by comparison with a previously created classification schema developed by experts.
Results: ML clusters largely recapitulated the intrinsic/extrinsic framework devised by experts previously. However, the identified clusters achieved a more equal partitioning of Diplomates into homogeneous groups. In both ML and human clusters, physicians with mainly extrinsic or mixed motivations had lower rates of examination failure than those who were intrinsically motivated.
Discussion: This study demonstrates the feasibility of using ML to supplement and enhance human interpretation of board certification data. We discuss implications of this demonstration study for the interaction between specialty boards and physician Diplomates.
- ABFM
- Accreditation
- Certification
- Chi-Square Test
- Family Medicine
- Information Technology
- Logistic Regression
- Machine Learning
- Motivation
- Physicians
- Specialty Boards
Introduction
Machine learning (ML) is a type of artificial intelligence (AI) that uses computers to “systematically apply algorithms to synthesize the underlying relationships among data and information.”1 ML is said to be unsupervised when algorithms are applied to data that has not been labeled; the computer identifies patterns or relationships among data without any initial human guidance.2 Many industries use ML for purposes including text generation and analysis, speech recognition, image recognition, data analytics, and algorithmic recommendations of content to users.3 Although more validation, research, quality control and oversight are needed, early and proposed uses of ML in health care include reading images,4–6 making treatment recommendations,7 patient and population risk-assessment,8,9 natural language processing,10 and various administrative tasks.11
Less explored are potential applications of ML to help medical specialty boards tailor their certification programs to individual physicians (Diplomates). ML could potentially assist medical specialty boards in several areas, including question development for summative and formative (learning) assessments, preliminary analysis of physician data and survey responses, elucidating patterns of physician engagement with and feedback on certification activities, identification of demographic and performance trends, and preliminary identification of physicians at risk of voluntarily or involuntarily failing to meet standards for continuing certification. In a previous study, Peabody and colleagues analyzed motivations for physician involvement with American Board of Family Medicine (ABFM) certification activities.12 Building on this work, we sought to apply unsupervised ML techniques to identify ABFM Diplomate motivational patterns and contrast the results with these previous “human-defined” classifications. To explore whether the generated clusters reflect meaningful aggregations of Diplomates, we examined the association of motivation with rates of failure on the ABFM recertification examination for both the human-defined groups and the ML clusters.
Methods
Data
We combined data from the 2017 to 2021 ABFM Family Medicine continuing certification examination surveys. Because the questionnaire is mandatory, this captures all board-certified family physicians in a yearly cohort seeking to maintain their certification. Prior research has demonstrated its reliability as a representative sample of the overall population of family physicians.13 Most Diplomates will have only completed 1 continuing certification examination survey; if applicants had multiple demographic surveys (eg, due to a previous examination failure in the time period), we took only the earliest available. Consistent with prior literature, we excluded physicians who did not provide direct outpatient continuity care. Assuming practice characteristics to be important determinants of external motivations,14 we incorporated data on practice type/ownership/size as well as demographic variables (age, gender, international medical graduate [IMG] status, and degree [MD vs DO]) as confounders in our regression analysis. We merged the continuing certification survey data with performance data for Diplomates taking the 1-day secure continuing certification examination between 2017 and 2021.
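The record-selection rule described above can be illustrated with a minimal Python sketch. It is not the authors' code (the study reports using R and Stata). The field names `diplomate_id`, `year`, and `outpatient_continuity` are hypothetical, and applying the continuity-care exclusion after choosing the earliest survey is an assumption, since the text does not state the order.

```python
def earliest_survey_per_diplomate(surveys):
    """Keep one survey per Diplomate (the earliest), then apply the exclusion.

    `surveys` is a list of dicts with hypothetical keys:
    diplomate_id, year, outpatient_continuity (bool).
    """
    earliest = {}
    for record in surveys:
        key = record["diplomate_id"]
        # Replace the stored record only if this one is from an earlier year.
        if key not in earliest or record["year"] < earliest[key]["year"]:
            earliest[key] = record
    # Assumed order: exclusion is applied to the retained (earliest) survey.
    return [r for r in earliest.values() if r["outpatient_continuity"]]
```

A Diplomate with surveys in 2017 and 2019 would contribute only the 2017 record under this rule.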
Variables
During unsupervised learning, data were limited to 4 questions from the recertification survey. The first question was “Why are you seeking to continue your ABFM certification at this time?” This question was a select-all-that-apply format with 11 possible binary (yes/no) responses, including “other” with free-text response. Because we sought to minimize the influence of human assumptions in this study, we omitted, rather than manually reclassified, the free-text responses. The remaining variables included practice type (11 categories), ownership status (5 responses), and practice size (4 responses) (Table 1).
We limited the analysis to ABFM Diplomates who selected the 1-day examination, as the continuous certification process (FMCLA) was not a fully approved option until 2021. We selected the year of the first examination attempt within the study period. Physicians with ongoing failed attempts that began before 2017 were excluded. We used the pass/fail determination and the standardized examination score for the first examination attempt for each Diplomate. Our outcome was whether a Diplomate failed their first attempt at the continuing certification examination.
Motivational Patterns
We used unsupervised ML (unprompted by the previous human-derived classification) to identify 5 mutually exclusive clusters of Diplomate motivation. (See Appendix for details). For ease of reference, we gave each cluster a descriptive title based on its predominant characteristic.
For comparison with the ML-derived clusters, we replicated a previous human-identified pattern that classified the variables a priori into intrinsic vs extrinsic motivations and subdivided the physicians into 3 groups: “Extrinsic Only,” “Intrinsic Only,” and “Mixed.”13 A physician was classified as “Extrinsic Only” if only 1 or more of the following 3 options were selected: Required by my employer; Required for hospital privileges/credentialing; and/or Required by 1 or more payer/insurance company. A physician was classified as “Intrinsic Only” if only 1 or more of the following 7 options were selected: Maintain professional image; Personal preference; Professional advancement; Maintain or improve patient satisfaction; Patients prefer being treated by board certified physician; Certification program helps me update my medical knowledge; and/or Certification program helps me monitor or improve the quality of my patient care. All remaining physicians, who definitionally had selected at least 1 motivation from both categories (except a trivial number who selected only “Other”), were assigned to “Mixed.” Because the original article did not incorporate information beyond motivation, we did not devise new rules to reassign Diplomates or further subdivide the original 3 groups into subtypes based on differences in practice type/ownership/size.
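The human-defined assignment rule above is simple enough to express directly. The following Python sketch is illustrative only and is not the authors' code. The short option keys are hypothetical stand-ins for the survey wording, and ignoring “Other” when deciding between the groups is an assumption drawn from the statement that all remaining physicians had selected at least 1 motivation from both categories or only “Other.”

```python
# Hypothetical short keys standing in for the 3 extrinsic survey options.
EXTRINSIC = {"employer", "hospital_credentialing", "payer"}

# Hypothetical short keys standing in for the 7 intrinsic survey options.
INTRINSIC = {
    "professional_image",
    "personal_preference",
    "professional_advancement",
    "patient_satisfaction",
    "patient_preference",
    "update_knowledge",
    "quality_of_care",
}


def classify_motivation(selected):
    """Return the human-defined group for a collection of selected option keys."""
    chosen = set(selected)
    has_extrinsic = bool(chosen & EXTRINSIC)
    has_intrinsic = bool(chosen & INTRINSIC)
    if has_extrinsic and not has_intrinsic:
        return "Extrinsic Only"
    if has_intrinsic and not has_extrinsic:
        return "Intrinsic Only"
    # Either both categories were selected, or only "other" was selected.
    return "Mixed"
```

For example, `classify_motivation({"employer", "payer"})` falls in “Extrinsic Only,” whereas adding any intrinsic option moves the Diplomate to “Mixed.” With only 3 possible outputs, the rule cannot separate Diplomates who share a category but differ in which options they chose.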
Statistical Analysis
We used Chi-squared tests to examine the difference in the proportion of Diplomates failing their first examination attempt who endorsed each of the 11 motivations. Average rates of examination failure on the initial attempt were calculated for the 3 human classification groups and the clusters generated from unsupervised ML. We then used logistic regression to examine the effect of motivation holding other demographic variables constant. Age was divided into tertiles; gender, degree, and IMG status were used as binary variables.
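For readers unfamiliar with the first step, the comparison for a single motivation reduces to a 2 × 2 table (endorsed vs not endorsed, failed vs passed). The Python sketch below computes a Pearson Chi-squared statistic without continuity correction; whether the original Stata analysis applied a correction is not stated, so that choice is an assumption. The function name is hypothetical, and the sketch is not the authors' code.

```python
import math


def chi_squared_2x2(fail_yes, pass_yes, fail_no, pass_no):
    """Pearson Chi-squared test for a 2x2 table, no continuity correction.

    Rows: Diplomates who endorsed the motivation (yes) or did not (no).
    Columns: failed or passed the first examination attempt.
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    observed = [[fail_yes, pass_yes], [fail_no, pass_no]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of motivation and outcome.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```

The function would be called once per motivation with the 4 observed counts. It assumes every expected count is nonzero.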
The ML components of this analysis were conducted with R 4.2.3. All other components were conducted with Stata 17.0.
As part of routine internal program evaluation, this study was conducted with ethical approval from the American Academy of Family Physicians Institutional Review Board.
Results
ML Clusters
Table 2 shows the 5 ML identified clusters, along with the proportion of Diplomates within that cluster who endorsed a particular motivation. Highlighted motivations were used to give each cluster a descriptive title. The table also displays similar metrics for the 3 human-derived subgroups. Human sorting resulted in imbalanced cohort groupings, with more than 60% assigned to the “Mixed” group.
Table 3 shows how the practice characteristics map to each cluster. There is an association of certain practice type/sizes/ownerships with a specific cluster of extrinsic motivation. For example, almost 50% of Diplomates working at independently owned sites were assigned to Cluster 1 (Credential and Payer), but only 4% and 2% were assigned to Cluster 2 (Employer) and Cluster 3 (Predominantly Extrinsic). The opposite trend is seen for nonphysician owned sites like academic medical and Federally Qualified Health Centers. Within almost all responses to practice characteristics there were Diplomates assigned to Cluster 4 (Mixed motivations) because they had extrinsic motivations associated with their practice location but also reported a number of intrinsic motivations.
Motivation and Initial Recertifying Examination Success
Table 4 shows associations between each Diplomate survey question on motivation and failure rate for the initial attempt at the continuing certification examination. Diplomates had statistically significantly higher rates of failure if they had endorsed motivations of professional advancement (P < .001), a desire to maintain/improve patient satisfaction (P < .001), and a desire to monitor or improve patient care (P = .013). They had lower rates of failure when endorsing motivations of image (P = .041), patient preference (P < .001), and employer (P < .001), credential (P < .001), and payer (P < .001) requirements.
Figure 1 shows the adjusted probability of failing a continuing certification examination attempt after controlling for potential confounding from gender, age, IMG status, and type of degree (MD vs DO). Four of 5 ML- and 2 of 3 human-created classifications achieved a statistically significant separation from the overall rate of failure (7.2%). Diplomates clustered into “Mainly Extrinsic” or “Broadly Motivated” had a significantly lower probability of failing (4.3% [P < .001] and 5.1% [P < .001], respectively). Diplomates in the clusters of “Credential + Payer” and “Mainly Intrinsic” had statistically significantly higher rates of failure at 8.7% (P = .002) and 9.1% (P < .001), respectively. The fail rate for the “Employer” cluster was not statistically significantly different from the overall mean (7.6% vs 7.2%, P = .378). Among human-created classifications, the “Mixed” subgroup had a lower probability of failure (5.7%, P < .001) and the “Intrinsic Only” subgroup had a higher probability of failure (11.1%, P < .001). There was no statistically significant difference observed for the “Extrinsic Only” human-created group (6.4% vs 7.2%, P = .129).
Discussion
This novel study explored the feasibility of a medical specialty board using unsupervised ML to gain insight from complex datasets. We identified new patterns of Diplomate motivation for participating in continuing board certification, compared the results to a previous human-defined characterization, and retrospectively associated both machine and human-derived clusters with the probability of failing the 1-day continuing certification examination.
Despite not being fed prespecified information on the difference between intrinsic and extrinsic motivations, the clusters derived from unsupervised ML align with literature that characterizes motivation as extrinsic or intrinsic.15 Three clusters were consistent with only extrinsic motivations, 1 with only intrinsic motivations, and the last had a mixture. Though the computer does much of the heavy lifting, obviating the need for a human to manually think through and create algorithms that classify each Diplomate, this is not the only possible outcome. The final analysis is still heavily influenced by human decisions such as selection of model(s) to use, variables to include, and the desired number of clusters to generate. Variations in these human decisions may lead to substantively different cluster assignments for individual Diplomates as well as greater or lower feasibility of implementing interventions commensurate with how many unique clusters are generated.
As demonstrated here, 1 major advantage of ML is the potential to achieve more balanced and granular groups, especially when working with complicated datasets containing many interrelated variables that are not always intuitive for humans to understand. Given the small number of classification categories and straightforward assignment rules, it is not surprising that roughly 60% of Diplomates were originally assigned to the “Mixed” motivation group in the article by Peabody et al.13 In this study, as shown in Table 2, many of the cells in the ML-identified clusters had values greater than 70% for individual motivations; several contain values between 80% and 93%. In contrast, only 2 cells in the human-identified clusters contain values more than 70%. This illustrates the potential of ML to generate more specific clusters than human classification.
After analysis with unsupervised learning, there were 5 clusters of approximately equal size; roughly 80% of Diplomates clustered into 1 of the 4 groups found to have statistically significantly different rates of failure relative to the baseline. We also found the human-defined “intrinsic only” group was the strongest positive predictor of recertification examination failure, even though the ML clusters seem to have done better at identifying physicians at noticeably below baseline risk. This does not imply that human-defined groups were better or worse than the ML clusters at predicting examination failure. Creating groups to predict examination failure was not a goal of the Peabody et al13 article. The ML techniques in this article were selected to test the feasibility of identifying coherent clusters of Diplomates reporting similar motivations. However, the higher rate of failure seen in the intrinsic only group suggests that there may be a threshold effect where a single extrinsic motivation is protective. Although self-determination theory and the learning literature note the importance of internal motivation for learning,16–20 these results reinforce that at least some elements of extrinsic motivation are important for learning outcomes – in other words, “stakes matter.”21–24 It seems plausible that some Diplomates with solely intrinsic motivations are at higher risk of examination failure, perhaps because they may not have immediate external consequences if they fail to recertify. If lack of extrinsic motivation is thought to be a potentially important predictor in future ML models looking at examination failure, a priori designation of extrinsic and intrinsic variables would be required. Research with ML remains an iterative process incorporating human hypotheses and interpretation of results and their implications.
Implications
This pilot illustrates the potential for certifying boards to use ML to create more tailored strategies for Diplomate communication. Boards might use differences in motivational patterns between Diplomates who perform well on board certification activities and those with lesser performance to develop targeted, proactive outreach and “coaching” of physicians at risk of not maintaining, or of losing, their certification. For example, Diplomates assigned to Cluster 1 in our classification (commonly having at least partial ownership in practice and reporting motivations of only credential and/or payer requirements) might respond positively to outreach emphasizing how board certification aligns with expectations of insurance companies and hospitals.25 Someone in Cluster 5 (commonly an employee at a small independent practice who selected multiple intrinsic motivations and no extrinsic motivations) may be less interested in evidence on economic benefits of board certification and more interested in literature demonstrating how board certification can maintain and enhance the quality of care he/she is able to provide to patients.26 Medical specialty boards could seek input from their Diplomates on the content and wording of tailored communications and evaluate the effect of these tailored messages on motivation, patterns of engagement with different parts of continuing certification, and outcomes (eg, examination performance, improvement in knowledge) over time.
A broader vision for ML in continuous board certification includes facilitating a bespoke process of continuous learning and improvement. Tailored Diplomate outreach could help increase engagement and retention in continuous learning through board certification. However, motivation patterns may differ between formative/lower-stakes activities and higher stakes activities. Intrinsic motivation might drive participation or performance on formative components, whereas those who are primarily extrinsically motivated might be doing ‘just enough to get through’ and focus their energy on the summative component. Therefore, associations between the motivational patterns identified in this study and engagement in and performance on other parts of continuing certification should be explored.
Beyond motivational patterns, ML’s ability to identify patterns from complex data sets can assist specialty boards in evolving and personalizing other parts of the continuous certification process. For example, ML could help identify individual educational gaps based on self-assessment module, examination performance, or real-world practice data, and suggest journal articles, conferences, or other continuing medical education (CME) activities for Diplomates to pursue. These “nudges”27 could potentially be tailored based on Diplomate motivational profiles. Applying unsupervised ML techniques to data on self-reported scope of practice, Medicare claims, or other registry data (eg, the ABFM PRIME registry) might help boards suggest personalized areas for practice improvement to individual Diplomates. Such microsegmentation of Diplomates by preferences and knowledge gaps, paired with techniques such as spaced repetition,28 could ultimately transform online certification platforms in a manner akin to those currently deployed in the language learning environment, such as Duolingo or Rosetta Stone®. ML’s data analysis ability could also help boards prioritize areas for development of self-assessment, secure examination questions, or performance improvement activities. ML could, more efficiently than human analysis, identify predominant educational gaps by Diplomate geographic location that can serve as needs assessment data for organizational or state chapter CME providers.
As in society at large, authors are writing about both ML’s potential to transform health care and health care education and the ethical implications of its use.29–31 A full discussion of the ethical implications of ML in board certification is beyond the scope of this article. Briefly, however, boards should consider issues such as bias in ML algorithms, transparency of ML-assisted decision-making processes, implications for question development, and question and data privacy and security as they integrate ML into their processes. Boards should also consider how physicians will use AI and ML to address clinical questions in practice, and to what degree to allow or account for it on summative assessments.32,33
Limitations
Although we were able to demonstrate promising patterns in associations between ML-derived motivational clusters and examination performance, several limitations should be considered. Motivation may be a signal for other behaviors linked with (rather than a direct cause of) poor examination outcomes. For example, examination preparation or engagement with certification in general may vary across different motivational clusters. Proximity to retirement and motivation to maintain certification may be related, although we controlled for age (often collinear with time in practice) in this analysis. Using supervised learning on the full breadth of ABFM data (including data such as time spent in self-assessment modules) to proactively predict failure to pass an examination is an area of ongoing research. With increasing numbers of Diplomates opting for the continuous Family Medicine Certification Longitudinal Assessment in lieu of the 1-day secure examination, a parallel analysis should be conducted to see if these motivational patterns hold (or how they differ) using this different high-stakes examination format.
Our analysis was limited to family physicians engaged in outpatient continuity care. Further work is needed to assess patterns among physicians in other specialties as well as family physicians working as administrators or in focal settings [for example, emergency, hospital, urgent care, exclusively sports medicine]. We lack data to assess relationships between motivation and study behavior, and we anticipate further effort to understand how motivation relates to certification activities beyond the 1-day, secure, high-stakes recertification examination.
Outputs of ML techniques, as with traditional statistics, reflect the limitations of their underlying data. Diplomate motivation data gathered via the recertification survey is binary. Although we could determine the presence or absence of each motivation, it was not possible to determine the most important or the relative strength of individual motivations for a given individual. It is possible that some Diplomates felt extremely strongly about 1 motivation above the rest; an algorithm on data that included additional granularity about strength of motivation might result in a different cluster assignment. It is possible that some Diplomates did not feel strongly about any motivation and randomly selected just 1 because completing the question was required to advance the survey. Conversely, though there is no current benefit to reporting a large number of motivations, some Diplomates may have chosen to deliberately overstate their motivations to “virtue signal.” ML alone could not address these limitations without Diplomate discussions or changes to the Diplomate survey. If this added level of granularity is pursued, it should be done with attention to minimizing Diplomate administrative burden.
Conclusion
Large language models are already impacting clinical care delivery34,35; we anticipate a need for commensurate changes to keep the continuous certification paradigm relevant. This study demonstrates both the feasibility and potential impact of applying ML techniques to better understand physician Diplomate motivations and preferences. It also reveals how such techniques can supplement and efficiently enhance human judgment in interpreting certification data. The clusters we identified with unsupervised learning may help boards better understand certification-related motivational patterns among Diplomates. They could also help boards develop targeted strategies for outreach to physicians who may be at risk of failing to maintain board certification. Beyond this initial foray, we imagine how ML-derived motivational clusters, and ML techniques in general, might help boards improve physician engagement across the components of continuous specialty certification. Additional feasibility studies, incorporating Diplomate feedback, will be important to explore how ML techniques can be used to improve the development, applicability, and outcomes of board certification activities and increase the value of board certification to Diplomates.
Acknowledgments
The authors thank Zachary Morgan, MS, and Shawn Reynolds for their assistance with data access.
Appendix.
Unsupervised Machine Learning Methods Used in This Study
Because our training data consisted solely of categorical variables, we used k-medoids with Hamming distances.1 A medoid is the representative object of a cluster, such that each other member of the cluster is more similar to that medoid than to the medoid of any other cluster. The Hamming distance between 2 Diplomates is the total number of variables where they have different responses (eg, the Hamming distance between 2 Diplomates is 1 if they have identical responses for each motivation component as well as the same category of practice ownership/size but 1 works at a staff-model health maintenance organization and the other works in a federally qualified health center). Because we were attempting to perform a cluster analysis on thousands of applicants, we used a sampling approach to reduce computing time and avoid random-access memory storage limitations. We used the “cluster” package in R to implement CLARA (Clustering Large Applications), which extends k-medoids methods by working on samples of the data. To minimize the possibility of cherry-picking results when selecting the number of clusters to use in the final analysis, we used validated clustering approaches including examining the silhouette, gap statistic, and total within-cluster sum of squares. This used the “factoextra” package with the potential number of clusters ranging from 2 to 10 and nboot set to 500 iterations. After preliminary examination of cluster silhouettes, gap statistics, and within-cluster sum of squares to identify the optimal number of clusters, we settled on having the unsupervised learning generate 5 mutually exclusive clusters; for ease of reference and graphical display, we gave each cluster a descriptive title based on the motivations of the medoid of that cluster.
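To make the two core ideas concrete, the following Python sketch implements the Hamming distance as defined above and a naive alternating k-medoids loop. It is an illustration under stated assumptions, not the analysis code: the study used CLARA from the R “cluster” package, which repeats k-medoids on samples of the data, whereas this sketch runs on the full list and would be slow for thousands of records. The function names, the random initialization, and the tuple encoding of responses are hypothetical.

```python
import random


def hamming(a, b):
    """Number of variables on which two equal-length response tuples differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of variables")
    return sum(x != y for x, y in zip(a, b))


def assign(records, medoids):
    """Index of the nearest medoid (by Hamming distance) for every record."""
    return [
        min(range(len(medoids)), key=lambda k: hamming(r, medoids[k]))
        for r in records
    ]


def k_medoids(records, k, seed=0, max_iter=100):
    """Naive alternating k-medoids; returns (medoids, labels).

    Not CLARA: no sampling is done, and the result depends on the random start.
    """
    rng = random.Random(seed)
    medoids = rng.sample(records, k)
    labels = assign(records, medoids)
    for _ in range(max_iter):
        new_medoids = []
        for c in range(k):
            members = [r for r, lab in zip(records, labels) if lab == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # New medoid: the member with the smallest total distance to its cluster.
            best = min(members, key=lambda m: sum(hamming(m, r) for r in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
        labels = assign(records, medoids)
    return medoids, labels
```

Using the example from the text, two Diplomates encoded as `("yes", "no", "staff-model HMO")` and `("yes", "no", "FQHC")` have a Hamming distance of 1. Because the medoid is always an actual record, its responses can be read directly to give each cluster a descriptive title.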
Reference
- 1.
Notes
This article was externally peer reviewed.
Funding: This work was internally funded by the American Board of Family Medicine. The authors report no competing interests.
Conflict of interest: None.
To see this article online, please go to: http://jabfm.org/content/37/2/279.full.
- Received for publication October 15, 2023.
- Revision received November 22, 2023.
- Accepted for publication December 4, 2023.