ORIGINAL RESEARCH
Ting Wang, PhD; David W. Price, MD; Keith Stelter, MD; Deanna Bowden, BS; Amanda Dawahare, MA; Andrew W. Bazemore, MD, MPH
Corresponding Author: Ting Wang, PhD; American Board of Family Medicine.
Email: twang@theabfm.org
DOI: 10.3122/jabfm.2025.250350R2
Keywords: Artificial Intelligence, Certification, Family Medicine, Examination Questions, Large Language Models, Medical Education
Dates: Submitted: 09-04-2025; Revised: 12-16-2025 and 01-02-2026; Accepted: 02-02-2026
Status: In Press.
BACKGROUND: Generative artificial intelligence offers the potential to accelerate the development of "clone items"—questions assessing the same testing point with varied surface features. The American Board of Family Medicine (ABFM) conducted a structured evaluation of GPT-4o's utility within a certification context to determine whether large language models (LLMs) can enhance item-writing efficiency.
METHODS: Over nine months (2024–2025), the ABFM used GPT-4o to generate 25 clone items from validated source items. The methodology included eight iterative prompting cycles and a multi-stage expert review. Reviewers assessed clinical accuracy, distractor plausibility, rationale quality, and reference appropriateness.
RESULTS: Although 23 of the 25 items were eventually deemed usable, all required extensive human editing. Systematic failures included hallucinated or inaccessible references (22/25), clinically implausible stems (18/25), and rationales lacking necessary clinical nuance (16/25). The review and validation burden outweighed the efficiency gains of automated drafting.
DISCUSSION: The findings suggest that LLMs are currently unsuitable for autonomous item authorship and are best used as assistive tools for drafting preliminary vignettes. Future integration will require medically specialized models and automated evidence-validation platforms.
CONCLUSION: GPT-4o-based item generation did not meet quality thresholds for standalone use. The results underscore the need for augmented, rather than automated, assessment-development strategies in which rigorous human oversight ensures clinical fidelity and validity.

