American Board of Family Medicine
Lessons Learned in Evaluating Generative Artificial Intelligence (AI) for Clone Item Development in Family Medicine Certification Assessment

ORIGINAL RESEARCH

Ting Wang, PhD; David W. Price, MD; Keith Stelter, MD; Deanna Bowden, BS; Amanda Dawahare, MA; Andrew W. Bazemore, MD, MPH

Corresponding Author: Ting Wang, PhD; American Board of Family Medicine.

Email: twang@theabfm.org

DOI: 10.3122/jabfm.2025.250350R2

Keywords: Artificial Intelligence, Certification, Family Medicine, Examination Questions, Large Language Models, Medical Education

Dates: Submitted: 09-04-2025; Revised: 12-16-2025 and 01-02-2026; Accepted: 02-02-2026

Status: In Press.

BACKGROUND: Generative artificial intelligence offers the potential to accelerate the development of "clone items"—questions assessing the same testing point with varied surface features. The American Board of Family Medicine (ABFM) conducted a structured evaluation of GPT-4o's utility within a certification context to determine whether large language models (LLMs) can enhance item-writing efficiency.

METHODS: Over nine months (2024–2025), ABFM used GPT-4o to generate 25 clone items from validated sources. The methodology included eight iterative prompting cycles and a multi-stage expert review. Reviewers assessed clinical accuracy, distractor plausibility, rationale quality, and reference appropriateness.
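The multi-stage expert review described above can be thought of as a checklist over fixed quality dimensions. The sketch below is a hypothetical illustration only — the item fields, check names, and thresholds are placeholders, not the ABFM's actual rubric or workflow; it shows how drafts might be triaged on the four dimensions the abstract names before reaching physician reviewers.

```python
from dataclasses import dataclass

@dataclass
class CloneItem:
    """A generated question variant; field names are illustrative only."""
    stem: str          # clinical vignette
    options: list      # answer options; options[0] is the keyed answer
    rationale: str     # explanation of the correct answer
    references: list   # supporting citations

# Each check mirrors one review dimension named in the abstract
# (clinical accuracy would require human judgment and is omitted);
# the thresholds here are placeholders.
CHECKS = {
    "reference_appropriateness": lambda i: len(i.references) > 0,
    "distractor_plausibility": lambda i: len(i.options) >= 4,
    "rationale_quality": lambda i: len(i.rationale.split()) >= 20,
}

def review(item: CloneItem, checks: dict = CHECKS) -> list:
    """Return the review dimensions the item fails; an empty
    list means the draft proceeds to expert review as-is."""
    return [name for name, passes in checks.items() if not passes(item)]
```

As the study's results make clear, automated triage of this kind can only flag obvious defects; expert physician review remained the deciding step for every item.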

RESULTS: While 23 items were eventually deemed usable, all required extensive human editing. Systematic failures included hallucinated or inaccessible references (22/25), clinically implausible stems (18/25), and rationales lacking necessary clinical nuance (16/25). The review and validation burden outweighed the efficiency gains of automated drafting.

DISCUSSION: Findings suggest that LLMs are currently unsuitable for autonomous item authorship; they are best used as assistive tools for drafting preliminary vignettes. Future integration requires medically specialized models and automated evidence-validation platforms.

CONCLUSION: GPT-4o-based generation did not meet thresholds for standalone use. Results underscore the necessity of augmented, rather than automated, assessment strategies where rigorous human oversight ensures clinical fidelity and validity.


© 2026 American Board of Family Medicine
