Improved Performance of GPT-4 on the Family Medicine In-Training Examination and Future Implications

ORIGINAL RESEARCH

Ting Wang, PhD; Arch G. Mainous III, PhD; Keith Stelter, MD; Thomas R. O’Neill, PhD; Warren P. Newton, MD, MPH

Corresponding Author: Ting Wang, PhD; American Board of Family Medicine

Email: twang@theabfm.org

DOI: 10.3122/jabfm.2023.230433R1

Keywords: Continuing Education, Family Medicine, Medical Education

Dates: Submitted: 11-27-2023; Revised: 02-09-2024; Accepted: 02-12-2024   



OBJECTIVE: In this study, we sought to comprehensively evaluate the performance of GPT-4 (Generative Pre-trained Transformer 4) on the 2022 American Board of Family Medicine (ABFM) In-Training Examination (ITE), comparing it with its predecessor, GPT-3.5, and with the national performance of family medicine residents on the same exam.

METHODS: We used both quantitative and qualitative analyses. First, a quantitative analysis evaluated the model's performance metrics using zero-shot prompting (only the exam questions were provided, without any additional information). A qualitative analysis then examined the nature of the model's responses, the depth of its medical knowledge, and its ability to comprehend contextual or new information through chain-of-thought prompting (interactive conversation with the model).
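
For illustration only, the sketch below shows one way a zero-shot evaluation of this kind could be scripted against a chat-completions API. The model identifier, question format, and answer-extraction logic are assumptions made for this example, not details taken from the study.

```python
# Illustrative sketch only: the study's actual prompts, model versions, and
# scoring pipeline are not reproduced here; identifiers below are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def zero_shot_answer(question: str, choices: dict[str, str], model: str = "gpt-4") -> str:
    """Send one multiple-choice item with no extra context and return the chosen letter."""
    options = "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    prompt = (
        f"{question}\n\n{options}\n\n"
        "Answer with the single letter of the best option."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep scoring as deterministic as possible
    )
    return response.choices[0].message.content.strip()[0].upper()


def score(items: list[dict], model: str) -> float:
    """Proportion of items answered correctly under zero-shot prompting."""
    correct = sum(
        zero_shot_answer(item["question"], item["choices"], model) == item["key"]
        for item in items
    )
    return correct / len(items)
```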

RESULTS: GPT-4 showed a significant improvement in accuracy over GPT-3.5 across the four-month interval between their respective release dates. The percentage of correct answers with zero-shot prompting increased from 56% to 84%, which translates to a scaled-score increase from 280 to 690, a gain of 410 points. Most notably, further chain-of-thought investigation revealed GPT-4's ability to integrate new information and to self-correct when needed.

CONCLUSIONS: GPT-4 demonstrated notably high accuracy, as well as rapid reading and learning capabilities. These results are consistent with previous research indicating GPT-4's significant potential to assist in clinical decision making. The study also highlights the essential role of physicians' critical thinking and lifelong learning skills, which was particularly evident in the analysis of GPT-4's incorrect responses, and it underscores the indispensable human element in effectively implementing and using AI technologies in medical settings.
