Improved Performance of GPT-4 on the Family Medicine In-Training Examination and Future Implications

ORIGINAL RESEARCH

Ting Wang, PhD; Arch G. Mainous III, PhD; Keith Stelter, MD; Thomas R. O’Neill, PhD; Warren P. Newton, MD, MPH

Corresponding Author: Ting Wang, PhD; American Board of Family Medicine

Email: twang@theabfm.org

DOI: 10.3122/jabfm.2023.230433R1

Keywords: Continuing Education, Family Medicine, Medical Education

Dates: Submitted: 11-27-2023; Revised: 02-09-2024; Accepted: 02-12-2024

OBJECTIVE: In this study, we sought to comprehensively evaluate the performance of GPT-4 (Generative Pre-trained Transformer 4) on the 2022 American Board of Family Medicine (ABFM) In-Training Examination (ITE), compared with that of its predecessor, GPT-3.5, and with the performance of family medicine residents nationally on the same exam.

METHODS: We used both quantitative and qualitative analyses. First, a quantitative analysis evaluated the model's performance metrics using zero-shot prompts (in which only the exam questions were provided, without any additional information). A qualitative analysis then examined the nature of the model's responses, the depth of its medical knowledge, and its ability to comprehend contextual or new information through chain-of-thought prompts (interactive conversation with the model).

RESULTS: This study demonstrated that GPT-4 improved significantly in accuracy over GPT-3.5, although only four months separated their release dates. With zero-shot prompts, the percentage of correct answers increased from 56% to 84%, which corresponds to an increase in scaled score from 280 to 690, a 410-point gain. Most notably, further chain-of-thought investigation revealed GPT-4’s ability to integrate new information and correct itself when needed.

CONCLUSIONS: In this study, GPT-4 demonstrated notably high accuracy, as well as rapid reading and learning capabilities. These results are consistent with previous research indicating GPT-4's significant potential to assist in clinical decision-making. Furthermore, the study highlights the essential role of physicians' critical thinking and lifelong learning skills, which was particularly evident in the analysis of GPT-4's incorrect responses. This finding underscores the indispensable human element in effectively implementing and using AI technologies in medical settings.
