ORIGINAL RESEARCH
Ting Wang, PhD; Arch G. Mainous III, PhD; Keith Stelter, MD; Thomas R. O’Neill, PhD; Warren P. Newton, MD, MPH
Corresponding Author: Ting Wang, PhD; American Board of Family Medicine
Email: twang@theabfm.org
DOI: 10.3122/jabfm.2023.230433R1
Keywords: Continuing Education, Family Medicine, Medical Education
Dates: Submitted: 11-27-2023; Revised: 02-09-2024; Accepted: 02-12-2024
OBJECTIVE: In this study, we sought to comprehensively evaluate the performance of GPT-4 (Generative Pre-trained Transformer 4) on the 2022 American Board of Family Medicine (ABFM) In-Training Exam (ITE), compared with that of its predecessor, GPT-3.5, and with the performance of family medicine residents nationwide on the same exam.
METHODS: We used both quantitative and qualitative analyses. First, a quantitative analysis evaluated the model's performance metrics using zero-shot prompts (in which only the exam questions were provided, without any additional information). A qualitative analysis then examined the nature of the model's responses, the depth of its medical knowledge, and its ability to comprehend contextual or new information through chain-of-thought prompts (interactive conversation with the model).
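The zero-shot scoring step described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the function name `zero_shot_accuracy`, the single-letter answer format, and the toy data are all assumptions introduced here, and no ITE items or model responses are reproduced.

```python
from typing import Sequence


def zero_shot_accuracy(model_answers: Sequence[str], answer_key: Sequence[str]) -> float:
    """Return the fraction of items where the model's chosen option matches the key.

    Hypothetical helper: assumes each answer is a single option letter and that
    the two sequences are aligned item by item.
    """
    if len(model_answers) != len(answer_key):
        raise ValueError("answer lists must have the same length")
    if not answer_key:
        raise ValueError("answer key is empty")
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        m.strip().upper() == k.strip().upper()
        for m, k in zip(model_answers, answer_key)
    )
    return correct / len(answer_key)


# Toy data for illustration only; not items or responses from the ITE.
toy_key = ["A", "C", "B", "D", "E"]
toy_model = ["A", "c", "B", "A", "E"]
toy_accuracy = zero_shot_accuracy(toy_model, toy_key)
print(f"{toy_accuracy:.0%}")  # prints "80%" (4 of 5 toy items match)
```

Converting a percent-correct figure to a scaled score requires the exam's own scoring table, which is not given here, so the sketch stops at raw accuracy.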
RESULTS: GPT-4 showed a significant improvement in accuracy over GPT-3.5 across the four-month interval between their respective release dates. The percentage of correct answers with zero-shot prompts increased from 56% to 84%, which translates to a scaled score increase from 280 to 690, a gain of 410 points. Most notably, further chain-of-thought investigation revealed GPT-4’s ability to integrate new information and to self-correct when needed.
CONCLUSIONS: In this study, GPT-4 demonstrated notably high accuracy, as well as rapid reading and learning capabilities. These results are consistent with previous research indicating GPT-4's significant potential to assist in clinical decision-making. Furthermore, the study highlights the essential role of physicians' critical thinking and lifelong learning skills, as is particularly evident from the analysis of GPT-4's incorrect responses. This underscores the indispensable human element in effectively implementing and using AI technologies in medical settings.