ORIGINAL RESEARCH
Richard Guthmann, MD, MPH; Robert Martin, DO; Erin Lee, EdD; Christopher Boisselle, MD
Corresponding Author: Richard Guthmann, MD, MPH; Advocate Illinois Masonic Family Medicine Residency; University of Illinois at Chicago.
Email: Rick.Guthmann@aah.org
DOI: 10.3122/jabfm.2025.250401R1
Keywords: Clinical Decision-Making, Clinical Decision Support, Evidence-Based Medicine, Family Medicine, Large Language Models, Natural Language Processing
Dates: Submitted: 10-14-2025; Revised: 12-22-2025; Accepted: 02-02-2026
Status: In Press.
INTRODUCTION: Clinicians require concise, accurate summaries of new research to inform practice. Patient-Oriented Evidence that Matters (POEMs), published in American Family Physician, are a benchmark for summarizing primary literature in Family Medicine, while large language models (LLMs) offer scalable summarization but require rigorous evaluation. The objective of this study was to evaluate the accuracy and quality of summaries generated by large language models compared with expert-authored POEMs.
METHODS: We compared LLM-generated summaries (Microsoft Copilot, GPT-4o class) with 24 recent matched POEMs using a standardized prompt. Two trained raters independently scored each summary with a 13-item tool (score range 0–13), cataloged errors, recorded word counts, and indicated preferences on a 5-point scale.
RESULTS: LLM summaries outperformed POEMs in total score (mean 12.1 vs 10.6; mean difference 1.5, 95% CI 1.1–2.0; p<0.001), with similar lengths (328 vs 353 words; p=0.23). Errors occurred in fewer LLM summaries (2/24) than POEMs (9/24), with a mean error score difference of 20% (95% CI 7%–33%; p<0.001). POEMs most often omitted Contextual Background and Limitations; both approaches frequently omitted Clinical Applicability. Reviewer preference favored the LLM summaries (mean 2.44 on a 1–5 scale; 95% CI 2.1–2.8).
CONCLUSIONS: An enterprise LLM, prompted in POEM style, produced accurate, low-error clinical summaries that matched or exceeded expert-edited POEMs and were generally preferred by reviewers, though further research is needed to assess broader applicability and impact. These findings support pragmatic LLM-assisted summarization and highlight the need for standardized evaluation tools and for prompts that explicitly elicit clinical applicability.

