American Board of Family Medicine
Large Language Model versus Clinician-Written Summaries of Research Papers

ORIGINAL RESEARCH

Richard Guthmann, MD, MPH; Robert Martin, DO; Erin Lee, EdD; Christopher Boisselle, MD

Corresponding Author: Richard Guthmann, MD, MPH; Advocate Illinois Masonic Family Medicine Residency; University of Illinois at Chicago. 

Email: Rick.Guthmann@aah.org

DOI: 10.3122/jabfm.2025.250401R1

Keywords: Clinical Decision-Making, Clinical Decision Support, Evidence-Based Medicine, Family Medicine, Large Language Models, Natural Language Processing

Dates: Submitted: 10-14-2025; Revised: 12-22-2025; Accepted: 02-02-2026      

Status: In Press.

INTRODUCTION: Clinicians require concise, accurate summaries of new research to inform practice. Patient-Oriented Evidence that Matters (POEMs), published in American Family Physician, are a benchmark for summarizing primary literature in Family Medicine, while large language models (LLMs) offer scalable summarization but require rigorous evaluation. The objective of this study was to evaluate the accuracy and quality of LLM-generated summaries compared with expert-authored POEMs.

METHODS: In this study, we compared LLM-generated summaries (Microsoft Copilot, GPT-4o class) with 24 recent matched POEMs using a standardized prompt. Two trained raters independently scored each summary with a 13-item tool (score range 0–13), cataloged errors, recorded word counts, and indicated preferences on a 5-point scale.

RESULTS: LLM summaries outperformed POEMs in total score (mean 12.1 vs 10.6; mean difference 1.5, 95% CI 1.1–2.0; p<0.001), with similar lengths (328 vs 353 words; p=0.23). Errors occurred in fewer LLM-generated summaries (LLM-DOCS; 2/24) than POEMs (9/24), with a mean error score difference of 20% (95% CI 7%–33%; p<0.001). POEMs most often omitted Contextual Background and Limitations; both approaches frequently omitted Clinical Applicability. Reviewer preference favored LLM-DOCS (mean 2.44 on a 1–5 scale; 95% CI 2.1–2.8).
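The headline comparison above is a paired analysis: each LLM summary is scored against its matched POEM, and the mean of the per-pair score differences is reported with a 95% confidence interval. A minimal sketch of that computation, using hypothetical scores (the study's raw 13-item rating data are not published in this abstract), might look like:

```python
# Sketch of a paired mean-difference analysis like the one reported above.
# The scores below are simulated placeholders, NOT the study's data.
import math
import random
from statistics import mean, stdev

random.seed(0)
n = 24  # matched summary pairs, as in the study

# Hypothetical 0-13 tool scores, one per summary, matched by source paper.
llm_scores = [random.gauss(12.1, 1.0) for _ in range(n)]
poem_scores = [random.gauss(10.6, 1.0) for _ in range(n)]

# Paired differences: LLM minus POEM for each matched paper.
diffs = [l - p for l, p in zip(llm_scores, poem_scores)]
d_mean = mean(diffs)
d_se = stdev(diffs) / math.sqrt(n)

# Normal-approximation 95% CI; a t critical value (~2.07 for df=23)
# would give a slightly wider interval.
ci = (d_mean - 1.96 * d_se, d_mean + 1.96 * d_se)
print(f"mean difference {d_mean:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

Pairing by source paper matters here: it removes between-paper variability (some papers are simply easier to summarize), so the confidence interval reflects only the difference between summary types.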

CONCLUSIONS: An enterprise LLM, prompted in POEM style, produced accurate, low-error clinical summaries that matched or exceeded expert-edited POEMs and were generally preferred by reviewers, though further research is needed to assess broader applicability and impact. Findings support pragmatic LLM-assisted summarization and highlight the need for standardized evaluation tools and explicit prompts for clinical applicability.


© 2026 American Board of Family Medicine
