Using GPT-4o to interpret patient-reported outcomes without training

Atkinson, Thomas M; Petrov, Aleksandr; Lynch, Kathleen A; George, Login S; Cracchiolo, Jennifer R; Daly, Bobby; Fessele, Kristen L; Flory, James H; Mao, Jun J; Li, Yuelin

doi:10.1093/jnci/djaf016

Advances in large language models (LLMs) such as OpenAI’s Generative Pre-trained Transformer-4o (GPT-4o) have prompted interest in their use in oncology. LLMs distill vast amounts of text data such as postings in online forums, scientific articles, and educational materials.¹ As these methods evolve, there is potential for leveraging their capabilities to summarize patient-authored free text data into patient-reported outcomes (PROs).²,³ While conventional PROs use standardized questions to broadly assess treatment- and disease-related domains in clinical trials and routine care,⁴ free text data are ideal for capturing patients’ nuanced experiences “in their own words”.⁵

However, LLMs cannot yet be applied to coding PROs without supporting evidence. As a proof of concept, we therefore sought to develop methodological procedures for unsupervised learning that automatically interpret patient-generated free text phrases without pre-training by human coders (ie, “zero-shot learning [ZSL]”), and to demonstrate that LLM performance is similar to traditional human qualitative analysis.

We used archived interview transcripts (N = 19) from 9 patients, 2 caregivers, and 7 physicians (891 total participant utterances) from a study of cancer patients’ prognostic awareness that were previously coded⁶ by 2 humans into 5 themes (Figure 1). The GPT-4o model licensed to our institution comprises over 1 trillion parameters, is pre-trained by OpenAI, and is provided in Microsoft Azure through a secure enterprise cloud-based offering. The ZSL procedure instructed GPT-4o to match transcript entries to these 5 themes de novo, relying solely on semantic content. We experimented with prompt engineering to identify optimal queries: (1) a “base” prompt included basic descriptions of each theme; (2) a “comprehensive” prompt had very detailed descriptions; and (3) “hybrid” prompts featured mixed descriptions. Performance was evaluated by precision, recall, F1-score, and Kappa reliability using human analysts as the ground truth. Additionally, duplicated utterances (112 with >90% similarity) were used to assess internal consistency in human analysts and GPT-4o.
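As an illustration only, the zero-shot setup described above could be sketched as follows. This is not the authors' prompt or code: the theme labels and descriptions are placeholders (the 5 themes are not listed in the text), the function names are hypothetical, and the call to the model itself is deliberately left out.

```python
from typing import Dict, Optional

# Hypothetical placeholders: the source does not list the 5 themes or their wording.
THEMES: Dict[str, str] = {
    "theme_1": "Placeholder description of the first theme.",
    "theme_2": "Placeholder description of the second theme.",
    "theme_3": "Placeholder description of the third theme.",
    "theme_4": "Placeholder description of the fourth theme.",
    "theme_5": "Placeholder description of the fifth theme.",
}


def build_zero_shot_prompt(utterance: str, themes: Dict[str, str]) -> str:
    """Assemble a prompt that gives theme descriptions but no labeled examples."""
    lines = [
        "Classify the utterance into exactly one of the themes below.",
        "Answer with the theme label only.",
        "",
        "Themes:",
    ]
    for label, description in themes.items():
        lines.append(f"- {label}: {description}")
    lines += ["", f'Utterance: "{utterance}"', "Theme:"]
    return "\n".join(lines)


def parse_theme(reply: str, themes: Dict[str, str]) -> Optional[str]:
    """Map a free-text model reply to a known label, or None if it is ambiguous."""
    cleaned = reply.strip().lower()
    for label in themes:
        if cleaned == label.lower():
            return label
    # Fall back to accepting the reply only if exactly one label is mentioned.
    mentioned = [label for label in themes if label.lower() in cleaned]
    return mentioned[0] if len(mentioned) == 1 else None
```

In this sketch, a “base”, “comprehensive”, or “hybrid” prompt would differ only in how much detail each description in `THEMES` carries; replies that `parse_theme` cannot resolve are returned as `None` so they can be reviewed rather than guessed.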

Figure 1. Confusion matrix for GPT-4o on the dataset.

GPT-4o yielded acceptable performance in text classification using verbatim utterances alone; the best overall performance was obtained with the “hybrid” prompt (56% precision, 54% recall, 55% F1-score, Κ = 0.43). The “comprehensive” prompt was marginally lower on F1-score and Kappa despite slightly higher precision (57% precision, 54% recall, 54% F1-score, Κ = 0.42), and the “base” prompt fared worse (55% precision, 46% recall, 45% F1-score, Κ = 0.32). These metrics indicate better-than-chance classification (20% for 5 themes). However, longer prompts with detailed explanations and examples did not necessarily lead to the best performance. Encouragingly, internal consistency in GPT-4o nearly matched that of human coders (ie, Κ = 0.43 and 0.44).
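For readers less familiar with these metrics, the following is a minimal sketch of how precision, recall, F1-score, and Cohen's Kappa can be computed from paired human and model labels. It assumes macro-averaging across themes, which the text does not specify, and the function names are illustrative rather than taken from the study.

```python
from collections import Counter
from typing import Dict, Sequence


def macro_prf(human: Sequence[str], model: Sequence[str]) -> Dict[str, float]:
    """Macro-averaged precision, recall, and F1, treating human labels as ground truth."""
    labels = sorted(set(human) | set(model))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(h == label and m == label for h, m in zip(human, model))
        fp = sum(h != label and m == label for h, m in zip(human, model))
        fn = sum(h == label and m != label for h, m in zip(human, model))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


def cohen_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's Kappa: observed agreement corrected for agreement expected by chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa is useful alongside raw agreement because it subtracts the agreement two raters would reach by chance given how often each uses every label, so a Kappa above 0 indicates agreement beyond chance.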

As a necessary initial step for generalizing LLMs broadly to code PRO domains without pre-training, we have presented promising evidence that an unsupervised LLM can code patient-generated free text data. Ultimately, if a patient undergoing chemotherapy communicates with their doctor via a portal message and describes “I felt sick to my stomach…,” a further validated LLM may interpret this as signs of “mild nausea/vomiting” in real time and prompt clinicians for immediate action (eg, prescribing antiemetics) to minimize treatment-related adverse events and risk of chemotherapy interruptions. However, caution is advised, as current sub-optimal performance suggests that ZSL applications might be more suitable for initial screening, leaving uncertain classifications for human coders to detect nuanced meaning.

Acknowledgements

Role of the funder—support of investigator time and effort in relation to the design of the study, data collection, analysis, interpretation of the data, the writing of the correspondence, and the decision to submit the correspondence for publication.

Author contributions

Thomas Michael Atkinson, PhD (Conceptualization; Data curation; Formal analysis; Funding acquisition; Investigation; Methodology; Project administration; Resources; Software; Supervision; Validation; Visualization; Writing—original draft; Writing—review & editing), Aleksandr Petrov, MS (Conceptualization; Data curation; Formal analysis; Methodology; Project administration; Software; Validation; Visualization; Writing—original draft; Writing—review & editing), Kathleen A. Lynch, MPH (Conceptualization; Data curation; Formal analysis; Methodology; Validation; Writing—original draft; Writing—review & editing), Login S. George, PhD (Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Project administration; Validation; Writing—original draft; Writing—review & editing), Jennifer R. Cracchiolo, MD (Conceptualization; Formal analysis; Investigation; Methodology; Validation; Writing—original draft; Writing—review & editing), Bobby Daly, MD (Conceptualization; Formal analysis; Investigation; Methodology; Validation; Writing—original draft; Writing—review & editing), Kristen L. Fessele, PhD, RN (Conceptualization; Formal analysis; Investigation; Methodology; Validation; Writing—original draft; Writing—review & editing), James H. Flory, MD (Conceptualization; Investigation; Methodology; Validation; Writing—original draft; Writing—review & editing), Jun J. Mao, MD (Conceptualization; Formal analysis; Investigation; Methodology; Validation; Writing—original draft; Writing—review & editing), Yuelin Li, PhD (Conceptualization; Data curation; Formal analysis; Funding acquisition; Investigation; Methodology; Project administration; Resources; Software; Supervision; Validation; Visualization; Writing—original draft; Writing—review & editing).

Funding

This research was funded in part through a National Institutes of Health/National Cancer Institute Cancer Center Support Grant P30 CA008748-58, which provides partial support for the Patient-Reported Outcomes, Community Engagement, and Language Core Facility used in this investigation. The content of this research is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health.

Conflicts of interest

The authors have no conflicts of interest to disclose.

Data availability

De-identified data can be provided upon request to the corresponding author.

References

1. Bitterman DS, Downing A, Maues J, Lustberg M. Promise and perils of large language models for cancer survivorship and supportive care. J Clin Oncol. 2024;42:1607-1611.

2. Chen Z, Wang Q, Sun Y, et al. Chat-ePRO: development and pilot study of an electronic patient-reported outcomes system based on ChatGPT. J Biomed Inform. 2024;154:104651.

3. Lam MS, Teoh J, Landay JA, et al. Concept induction: analyzing unstructured text with high-level concepts using LLooM. In: Proceedings of the CHI Conference on Human Factors in Computing Systems, 2024:1-28.

4. Basch E, Schrag D, Henson S, et al. Effect of electronic symptom monitoring on patient-reported outcomes among patients with metastatic cancer: a randomized clinical trial. JAMA. 2022;327:2413-2422.

5. Chung AE, Shoenbill K, Mitchell SA, et al. Patient free text reporting of symptomatic adverse events in cancer clinical research using the National Cancer Institute's Patient-Reported Outcomes version of the Common Terminology Criteria for Adverse Events (PRO-CTCAE). J Am Med Inform Assoc. 2019;26:276-285.

6. George LS, Polacek LC, Lynch K, et al. Reconciling the prospect of disease progression with goals and expectations: development and validation of a measurement model in advanced cancer. Psychooncology. 2022;31:902-910.

© The Author(s) 2025. Published by Oxford University Press. All rights reserved. For commercial re-use, please contact [email protected] for reprints and translation rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site—for further information please contact [email protected].

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/pages/standard-publication-reuse-rights)
