Characteristics and primary outcomes of studies that evaluated the medical accuracy of LLM-based chatbots as a health information resource.
Study ID | Clinical Domain | Clinical Application | Baseline Model | Baseline or Fine-tuned LLM | Zero-shot or Prompt Engineered | LLM Main Outcomes |
---|---|---|---|---|---|---|
Ostrowska 202456 | Head and Neck | Health Information | ChatGPT-3.5; ChatGPT-4.0; Google Bard | Baseline | Zero-shot | ChatGPT-3.5 scored highest in safety and Global Quality Score; ChatGPT-4.0 and Bard had lower mean safety scores |
Szczesniewski 202457 | Genitourinary | Health Information | ChatGPT; Google Bard; Copilot | Baseline | Zero-shot | Each chatbot explained the pathologies, detailed risk factors, and described treatments well; Quality and appropriateness of the information varied |
Iannantuono 20247 | Pan-cancer | Health Information | ChatGPT-3.5; ChatGPT-4.0; Google Bard | Baseline | Zero-shot | ChatGPT-4 and ChatGPT-3.5 had higher rates of reproducible, accurate, relevant, and readable responses compared to Google Bard |
Lee 202358 | Head and Neck | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | ChatGPT-generated pre-surgical information performed similarly to publicly available websites; ChatGPT was preferred 48% of the time by H&N surgeons |
Coskun 202359 | Genitourinary | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | ChatGPT responded to all queries; Calculated metrics indicated a need for improvement in its performance |
Lum 202460 | Orthopedic | Health Information | ChatGPT-3.5; Bard | Baseline | Zero-shot | Bard answered more questions correctly than ChatGPT; ChatGPT performed better in sports medicine; Bard performed better in basic science |
Dennstadt 202461 | Pan-cancer | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | 94.3% of MC answers were considered “valid”; 48% of open-ended answers were considered “acceptable,” “good,” or “very good” |
O’Hagan 202362 | Skin | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | Models were 100% appropriate for a patient-facing portal; EHR responses were appropriate 85%-100% of the time, depending on question type |
Kuscu 202363 | Head and Neck | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | 84.6% “comprehensive/correct,” 11% “incomplete/partially correct,” and 2.6% “mixed accurate and inaccurate/outdated” responses |
Deng 202464 | Breast | Health Information | ChatGPT-3.5; ChatGPT-4.0; Claude2 | Baseline | Zero-shot | ChatGPT-4.0 outperformed ChatGPT-3.5 and Claude2 in average quality, relevance, and applicability; ChatGPT-4.0 scored higher than Claude2 in support and decision-making |
Huang 202465 | Central Nervous System | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | ChatGPT-4.0 responses were consistent with guidelines; Responses occasionally missed “red flag” symptoms; 50% of citations were deemed valid |
Li 202466 | Cardio-oncology | Health Information | ChatGPT-3.5; ChatGPT-4.0; Google Bard; Meta Llama 2; Anthropic Claude 2 | Baseline | Zero-shot | ChatGPT-4.0 performed best in producing appropriate responses; All 5 LLMs underperformed in the treatment and prevention domain |
Hanai 202467 | Genitourinary | Health Information | ChatGPT-3.5 | Baseline | Prompt Engineered | Both bots recommended non-pharmacological interventions significantly more than pharmacological interventions |
Valentini 202468 | Orthopedic | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | The median score for ChatGPT’s answers was 18.3; 6 answers were very good, 9 were good, 5 were poor, and 5 were very poor |
Yalamanchili 202469 | Pan-cancer | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | LLM performed the same or better than expert answers in 94% of cases for correctness, 77% for completeness, and 91% for conciseness |
Janopaul-Naylor 202470 | Pan-cancer | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | Average scores: 3.9 (ChatGPT), 3.2 (Bing); DISCERN scores: 4.1 (ChatGPT), 4.4 (Bing) |
Musheyev 202471 | Genitourinary | Health Information | ChatGPT-3.5; Chat Sonic; Microsoft Bing AI | Baseline | Zero-shot | AI chatbot responses had moderate to high information quality; Understandability was moderate; Actionability was moderate to poor; Readability was low |
Hermann 202372 | Gynecologic | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | Score of 1 for 34 questions (best); Score of 2 for 19 questions; Score of 3 for 10 questions; Score of 4 for 1 question |
Davis 202373 | Genitourinary | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | 14/18 were appropriate; Mean Flesch Reading Ease score: 35.5 (SD = 10.2) |
Szczesniewski 202374 | Genitourinary | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | Treatment Modalities: 3/5; Average for Pathologies: 4/5; BPH: 3/5 |
Walker 202375 | Pan-cancer | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | Median EQIP score: 16 (IQR 14.5-18)/36; Agreement between guideline and LLM responses: 60% (15/25) |
Patel 202376 | Pan-cancer | Health Information | ChatGPT-3.5; Curie; Babbage; Ada; Davinci | Baseline | Zero-shot | ChatGPT: 96%; Davinci: 72%; Curie: 32%; Babbage: 6%; Ada: 2% |
Rahsepar 202377 | Lung | Health Information | ChatGPT-3.5; Google Bard | Baseline | Zero-shot | ChatGPT: 70.8% correct, 11.7% partially correct, 17.5% incorrect; Bard: 19.2% no answer, 51.7% correct, 9.2% partially correct, 20% incorrect |
Kassab 202378 | Pan-cancer | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | Responses were accurate 86% of the time; 14% were inaccurate; 0% harmful |
Gortz 2023 | Genitourinary | Health Information | SAP Conversational AI | Fine-tuned | Zero-shot | 78% of users did not need assistance during usage; 89% experienced an increase in knowledge about prostate cancer; 100% of users would like to reuse a medical chatbot in clinical settings |
Koroglu 2023 | Thyroid | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | Questions: Rater 1: 6.47 ± 0.50; Rater 2: 6.18 ± 0.96; Cases: Rater 1: largely correct, safe, and usable; Rater 2: partially or mostly correct, safe, and usable |
Huang 202379 | Pan-cancer | Health Information | ChatGPT-4.0 | Baseline | Prompt Engineered | Test: GPT-3.5 and GPT-4 scored 62.05% and 78.77%, respectively; Cases: GPT-4 suggested a personalized treatment for each case with high correctness and comprehensiveness |
Holmes 202380 | Pan-cancer | Health Information | ChatGPT-3.5; ChatGPT-4.0; Google Bard; BLOOMZ | Baseline | Prompt Engineered | ChatGPT-4.0 outperformed all other LLMs and medical physicists on average; Accuracy improved when prompted to explain before answering |
Yeo 202381 | Liver | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | Cirrhosis question accuracy: 79.1%; HCC question accuracy: 74% |