Characteristics and primary outcomes of studies that evaluated the medical accuracy of LLM-based chatbots as a health information resource.
Study ID | Clinical Domain | Clinical Application | Baseline Model | Baseline or Fine-tuned LLM | Zero-shot or Prompt Engineered | LLM Main Outcomes |
---|---|---|---|---|---|---|
Ostrowska 202456 | Head and Neck | Health Information | ChatGPT-3.5; ChatGPT-4.0; Google Bard | Baseline | Zero-shot | ChatGPT-3.5 scored highest in safety and Global Quality Score; ChatGPT-4.0 and Bard had lower mean safety scores |
Szczesniewski 202457 | Genitourinary | Health Information | ChatGPT; Google Bard; Copilot | Baseline | Zero-shot | Each chatbot explained the pathologies, detailed risk factors, and described treatments well; Quality and appropriateness of the information varied |
Iannantuono 20247 | Pan-cancer | Health Information | ChatGPT-3.5; ChatGPT-4.0; Google Bard | Baseline | Zero-shot | ChatGPT-4 and ChatGPT-3.5 had higher rates of reproducible, accurate, relevant, and readable responses compared to Google Bard |
Lee 202358 | Head and Neck | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | ChatGPT-generated pre-surgical information performed similarly to publicly available websites; ChatGPT was preferred 48% of the time by H&N surgeons |
Coskun 202359 | Genitourinary | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | ChatGPT responded to all queries; Calculated metrics indicated a need for improvement in its performance |
Lum 202460 | Orthopedic | Health Information | ChatGPT-3.5; Bard | Baseline | Zero-shot | Bard answered more questions correctly than ChatGPT; ChatGPT performed better in sports medicine; Bard performed better in basic science |
Dennstadt 202461 | Pan-cancer | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | 94.3% of MC answers were considered “valid”; 48% of open-ended answers were considered “acceptable,” “good,” or “very good” |
O’Hagan 202362 | Skin | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | Models were 100% appropriate for a patient-facing portal; EHR responses were appropriate 85%-100% of the time, depending on question type |
Kuscu 202363 | Head and Neck | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | 84.6% “comprehensive/correct,” 11% “incomplete/partially correct,” and 2.6% “mixed accurate and inaccurate/outdated” responses |
Deng 202464 | Breast | Health Information | ChatGPT-3.5; ChatGPT-4.0; Claude2 | Baseline | Zero-shot | ChatGPT-4.0 outperformed ChatGPT-3.5 and Claude2 in average quality, relevance, and applicability; ChatGPT-4.0 scored higher than Claude2 in support and decision-making |
Huang 202465 | Central Nervous System | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | ChatGPT-4.0 responses were consistent with guidelines; Responses occasionally missed “red flag” symptoms; 50% of citations were deemed valid |
Li 202466 | Cardio-oncology | Health Information | ChatGPT-3.5; ChatGPT-4.0; Google Bard; Meta Llama 2; Anthropic Claude 2 | Baseline | Zero-shot | ChatGPT-4.0 performed best in producing appropriate responses; All 5 LLMs underperformed in the treatment and prevention domain |
Hanai 202467 | Genitourinary | Health Information | ChatGPT-3.5 | Baseline | Prompt Engineered | Both bots recommended non-pharmacological interventions significantly more than pharmacological interventions |
Valentini 202468 | Orthopedic | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | The median score for ChatGPT’s answers was 18.3; 6 answers were very good, 9 were good, 5 were poor, and 5 were very poor |
Yalamanchili 202469 | Pan-cancer | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | LLM performed the same or better than expert answers in 94% of cases for correctness, 77% for completeness, and 91% for conciseness |
Janopaul-Naylor 202470 | Pan-cancer | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | Average scores: 3.9 (ChatGPT), 3.2 (Bing); DISCERN scores: 4.1 (ChatGPT), 4.4 (Bing) |
Musheyev 202471 | Genitourinary | Health Information | ChatGPT-3.5; Chat Sonic; Microsoft Bing AI | Baseline | Zero-shot | AI chatbot responses had moderate to high information quality; Understandability was moderate; Actionability was moderate to poor; Readability was low |
Hermann 202372 | Gynecologic | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | Score of 1 for 34 questions (best); Score of 2 for 19 questions; Score of 3 for 10 questions; Score of 4 for 1 question |
Davis 202373 | Genitourinary | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | 14/18 were appropriate; Mean Flesch Reading Ease score: 35.5 (SD = 10.2) |
Szczesniewski 202374 | Genitourinary | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | Treatment Modalities: 3/5; Average for Pathologies: 4/5; BPH: 3/5 |
Walker 202375 | Pan-cancer | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | Median EQIP score: 16 (IQR 14.5-18)/36; Agreement between guideline and LLM responses: 60% (15/25) |
Patel 202376 | Pan-cancer | Health Information | ChatGPT-3.5; Curie; Babbage; Ada; Davinci | Baseline | Zero-shot | ChatGPT: 96%; Davinci: 72%; Curie: 32%; Babbage: 6%; Ada: 2% |
Rahsepar 202377 | Lung | Health Information | ChatGPT-3.5; Google Bard | Baseline | Zero-shot | ChatGPT: 70.8% correct, 11.7% partially correct, 17.5% incorrect; Bard: 19.2% no answer, 51.7% correct, 9.2% partially correct, 20% incorrect |
Kassab 202378 | Pan-cancer | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | Responses were accurate 86% of the time; 14% were inaccurate; 0% harmful |
Gortz 2023 | Genitourinary | Health Information | SAP Conversational AI | Fine-tuned | Zero-shot | 78% of users did not need assistance during usage; 89% experienced an increase in knowledge about prostate cancer; 100% of users would like to reuse a medical chatbot in clinical settings |
Koroglu 2023 | Thyroid | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | Questions: Rater 1: 6.47 ± 0.50; Rater 2: 6.18 ± 0.96; Cases: Rater 1: largely correct, safe, and usable; Rater 2: partially or mostly correct, safe, and usable |
Huang 202379 | Pan-cancer | Health Information | ChatGPT-4.0 | Baseline | Prompt Engineered | Test: GPT-3.5 and GPT-4 scored 62.05% and 78.77%, respectively; Cases: GPT-4 suggested a personalized treatment for each case with high correctness and comprehensiveness |
Holmes 202380 | Pan-cancer | Health Information | ChatGPT-3.5; ChatGPT-4.0; Google Bard; BLOOMZ | Baseline | Prompt Engineered | ChatGPT-4.0 outperformed all other LLMs and medical physicists on average; Accuracy improved when prompted to explain before answering |
Yeo 202381 | Liver | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | Cirrhosis question accuracy: 79.1%; HCC question accuracy: 74% |