Table 2.

Characteristics and primary outcomes of studies that evaluated the medical accuracy of LLM-based chatbots as a health information resource.

| Study ID | Clinical Domain | Clinical Application | Baseline Model | Baseline or Fine-tuned LLM | Zero-shot or Prompt Engineered | LLM Main Outcomes |
| --- | --- | --- | --- | --- | --- | --- |
| Ostrowska 2024 [56] | Head and Neck | Health Information | ChatGPT-3.5; ChatGPT-4; Google Bard | Baseline | Zero-shot | ChatGPT-3.5 scored highest in safety and Global Quality Score; ChatGPT-4.0 and Bard had lower mean safety scores |
| Szczesniewski 2024 [57] | Genitourinary | Health Information | ChatGPT; Google Bard; Copilot | Baseline | Zero-shot | Each chatbot explained the pathologies, detailed risk factors, and described treatments well; quality and appropriateness of the information varied |
| Iannantuono 2024 [7] | Pan-cancer | Health Information | ChatGPT-3.5; ChatGPT-4.0; Google Bard | Baseline | Zero-shot | ChatGPT-4 and ChatGPT-3.5 had higher rates of reproducible, accurate, relevant, and readable responses than Google Bard |
| Lee 2023 [58] | Head and Neck | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | ChatGPT-generated pre-surgical information performed similarly to publicly available websites; ChatGPT was preferred 48% of the time by head and neck surgeons |
| Coskun 2023 [59] | Genitourinary | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | ChatGPT responded to all queries; calculated metrics indicated a need for improvement in its performance |
| Lum 2024 [60] | Orthopedic | Health Information | ChatGPT-3.5; Bard | Baseline | Zero-shot | Bard answered more questions correctly than ChatGPT; ChatGPT performed better in sports medicine and basic science; Bard performed better in basic science |
| Dennstadt 2024 [61] | Pan-cancer | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | 94.3% of multiple-choice answers were considered "valid"; 48% of open-ended answers were considered "acceptable," "good," or "very good" |
| O'Hagan 2023 [62] | Skin | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | Responses were 100% appropriate for a patient-facing portal; EHR responses were appropriate 85%-100% of the time, depending on question type |
| Kuscu 2023 [63] | Head and Neck | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | 84.6% "comprehensive/correct," 11% "incomplete/partially correct," and 2.6% "incomplete/partially correct" responses |
| Deng 2024 [64] | Breast | Health Information | ChatGPT-3.5; ChatGPT-4.0; Claude 2 | Baseline | Zero-shot | ChatGPT-4.0 outperformed ChatGPT-3.5 and Claude 2 in average quality, relevance, and applicability; ChatGPT-4.0 scored higher than Claude 2 in support and decision-making |
| Huang 2024 [65] | Central Nervous System | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | ChatGPT-4.0 responses were consistent with guidelines; responses occasionally missed "red flag" symptoms; 50% of citations were deemed valid |
| Li 2024 [66] | Cardio-oncology | Health Information | ChatGPT-3.5; ChatGPT-4.0; Google Bard; Meta Llama 2; Anthropic Claude 2 | Baseline | Zero-shot | ChatGPT-4.0 performed best in producing appropriate responses; all 5 LLMs underperformed in the treatment and prevention domain |
| Hanai 2024 [67] | Genitourinary | Health Information | ChatGPT-3.5 | Baseline | Prompt Engineered | Both bots recommended non-pharmacological interventions significantly more often than pharmacological interventions |
| Valentini 2024 [68] | Orthopedic | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | The median score for ChatGPT's answers was 18.3; 6 answers were very good, 9 were good, 5 were poor, and 5 were very poor |
| Yalamanchili 2024 [69] | Pan-cancer | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | The LLM performed the same as or better than expert answers in 94% of cases for correctness, 77% for completeness, and 91% for conciseness |
| Janopaul-Naylor 2024 [70] | Pan-cancer | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | Average scores: 3.9 (ChatGPT), 3.2 (Bing); DISCERN scores: 4.1 (ChatGPT), 4.4 (Bing) |
| Musheyev 2024 [71] | Genitourinary | Health Information | ChatGPT-3.5; Chat Sonic; Microsoft Bing AI | Baseline | Zero-shot | AI chatbot responses had moderate to high information quality; understandability was moderate; actionability was moderate to poor; readability was low |
| Hermann 2023 [72] | Gynecologic | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | Score of 1 (best) for 34 questions; score of 2 for 19 questions; score of 3 for 10 questions; score of 4 for 1 question |
| Davis 2023 [73] | Genitourinary | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | 14/18 responses were appropriate; mean Flesch Reading Ease score: 35.5 (SD = 10.2) |
| Szczesniewski 2023 [74] | Genitourinary | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | Treatment modalities: 3/5; average for pathologies: 4/5; BPH: 3/5 |
| Walker 2023 [75] | Pan-cancer | Health Information | ChatGPT-4.0 | Baseline | Zero-shot | Median EQIP score: 16/36 (IQR 14.5-18); agreement between guideline and LLM responses: 60% (15/25) |
| Patel 2023 [76] | Pan-cancer | Health Information | ChatGPT-3.5; Curie; Babbage; Ada; Davinci | Baseline | Zero-shot | ChatGPT: 96%; Davinci: 72%; Curie: 32%; Babbage: 6%; Ada: 2% |
| Rahsepar 2023 [77] | Lung | Health Information | ChatGPT-3.5; Google Bard | Baseline | Zero-shot | ChatGPT: 70.8% correct, 11.7% partially correct, 17.5% incorrect; Bard: 19.2% no answer, 51.7% correct, 9.2% partially correct, 20% incorrect |
| Kassab 2023 [78] | Pan-cancer | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | Responses were accurate 86% of the time; 14% were inaccurate; 0% were harmful |
| Gortz 2023 | Genitourinary | Health Information | SAP Conversational AI | Fine-tuned | Zero-shot | 78% of users did not need assistance during usage; 89% experienced an increase in knowledge about prostate cancer; 100% of users would like to reuse a medical chatbot in clinical settings |
| Koroglu 2023 | Thyroid | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | Questions: rater 1: 6.47 ± 0.50; rater 2: 6.18 ± 0.96; cases: rater 1: largely correct, safe, and usable; rater 2: partially or mostly correct, safe, and usable |
| Huang 2023 [79] | Pan-cancer | Health Information | ChatGPT-4.0 | Baseline | Prompt Engineered | Test: GPT-3.5 and GPT-4 scored 62.05% and 78.77%, respectively; cases: GPT-4 suggested a personalized treatment for each case with high correctness and comprehensiveness |
| Holmes 2023 [80] | Pan-cancer | Health Information | ChatGPT-3.5; ChatGPT-4.0; Google Bard; BLOOMZ | Baseline | Prompt Engineered | ChatGPT-4.0 outperformed all other LLMs and medical physicists on average; accuracy improved when prompted to explain before answering |
| Yeo 2023 [81] | Liver | Health Information | ChatGPT-3.5 | Baseline | Zero-shot | Cirrhosis question accuracy: 79.1%; hepatocellular carcinoma (HCC) question accuracy: 74% |
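Several of the studies above report readability using the Flesch Reading Ease score (e.g., the mean of 35.5 in Davis 2023 [73]). As a point of reference only, the minimal sketch below shows how that score is conventionally computed from word, sentence, and syllable counts; the naive syllable heuristic and the sample sentence are illustrative assumptions, not taken from any of the cited studies, which used their own tooling.

```python
import re

def count_syllables(word: str) -> int:
    # Naive vowel-group heuristic; published readability tools use dictionaries
    # or more careful rules, so treat this as an approximation.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

# Higher scores mean easier reading; scores in the 30-50 range (such as the
# 35.5 reported by Davis 2023) correspond to difficult, college-level text.
print(round(flesch_reading_ease("Prostate cancer screening uses a blood test. Ask your doctor."), 1))
```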