Table 1.

Characteristics and primary outcomes of studies that evaluated LLM-based chatbot medical accuracy as a clinical decision support tool.

| Study ID | Clinical domain | Clinical application | Baseline model | Baseline or fine-tuned LLM | Zero-shot or prompt engineered | LLM main outcomes |
| --- | --- | --- | --- | --- | --- | --- |
| Shiraishi 2023 [26] | Skin | Diagnosis | ChatGPT-4.0; Bard; Bing | Baseline | Zero-shot | Bing had an average accuracy of 58%; ChatGPT-4 and Bard refused to answer the clinical queries with images |
| Cirone 2024 [27] | Skin | Diagnosis | ChatGPT-4.0V; LLaVA | Baseline | Zero-shot | GPT-4V outperformed LLaVA in all examined areas; GPT-4V provided thorough descriptions of relevant ABCDE features |
| Yang 2024 [28] | Orthopedic | Diagnosis | ChatGPT-3.5 | Baseline | Prompt engineered | Before few-shot learning, ChatGPT's accuracy, sensitivity, and specificity were 0.73, 0.95, and 0.58, respectively; after few-shot learning, these metrics approached physician-level performance |
| Cozzi 2024 [29] | Breast | Diagnosis | GPT-3.5; GPT-4.0; Bard | Baseline | Zero-shot | Agreement between human readers was almost perfect; agreement between human readers and LLMs was moderate |
| Shifai 2024 [30] | Skin | Diagnosis | ChatGPT-4.0V | Baseline | Zero-shot | ChatGPT Vision had a sensitivity of 32%, specificity of 40%, and overall diagnostic accuracy of 36% |
| Sievert 2024 [31] | Head and Neck | Diagnosis | ChatGPT-4.0V | Baseline | Prompt engineered | ChatGPT-4.0V scored lower than experts, achieving an accuracy of 71.2% with almost perfect intra-rater agreement (κ = 0.837) |
| Sievert 2024 [31] | Thyroid | Diagnosis | ChatGPT-4.0 | Baseline | Zero-shot | ChatGPT achieved a sensitivity of 86.7%, specificity of 10.7%, and accuracy of 68% when distinguishing between low-risk and high-risk categories |
| Wu 2024 [32] | Thyroid | Diagnosis | ChatGPT-3.5; ChatGPT-4.0; Google Bard | Baseline | Zero-shot | ChatGPT-4.0 and Bard displayed higher intra-LLM agreement than ChatGPT-3.5; ChatGPT-4.0 had higher accuracy, sensitivity, and AUC than Bard |
| Ma 2024 [33] | Esophagus | Diagnosis | ChatGPT-3.5 | Baseline | Prompt engineered | Accuracies were 0.925 to 1.0; precisions were 0.934 to 0.969; recalls were 0.925 to 1.0; F1-scores were 0.928 to 0.957 |
| Rundle 2024 [34] | Skin | Diagnosis | ChatGPT-4.0 | Baseline | Zero-shot | Overall diagnostic accuracy was 66.7%; 58.8% for malignant neoplasms and 69.6% for benign neoplasms |
| Horiuchi 2024 [35] | Central Nervous System | Diagnosis | ChatGPT-4.0 | Baseline | Prompt engineered | ChatGPT's diagnostic accuracy was 50% for the final diagnosis; there were no significant differences in accuracy rates among anatomical locations |
| Tariq 2022 [36] | Liver | Diagnosis | BERT | Baseline | Zero-shot | Malignant classification: 34%; benign classification: 98% |
| Wang 2024 [37] | Thyroid | Diagnosis | ChatGPT-4.0V | Baseline | Prompt engineered | GPT-4 demonstrated proficiency in report structuring, professional terminology, and clarity of expression; GPT-4 showed limitations in diagnostic accuracy |
| Patel 2024 [38] | Gynecologic | Genetic Counseling | ChatGPT-3.5 | Baseline | Zero-shot | 82.5% "correct and comprehensive," 15% "correct but not comprehensive," and 2.5% "partially incorrect" responses |
| Erdat 2024 [39] | Pan-cancer | Management | ChatGPT-3.5; Bing | Baseline | Zero-shot | ChatGPT produced correct protocols in 5 of 9 cases; Bing produced correct protocols in 4 of 9 cases |
| Moll 2024 [40] | Breast | Management | ChatGPT-3.5 | Baseline | Zero-shot | ChatGPT's answers were mostly correct, but some contained inaccuracies; patients expressed a preference for the presence of a physician |
| Gamble 2024 [41] | Lung | Management | ChatGPT-3.5; ChatGPT-4.0 | Baseline (GPT-3.5 and GPT-4.0); fine-tuned (GPT-3.5) | Zero-shot | ChatGPT-3.5's accuracy was 0.058 without guidelines and 0.42 with guidelines; ChatGPT-4's accuracy was 0.15 without guidelines and 0.66 with guidelines |
| Guo 2024 [42] | Central Nervous System | Management | neuroGPT-X | Baseline; fine-tuned | Zero-shot; prompt engineered | LLMs were rated similarly to experts for accuracy, coherence, relevance, thoroughness, and performance; LLMs responded faster than experts |
| Marchi 2024 [43] | Head and Neck | Management | ChatGPT-3.5 | Baseline | Zero-shot | ChatGPT performed better for primary treatment than for adjuvant treatment or follow-up recommendations |
| Choo 2024 [44] | Colorectal | Management | ChatGPT-3.5 | Baseline | Zero-shot | ChatGPT adhered to oncological principles in all cases; the concordance rate between ChatGPT and the MDT board was 86.7% |
| Kuk 2024 [45] | Geriatric Oncology | Management | ChatGPT-4.0 | Baseline | Zero-shot | ChatGPT-4 identified medication-related side effects and suggested appropriate medications; GPT-4 was unable to suggest initial dosages of medications |
| Sorin 2023 [46] | Breast | Management | ChatGPT-3.5 | Baseline | Zero-shot | 70% of ChatGPT's recommendations were comparable to recommendations by the tumor board |
| Schulte 2023 [47] | Pan-cancer | Management | ChatGPT-3.5 | Baseline | Zero-shot | Overall VTQ: 0.77 |
| Griewing 2023 [48] | Breast | Management | ChatGPT-3.5 | Baseline | Prompt engineered | Concordance: 58.8% |
| Chiarelli 2024 [49] | Genitourinary | Screening | ChatGPT-3.5; ChatGPT-4.0 | Baseline | Prompt engineered | ChatGPT-4 performed better than ChatGPT-3.5 for accuracy, clarity, and conciseness; ChatGPT-4 exhibited high readability |
| Pereyra 2024 [50] | Colorectal | Screening | ChatGPT-3.5 | Baseline | Zero-shot | ChatGPT scored significantly lower than the physician groups, with a mean of 4.5/10 correct answers |
| Nguyen 2023 [51] | Pan-cancer | Screening | ChatGPT-4.0; Bard | Baseline | Zero-shot; prompt engineered | ChatGPT and Bard performed similarly on open-ended (OE) prompts; ChatGPT performed better in select-all-that-apply (SATA) scenarios; prompt engineering improved LLM outputs on OE prompts |
| Atarere 2024 [52] | Colorectal | Screening | ChatGPT-4.0; BingChat; YouChat | Baseline | Zero-shot | ChatGPT and YouChat provided RA responses more often than BingChat for both question sets |
| Rao 2023 [53] | Breast | Screening | ChatGPT-3.5; ChatGPT-4.0 | Baseline | Zero-shot | Breast cancer screening: free-text score 1.83/2, multiple-choice score 88.9%; breast pain: free-text score 1.125/2, multiple-choice score 58.3% |
| Ahmad 2023 [54] | Pan-cancer | Management | BioBERT; RoBERTa | Baseline | Zero-shot | The proposed system achieved high accuracy and F1-score in predicting cancer treatment from EHRs |
| Benary 2023 [55] | Pan-cancer | Management | ChatGPT-3.5; Galactica; Perplexity; BioMedLM | Baseline | Prompt engineered | LLM F1 scores were 0.04 (BioMedLM), 0.14 (Perplexity), 0.17 (ChatGPT), and 0.19 (Galactica); LLM-generated combined treatment options and clinical treatment options yielded median (IQR) scores of 7.5 (5.3-9.0) and 8.0 (7.5-9.5) points, respectively; manually generated options reached a median score of 2.0 (1.0-3.0) points |
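Several rows above report overlapping classification metrics (accuracy, sensitivity, specificity, precision, F1-score). As a reading aid only, the sketch below shows how these metrics are conventionally derived from a 2x2 confusion matrix; the counts are hypothetical, chosen so the outputs roughly reproduce the pre-few-shot values reported by Yang 2024 [28], and are not taken from any study's underlying data.

```python
# Illustrative only: standard binary-classification metrics as reported in Table 1.
# The counts used below are hypothetical, not data from any reviewed study.

def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive accuracy, sensitivity, specificity, precision, and F1 from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)                # recall for the positive (e.g., malignant) class
    specificity = tn / (tn + fp)                # true-negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # overall proportion correct
    precision = tp / (tp + fp)                  # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": round(accuracy, 3),
        "sensitivity": round(sensitivity, 3),
        "specificity": round(specificity, 3),
        "precision": round(precision, 3),
        "f1": round(f1, 3),
    }

# Hypothetical counts approximating Yang 2024's pre-few-shot results
# (accuracy 0.73, sensitivity 0.95, specificity 0.58).
print(diagnostic_metrics(tp=19, fp=11, tn=15, fn=1))
```

Accuracy alone can be misleading when error types are unbalanced; for example, the Sievert 2024 [31] thyroid row pairs 68% accuracy with a specificity of only 10.7%.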