Characteristics and primary outcomes of studies that evaluated the medical accuracy of LLM-based chatbots as clinical decision support tools.
Study ID | Clinical domain | Clinical application | Baseline model | Baseline or fine-tuned LLM | Zero-shot or prompt engineered | LLM main outcomes |
---|---|---|---|---|---|---|
Shiraishi 202326 | Skin | Diagnosis | ChatGPT-4.0; Bard; Bing | Baseline | Zero-shot | Bing had an average accuracy of 58%; ChatGPT-4 and Bard refused to answer image-based clinical queries |
Cirone 202427 | Skin | Diagnosis | ChatGPT-4.0V; LLaVA | Baseline | Zero-shot | GPT-4V outperformed LLaVA in all examined areas; GPT-4V provided thorough descriptions of relevant ABCDE features |
Yang 202428 | Orthopedic | Diagnosis | ChatGPT-3.5 | Baseline | Prompt Engineered | Before few-shot learning, ChatGPT’s accuracy, sensitivity, and specificity were 0.73, 0.95, and 0.58, respectively; After few-shot learning, these metrics approached physician-level performance |
Cozzi 202429 | Breast | Diagnosis | GPT-3.5; GPT-4.0; Bard | Baseline | Zero-shot | Agreement between human readers was almost perfect; Agreement between human readers and LLMs was moderate |
Shifai 202430 | Skin | Diagnosis | ChatGPT-4.0V | Baseline | Zero-shot | ChatGPT Vision had a sensitivity of 32%, specificity of 40%, and overall diagnostic accuracy of 36% |
Sievert 202431 | Head and Neck | Diagnosis | ChatGPT-4.0V | Baseline | Prompt Engineered | ChatGPT-4.0V scored lower than experts, achieving an accuracy of 71.2% with almost perfect intra-rater agreement (κ = 0.837) |
Sievert 202431 | Thyroid | Diagnosis | ChatGPT-4.0 | Baseline | Zero-shot | ChatGPT achieved a sensitivity of 86.7%, specificity of 10.7%, and accuracy of 68% when distinguishing between low-risk and high-risk categories |
Wu 202432 | Thyroid | Diagnosis | ChatGPT-3.5; ChatGPT-4.0; Google Bard | Baseline | Zero-shot | ChatGPT-4.0 and Bard displayed higher intra-LLM agreement than ChatGPT-3.5; ChatGPT-4.0 had higher accuracy, sensitivity, and AUC than Bard |
Ma 202433 | Esophagus | Diagnosis | ChatGPT-3.5 | Baseline | Prompt Engineered | Accuracy ranged from 0.925 to 1.0; precision from 0.934 to 0.969; recall from 0.925 to 1.0; F1-score from 0.928 to 0.957 |
Rundle 202434 | Skin | Diagnosis | ChatGPT-4.0 | Baseline | Zero-shot | Diagnostic accuracy was 66.7% overall; 58.8% for malignant neoplasms; 69.6% for benign neoplasms |
Horiuchi 202435 | Central Nervous System | Diagnosis | ChatGPT-4.0 | Baseline | Prompt Engineered | ChatGPT’s diagnostic accuracy was 50% for the final diagnosis; There were no significant differences in accuracy rates among anatomical locations |
Tariq 202236 | Liver | Diagnosis | BERT | Baseline | Zero-shot | Malignant classification: 34%; Benign classification: 98% |
Wang 202437 | Thyroid | Diagnosis | ChatGPT-4.0V | Baseline | Prompt Engineered | GPT-4 demonstrated proficiency in report structuring, professional terminology, and clarity of expression; GPT-4 showed limitations in diagnostic accuracy |
Patel 202438 | Gynecologic | Genetic Counseling | ChatGPT-3.5 | Baseline | Zero-shot | 82.5% “correct and comprehensive,” 15% “correct but not comprehensive,” 2.5% “partially incorrect” responses |
Erdat 202439 | Pan-cancer | Management | ChatGPT-3.5; Bing | Baseline | Zero-shot | ChatGPT provided correct protocols in 5 of 9 cases; Bing in 4 of 9 cases |
Moll 202440 | Breast | Management | ChatGPT-3.5 | Baseline | Zero-shot | ChatGPT’s answers were mostly correct but some contained inaccuracies; Patients expressed a preference for the presence of a physician |
Gamble 202441 | Lung | Management | ChatGPT-3.5; ChatGPT-4.0 | Baseline (GPT-3.5 and GPT-4.0); Fine-tuned (GPT-3.5) | Zero-shot | ChatGPT-3.5’s accuracy was 0.058 without guidelines and 0.42 with guidelines; ChatGPT-4 accuracy was 0.15 without guidelines and 0.66 with guidelines |
Guo 202442 | Central Nervous System | Management | neuroGPT-X | Baseline; Fine-tuned | Zero-shot; Prompt Engineered | LLM responses were rated similarly to expert responses for accuracy, coherence, relevance, thoroughness, and performance; LLMs responded faster than experts |
Marchi 202443 | Head and Neck | Management | ChatGPT-3.5 | Baseline | Zero-shot | ChatGPT performed better for primary treatment than for adjuvant treatment or follow-up recommendations |
Choo 202444 | Colorectal | Management | ChatGPT-3.5 | Baseline | Zero-shot | ChatGPT adhered to oncological principles in all cases; The concordance rate between ChatGPT and the multidisciplinary team (MDT) board was 86.7% |
Kuk 202445 | Geriatric Oncology | Management | ChatGPT-4.0 | Baseline | Zero-shot | ChatGPT-4 identified medication-related side effects and suggested appropriate medications; GPT-4 was unable to suggest initial dosages of medications |
Sorin 202346 | Breast | Management | ChatGPT-3.5 | Baseline | Zero-shot | 70% of ChatGPT’s recommendations were comparable to the recommendations of the tumor board |
Schulte 202347 | Pan-cancer | Management | ChatGPT-3.5 | Baseline | Zero-shot | Overall VTQ: 0.77 |
Griewing 202348 | Breast | Management | ChatGPT-3.5 | Baseline | Prompt Engineered | Concordance: 58.8% |
Chiarelli 202449 | Genitourinary | Screening | ChatGPT-3.5; ChatGPT-4 | Baseline | Prompt Engineered | ChatGPT-4 performed better than ChatGPT-3.5 for accuracy, clarity, and conciseness; ChatGPT-4 exhibited high readability |
Pereyra 202450 | Colorectal | Screening | ChatGPT-3.5 | Baseline | Zero-shot | ChatGPT scored significantly lower than the physician groups, with a mean of 4.5/10 correct answers |
Nguyen 202351 | Pan-cancer | Screening | ChatGPT-4.0; Bard | Baseline | Zero-shot; Prompt Engineered | ChatGPT and Bard performed similarly on open-ended (OE) prompts; ChatGPT performed better in select-all-that-apply (SATA) scenarios; Prompt engineering improved LLM outputs for OE prompts |
Atarere 202452 | Colorectal | Screening | ChatGPT-4.0; BingChat; YouChat | Baseline | Zero-shot | ChatGPT and YouChat provided RA responses more often than BingChat for both question sets |
Rao 202353 | Breast | Screening | ChatGPT-3.5; ChatGPT-4.0 | Baseline | Zero-shot | Breast cancer screening: Free-text score: 1.83/2, Multiple-choice score: 88.9%; Breast pain: Free-text score: 1.125/2, Multiple-choice score: 58.3% |
Ahmad 202354 | Pan-cancer | Management | BioBERT; RoBERTa | Baseline | Zero-shot | The proposed system achieved high accuracy and F1-score in predicting cancer treatment from electronic health records (EHRs) |
Benary 202355 | Pan-cancer | Management | ChatGPT-3.5; Galactica; Perplexity; BioMedLM | Baseline | Prompt Engineered | LLM F1 scores were 0.04 (BioMedLM), 0.14 (Perplexity), 0.17 (ChatGPT), and 0.19 (Galactica); LLM-generated combined treatment options and clinical treatment options yielded median (IQR) scores of 7.5 (5.3-9.0) and 8.0 (7.5-9.5) points, respectively; Manually generated options reached a median score of 2.0 (1.0-3.0) points |
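Note on reported metrics: the studies above summarize diagnostic performance with accuracy, sensitivity, specificity, precision, F1-score, and Cohen's κ (for rater agreement). As a minimal illustrative sketch, the Python snippet below shows how these quantities are derived from a 2×2 confusion matrix; the counts used are hypothetical and do not correspond to any of the cited studies.

```python
# Illustrative computation of the metrics reported in the table above.
# All counts below are hypothetical examples, not data from the cited studies.

def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return common diagnostic-performance metrics for a binary classifier."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)        # recall / true-positive rate
    specificity = tn / (tn + fp)        # true-negative rate
    precision = tp / (tp + fp)          # positive predictive value
    accuracy = (tp + tn) / total
    f1 = 2 * precision * sensitivity / (precision + sensitivity)

    # Cohen's kappa: chance-corrected agreement between predicted and true labels.
    p_observed = accuracy
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (p_observed - p_expected) / (1 - p_expected)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "kappa": kappa,
    }

# Hypothetical example: 60 true positives, 25 false positives,
# 10 false negatives, 45 true negatives.
print(binary_metrics(tp=60, fp=25, fn=10, tn=45))
```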