Author and year | Data source | Chatbot with version | Study objective and prompt formulation | Model efficacy and accuracy | Comparative performance and contextual evaluation |
---|---|---|---|---|---|
Lyu et al, 202357 | Radiology reports: 62 chest CT lung cancer scans, 76 brain MRI metastases scans from the Atrium Health Wake Forest Baptist clinical database | GPT-3 and GPT-4 | Objective: To translate radiology reports into plain language for education of patients and healthcare providers. Prompts: translate the report into plain language, provide patient suggestions, provide healthcare provider suggestions. | Average score: 4.268/5. Instances of missing information: 0.080/report. Incorrect information: 0.065/report. | ChatGPT (original prompt): 55.2% accuracy. ChatGPT (optimized prompt): 77.2%. GPT-4 (original prompt): 73.6%. GPT-4 (optimized prompt): 96.8%. Human verification: two radiologists evaluated the outputs for completeness, correctness, and overall score. |
Holmes et al, 202358 | Radiation oncology physics 100-question multiple-choice examination developed by an experienced medical physicist. | ChatGPT (GPT-3.5, GPT-4), Bard (LaMDA), BLOOMZ | Objective: Evaluate LLMs in answering specialized radiation oncology physics questions. Prompts: ChatGPT (GPT-4) was specifically tested with 2 approaches: explaining before answering and a novel approach to evaluate deductive reasoning by altering answer choices. Performance was compared individually and in a majority vote analysis. | ChatGPT GPT-4: Achieved a 67% accuracy rate in question responses with a stable 14% error rate in each trial. Displayed the highest accuracy among tested models, particularly effective when prompted for explanations before responding. Consistently high performance and notable deductive reasoning skills observed across multiple trials. | ChatGPT surpassed other LLMs as well as medical physicists. Yet, in a majority vote scenario, a collective of medical physicists demonstrated superior performance compared to ChatGPT GPT-4. |
Rao et al, 202359 | Breast cancer screening and breast pain cases (ACR Appropriateness Criteria); data size not specified | ChatGPT (GPT-3.5 and GPT-4) | | | ChatGPT-4 significantly improved over ChatGPT-3.5 in decision-making accuracy for both clinical scenarios. |
Sorin et al, 202360 | 10 consecutive early breast cancer cases from MDT discussions at a single clinic, January 2023 | ChatGPT 3.5 | | ChatGPT’s recommendations achieved 16.05% alignment with the MDT, scoring an average of 64.2 out of 400 (congruence range 0-400). | ChatGPT predominantly offered general treatment modalities and accurately identified risk factors for hereditary breast cancer. However, it sometimes provided incorrect therapy recommendations. Its responses were benchmarked against the MDT recommendations to calculate a clinical score of agreement and determine the level of concordance. |
Grünebaum et al, 202361 | 14 questions about obstetrics and gynaecology, conceived by 4 physicians | ChatGPT 3.5 | | No numerical score; qualitative comments about ChatGPT’s answers, evaluating the accuracy and relevance of responses. | No direct performance comparison with other models or human experts. ChatGPT’s responses were nuanced and informed but showed potential limitations due to outdated data. |
Yeo et al, 202362 | 164 questions about cirrhosis and HCC | ChatGPT | | High accuracy in basic knowledge, lifestyle, and treatment; 76.9% of questions answered correctly. However, it failed to specify decision-making cut-offs and treatment durations. | ChatGPT lacked knowledge of regional guidelines, such as HCC screening criteria, compared to physicians or trainees. |
Zhu et al, 202363 | 22 prostate cancer questions based on CDC and UpToDate guidelines and the authors’ clinical experience | ChatGPT-3.5, ChatGPT-4, and other LLMs | | Most LLMs achieved over 90% accuracy, except NeevaAI and Chatsonic. The free version of ChatGPT slightly outperformed the paid version. LLMs were generally comprehensive and readable. | No direct comparison with human experts, but ChatGPT (with slightly better performance by ChatGPT-3.5) showed the highest accuracy among the LLMs. |
Sorin et al, 202364 | Clinical information of 10 consecutive patients from a breast tumour board | ChatGPT-3.5 | | 70% of ChatGPT’s recommendations aligned with tumour board decisions. Grading scores for summarization, recommendation, and explanation varied, with mean scores indicating moderate to high agreement. | No direct comparison with other models or human experts, but ChatGPT’s recommendations aligned closely with those of the tumour board in a majority of cases. |
Chen et al, 202365 | 104 prompts on breast, prostate, and lung cancer based on NCCN guidelines | ChatGPT (gpt-3.5-turbo-0301) | | ChatGPT provided at least 1 NCCN-concordant recommendation for 102 out of 104 prompts (98%). However, 34.3% of these prompts also included at least partially non-concordant recommendations. The responses varied based on prompt type. | No direct comparison with other models or human experts, but the study highlighted limitations in ChatGPT’s ability to provide consistently reliable and robust cancer treatment recommendations. |
Nakamura et al, 202366 | MedTxt-RR-JA dataset, with 135 de-identified CT radiology reports | GPT-3.5 Turbo, GPT-4 | | GPT-4 outperformed GPT-3.5 Turbo in TNM staging accuracy, with GPT-4 scoring 52.2% vs 37.8% for the T category, 78.9% vs 68.9% for the N category, and 86.7% vs 67.8% for the M category. | GPT-4 outperformed GPT-3.5 Turbo, with improvements boosted by including the TNM rule. However, it struggled with numerical reasoning, particularly in cases where tumour size determined the T category. |
Truhn et al, 202467 | Two sets of pathology reports: 100 colorectal cancer reports from TCGA and 1882 neuropathology reports of adult-type diffuse gliomas from UCL | GPT-4 | | GPT-4 demonstrated high accuracy in extracting data from colorectal cancer reports, achieving 99% accuracy for T-stage, 95% for N-stage, 94% for M-stage, and 98-99% for lymph node data. In neuropathology reports, it also performed well, extracting key variables such as the Ki-67 labeling index and ATRX expression with near-perfect precision. | GPT-4 demonstrated high accuracy compared to manual data extraction, significantly reducing time and costs. However, limitations arose with low-quality scans and handwritten annotations, leading to errors in the OCR step. |
Sushil et al, 202468 | 769 breast cancer pathology reports retrieved from the UCSF clinical data warehouse, dated between January 1, 2012, and March 31, 2021 | GPT-4, GPT-3.5, Starling-7B-beta, and ClinicalCamel-70B | | GPT-4 achieved the highest average macro F1 score of 0.86 across all tasks, surpassing the best supervised model (LSTM with attention), which scored 0.75. GPT-3.5 and the open-source models performed significantly worse, with GPT-3.5 scoring 0.55, Starling 0.36, and ClinicalCamel 0.34. | GPT-4 excelled in zero-shot setups, particularly in tasks with high label imbalance, like margin status inference. However, for tasks with sufficient training data, supervised models like LSTM performed comparably. Open-source models, including Starling and ClinicalCamel, struggled to match GPT-4’s performance. |
Liang et al, 202469 | 80 RCC-related clinical questions, provided by urology experts | ChatGPT-3.5 and ChatGPT-4.0, fine-tuned GPT-3.5 Turbo | | ChatGPT-4.0 outperformed ChatGPT-3.5 with 77.5% accuracy compared to 67.08%. After iterative optimization, the fine-tuned GPT-3.5 Turbo model achieved 93.75% accuracy. | ChatGPT-4.0 showed a statistically significant improvement over ChatGPT-3.5 (P < 0.05) in answering clinical questions, though both exhibited occasional response inconsistencies. The fine-tuned model resolved these issues, achieving 100% accuracy after iterative training, underscoring the potential for optimization through domain-specific training. |
Marchi et al, 202470 | 68 hypothetical clinical cases covering various head and neck cancer stages and tumour sites, based on scenarios from the NCCN Guidelines Version 2.2024 | ChatGPT-3.5 | | | ChatGPT showed high sensitivity and accuracy in line with NCCN Guidelines across tumour sites and stages, though minor inaccuracies appeared in primary treatment recommendations. While promising as a cancer care tool, challenges remain in handling complex, patient-specific decisions. |
Gu et al, 202471 | 160 fictitious liver MRI reports created by 3 radiologists and 72 de-identified real liver MRI reports from patients at Samsung Medical Center, Seoul | GPT-4 (version gpt-4-0314) | | | GPT-4 performed slightly worse on the external test owing to the complexity of real reports, with higher error rates (4.5% vs 1.8% internal). Misinterpretation and index-lesion selection errors were identified. Despite this, GPT-4 shows strong potential for automating radiology feature extraction, though improvements are needed in handling complex cases. |
Lee et al, 202372 | 84 thyroid cancer surgical pathology reports from patients who underwent thyroid surgery between 2010 and 2022 at the Icahn School of Medicine at Mount Sinai | FastChat-T5 (3B-parameter LLM) | | Concordance rates between the LLM and human reviewers were 88.86% with Reviewer 1 (SD: 7.02%) and 89.56% with Reviewer 2 (SD: 7.20%). The LLM processed all reports in 19.56 min, compared to 206.9 min for Reviewer 1 and 124.04 min for Reviewer 2. | The LLM achieved 100% concordance for simpler tasks like lymphatic invasion and tumour location but dropped to 75% for complex tasks like cervical lymph node presence. It reduced review time significantly, though further prompt engineering is needed for complex extractions. |
Kuşcu et al, 202373 | 154 head and neck cancer-related questions compiled from various sources, including professional institutions (eg, American Head and Neck Society, National Cancer Institute), patient support groups, and social media | ChatGPT Plus, based on GPT-4 (March 2023 version) | | ChatGPT delivered “comprehensive/correct” answers for 86.4% of the questions, “incomplete/partially correct” for 11%, and “mixed” (both accurate and inaccurate) for 2.6%. No “completely inaccurate/irrelevant” answers were reported. | ChatGPT achieved 100% accuracy in cancer prevention responses and 92.6% for diagnostic questions. It also demonstrated strong reproducibility, with 94.1% consistency across repeated queries. While ChatGPT shows promise as a patient education tool and for clinical decision support, further validation and refinement are needed for medical applications. |
Gibson et al, 202474 | 8 commonly asked prostate cancer questions, derived through literature review and Google Trends | ChatGPT-4 | | The PEMAT-AI understandability score was 79.44% (SD: 10.44%), and DISCERN-AI rated the responses as “good” with a mean score of 13.88 (SD: 0.93). Readability metrics showed a Flesch Reading Ease score of 45.97 and a Gunning Fog Index of 14.55, indicating an 11th-grade reading level. The NLAT-AI assessment gave mean scores above 3.0 for accuracy, safety, appropriateness, actionability, and effectiveness, indicating general reliability in ChatGPT’s responses. | ChatGPT-4’s outputs aligned well with current prostate cancer guidelines and literature, offering higher quality than static web pages. However, limitations included readability challenges and minor hallucinations (2 incorrect references out of 30). The study concluded that while ChatGPT-4 could enhance patient education, improvements in clarity and global applicability are needed. |
Huang et al, 202475 | Two datasets: 78 valid lung cancer pathology reports from the CDSA for training and 774 valid pathology reports from TCGA for testing, after excluding invalid or duplicate reports | ChatGPT-3.5-turbo-16k, GPT-4-turbo | | ChatGPT achieved 87% accuracy for pT, 91% for pN, 76% for overall tumour stage, and 99% for histological diagnosis, with an overall accuracy of 89%. | ChatGPT-3.5-turbo outperformed NER and keyword search algorithms, which had accuracies of 76% and 51%, respectively. A comparison with GPT-4-turbo showed a 5% performance improvement, though GPT-4-turbo was more expensive. The study also highlighted the challenge of “hallucination” in ChatGPT, especially with irregular or incomplete pathology reports. |
Huang et al, 202376 | The 38th ACR radiation oncology in-training examination (TXIT) with 300 multiple-choice questions and 15 complex clinical cases from the 2022 Red Journal Gray Zone collection | ChatGPT-3.5 and ChatGPT-4 | | | ChatGPT-4 consistently outperformed ChatGPT-3.5 in both the TXIT examination and the clinical case evaluations. For complex Gray Zone cases, ChatGPT-4 offered novel treatment suggestions in 80% of cases, which human experts had not considered. However, 13.3% of its recommendations included hallucinations (plausible but incorrect responses), emphasizing the need for content verification in clinical settings. |
Dennstädt et al, 202377 | 70 radiation oncology multiple-choice questions (clinical, physics, biology) and 25 OE clinical questions, reviewed by 6 radiation oncologists | GPT-3.5-turbo | | | ChatGPT performed reasonably well in answering radiation oncology-related questions but struggled with more complex, domain-specific tasks like fractionation calculations. While ChatGPT can generate correct and useful responses, its performance is inconsistent, particularly in specialized medical areas, due to potential “hallucinations” in answers not grounded in solid evidence. |
Choi et al, 202378 | Data from 2931 breast cancer patients who underwent post-operative radiotherapy between 2020 and 2022 at Seoul National University Hospital; clinical factors were extracted from surgical pathology and ultrasound reports | ChatGPT (GPT-3.5-turbo) | | The LLM method achieved an overall accuracy of 87.7%, with factors like lymphovascular invasion reaching 98.2% accuracy, while neoadjuvant chemotherapy status had lower accuracy at 47.6%. Prompt development took 3.5 h, with 15 min for execution, costing US$95.4 including API fees. | The LLM method was significantly more time- and cost-efficient than both the full manual and LLM-assisted manual methods. The full manual method took 122.6 h and cost US$909.3, while the LLM method required just 4 h and US$95.4 to complete the same task for 2931 patients. |
Rydzewski et al, 202479 | 2044 oncology multiple-choice questions from American College of Radiology examinations (2013-2021) and a separate validation set of 50 expert-created questions to prevent data leakage | GPT-3.5, GPT-4, Claude-v1, PaLM 2, LLaMA 1 | | GPT-4 achieved the highest accuracy at 68.7% across 3 replicates, outperforming other models. LLaMA 7B, with 25.6% accuracy, performed only slightly better than random guessing. GPT-4 was the only model to surpass the 50th percentile compared to human benchmarks, while other models lagged significantly. | Model performance varied significantly, with GPT-4 and Claude-v1 outperforming others. Accuracy was lower for female-predominant cancers (eg, breast and gynecologic) compared to other cancer types. GPT-4 and Claude-v1 achieved 81.1% and 81.7% accuracy, respectively, when combining high confidence and consistent responses. GPT-4 Turbo and Gemini 1.0 Ultra excelled in the novel validation set, showcasing improvements in newer models. |
Lee et al, 202472 | The dataset, totaling 1.17 million tokens, was created by integrating prostate cancer guidelines from sources such as the Korean Prostate Society, NCCN, ASCO, and EAU | ChatGPT 3.5 | | The AI-guide bot’s performance was evaluated using Likert scales in 3 categories: comprehensibility, content accuracy, and readability, with a total average score of 90.98 ± 4.02. Comprehensibility scored 28.28 ± 0.38, accuracy 34.17 ± 2.91, and readability 28.53 ± 1.24. | Compared to ChatGPT, the AI-guide bot demonstrated superior performance in comprehensibility and readability. In evaluations by non-medical experts, the AI-guide bot scored significantly higher than ChatGPT, with P-values <0.0001 in both categories. This indicates that the AI-guide bot provided clearer and more accurate medical information, while ChatGPT, drawing from a broader dataset, was less specialized. |
Mou et al, 202480 | Pathology reports from breast cancer patients at the University Hospital Aachen | GPT-4, Mixtral-8x7B | | Across 27 examinations, GPT-4 achieved near-human correctness (0.95) and higher completeness (0.97). Mixtral-8x7B lagged in both (correctness: 0.90, completeness: 0.95), especially on complex features. | GPT-4 outperformed Mixtral-8x7B in both correctness and completeness, particularly in identifying complex features like localization and ICD-10 diagnosis. However, GPT-4 raised concerns about privacy and regulatory compliance, making open-source models like Mixtral more suitable for privacy-sensitive environments. The authors recommend using GPT-4 for performance-critical tasks and Mixtral for privacy-focused scenarios. |
This table summarizes the varying approaches to using different chatbots, highlighting the data types and sources employed, each study’s objectives and prompt formulations, and the reported efficacy and accuracy of the models. It also provides comparative context, benchmarking against other models or human experts, for a holistic view of the advances and challenges in applying chatbots to cancer research. Illustrative code sketches of the structured-extraction, evaluation, and readability-scoring workflows that recur across these studies follow the abbreviation list.
Abbreviations: LLM = large language model; GPT = generative pretrained transformer; ACR = American College of Radiology; OE = open-ended; SATA = select all that apply; MDT = multidisciplinary tumour board; TNM = tumour, node, metastasis; HCC = hepatocellular carcinoma; NCCN = National Comprehensive Cancer Network; TCGA = The Cancer Genome Atlas; UCL = University College Hospitals; UCSF = University of California, San Francisco; OCR = optical character recognition; LSTM = long short-term memory; RCC = renal cell carcinoma; HNC = head and neck cancer; CDSA = Cancer Digital Slide Archive; NER = named entity recognition; HER2 = human epidermal growth factor receptor 2; Ki67 = a marker of cell proliferation; CDC = Centers for Disease Control and Prevention; LI-RADS = Liver Imaging Reporting and Data System; AJCC = American Joint Committee on Cancer; PEMAT-AI = Patient Education Materials Assessment Tool for AI; DISCERN-AI = a tool to help healthcare consumers and practitioners evaluate the quality of healthcare treatment information; NLAT-AI = Natural Language Assessment Tool for AI; TXIT = in-training examination; CNS = central nervous system; ASCO = American Society of Clinical Oncology; EAU = European Association of Urology.
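Several of the tabulated studies (eg, Nakamura et al, Truhn et al, Choi et al, and Huang et al) extract structured fields such as TNM categories from free-text reports by prompting a chat-completion API. The snippet below is a rough, non-authoritative sketch of that general workflow, not the prompts, models, or settings used in any cited study; it assumes the OpenAI Python client (v1.x), and the model name, prompt wording, and output keys are illustrative.

```python
# Minimal sketch of LLM-based structured extraction from a free-text report.
# Assumptions: the `openai` Python package (v1.x) is installed with an API key configured;
# the model name, prompt wording, and output keys are illustrative only and do not
# reproduce the protocol of any study in the table.
import json

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You extract cancer staging information from radiology reports. "
    "Respond with only a JSON object containing the keys 'T', 'N', and 'M'. "
    "If a category cannot be determined from the report, use null."
)


def extract_tnm(report_text: str, model: str = "gpt-4") -> dict:
    """Ask the model to map a free-text report to TNM categories."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output is preferable for extraction
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": report_text},
        ],
    )
    # In practice the reply should be validated; models occasionally return
    # extra prose or malformed JSON ("hallucination" risk noted in the table).
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    demo_report = (
        "CT chest: spiculated mass in the right upper lobe measuring 4.2 cm. "
        "Enlarged ipsilateral mediastinal lymph nodes. No distant metastases identified."
    )
    print(extract_tnm(demo_report))
```

Setting a low temperature and constraining the output format mirrors the kind of prompt optimization several of the studies report as improving accuracy, but the exact effect depends on the model and report type.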
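Studies such as Sushil et al, Lee et al (2023), and Choi et al score model outputs against manual annotations using concordance or macro-averaged F1. The snippet below is a generic sketch of that kind of evaluation, assuming predictions and gold labels are available as parallel lists; the values are illustrative and it is not the evaluation code of any cited study.

```python
# Generic sketch of scoring LLM extractions against manual annotations.
# Assumptions: scikit-learn is installed, and predictions and gold labels are parallel
# lists of categorical values; the example values are illustrative, not study data.
from sklearn.metrics import accuracy_score, f1_score

gold_labels = ["T2", "T1", "T3", "T2", "T4", "T1"]  # manual (reference) annotations
llm_labels = ["T2", "T1", "T2", "T2", "T4", "T1"]   # model outputs for the same reports

concordance = accuracy_score(gold_labels, llm_labels)           # fraction of exact matches
macro_f1 = f1_score(gold_labels, llm_labels, average="macro")   # unweighted mean F1 over classes

print(f"Concordance: {concordance:.1%}, macro F1: {macro_f1:.2f}")
```

Macro averaging weights every class equally, which is why it is the metric of choice when labels are highly imbalanced (eg, the margin-status task highlighted by Sushil et al).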
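Gibson et al report Flesch Reading Ease and Gunning Fog scores for chatbot answers. These are standard published formulas, so a short worked example may help readers interpret the reported values; the word, sentence, syllable, and complex-word counts below are hypothetical, not taken from any study.

```python
# Worked example of the readability formulas referenced for Gibson et al.
# (PEMAT-AI and DISCERN-AI scores come from rater instruments and are not computed here.)
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog Index: approximate years of education needed to understand the text."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


if __name__ == "__main__":
    # Hypothetical counts for a chatbot answer to a prostate cancer question.
    words, sentences, syllables, complex_words = 220, 11, 370, 32
    print(f"Flesch Reading Ease: {flesch_reading_ease(words, sentences, syllables):.1f}")
    print(f"Gunning Fog Index:   {gunning_fog(words, sentences, complex_words):.1f}")
```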
Author and year . | Data source . | Chatbot with version . | Study objective and prompt formulation . | Model efficacy and accuracy . | Comparative performance and contextual evaluation . |
---|---|---|---|---|---|
Lyu et al, 202357 | Radiology reports: 62 chest CT lung cancer scans, 76 brain MRI metastases scans from the Atrium Health Wake Forest Baptist clinical database | GPT-3 and GPT-4 | Objective: To translate radiology reports into plain language for education of patients and healthcare providers. Prompts: translate the report into plain language, provide patient suggestions, provide healthcare provider suggestions. | Average score: 4.268/5. Instances of missing information: 0.080/report. Incorrect information: 0.065/report. | ChatGPT (original prompt): 55.2% accuracy. ChatGPT (optimized prompt): 77.2%. GPT-4 (original prompt): 73.6%. GPT-4 (optimized prompt): 96.8%. Human verification: Two radiologists’ evaluations included focusing on completeness, correctness, and overall score. |
Holmes et al, 202358 | Radiation oncology physics 100-question multiple-choice examination developed by an experienced medical physicist. | ChatGPT (GPT-3.5, GPT-4), Bard (LaMDA), BLOOMZ | Objective: Evaluate LLMs in answering specialized radiation oncology physics questions. Prompts: ChatGPT (GPT-4) was specifically tested with 2 approaches: explaining before answering and a novel approach to evaluate deductive reasoning by altering answer choices. Performance was compared individually and in a majority vote analysis. | ChatGPT GPT-4: Achieved a 67% accuracy rate in question responses with a stable 14% error rate in each trial. Displayed the highest accuracy among tested models, particularly effective when prompted for explanations before responding. Consistently high performance and notable deductive reasoning skills observed across multiple trials. | ChatGPT surpassed other LLMs as well as medical physicists. Yet, in a majority vote scenario, a collective of medical physicists demonstrated superior performance compared to ChatGPT GPT-4. |
Rao et al, 202359 | Breast cancer screening and breast pain cases (ACR Appropriateness Criteria); Data size not specified | ChatGPT (GPT-3.5 and GPT-4) |
|
| ChatGPT-4 significantly improved over ChatGPT-3.5 in decision-making accuracy for both clinical scenarios. |
Sorin et al, 202360 | 10 consecutive early breast cancer cases from MDT discussions, January 2023, at clinic. | ChatGPT 3.5 |
| ChatGPT’s recommendations achieved a 16.05% alignment with the MDT, scoring an average of 64.2 out of 400 with a congruence range from 0 to 400. | ChatGPT predominantly offered general treatment modalities and accurately identified risk factors for hereditary breast cancer. However, it sometimes provided incorrect therapy recommendations. Its responses were benchmarked against the MDT recommendations to calculate a clinical score of agreement for determining the level of concordance. |
Grünebaum et al, 202361 | 14 questions about obstetrics and gynaecology, conceived by 4 physicians | ChatGPT 3.5 |
| No numerical score; qualitative comments about ChatGPT’s answers, evaluating the accuracy and relevance of responses. | No direct performance comparison with other models or human experts. ChatGPT’s responses were nuanced and informed but showed potential limitations due to outdated data. |
Yeo et al, 202362 | 164 questions about cirrhosis and HCC | ChatGPT |
| High accuracy in basic knowledge, lifestyle, and treatment. 76.9% of questions answered correctly. However, it failed to specify decision-making cut-offs and treatment durations. | ChatGPT lacked knowledge of regional guidelines, such as HCC screening criteria, compared to physicians or trainees. |
Zhu et al, 202363 | 22 prostate cancer questions based on CDC and UpToDate guidelines; clinical experience of authors | ChatGPT-3.5, ChatGPT 4, and other LLMs |
| Most LLMs achieved over 90% accuracy, except NeevaAI and Chatsonic. The free version of ChatGPT slightly outperformed the paid version. LLMs were generally comprehensive and readable. | No direct comparison with human experts, but ChatGPT (with slightly better performance by ChatGPT 3.5) showed the highest accuracy among the LLMs |
Sorin et al, 202364 | Clinical information of 10 consecutive patients from a breast tumour board | ChatGPT-3.5 |
| 70% of ChatGPT’s recommendations aligned with tumour board decisions. Moderate to high agreement in grading scores. Grading scores for summarization, recommendation, and explanation varied, with mean scores indicating moderate to high agreement. | No direct comparison with other models or human experts, but ChatGPT’s recommendations aligned closely with those of the tumour board in a majority of cases. |
Chen et al, 202365 | 104 prompts on breast, prostate, and lung cancer based on NCCN guidelines | ChatGPT (gpt-3.5-turbo-0301) |
| ChatGPT provided at least 1 NCCN-concordant recommendation for 102 out of 104 prompts (98%). However, 34.3% of these prompts also included at least partially non-concordant recommendations. The responses varied based on prompt type. | No direct comparison with other models or human experts, but the study highlighted limitations in ChatGPT’s ability to provide consistently reliable and robust cancer treatment recommendations. |
Nakamura et al, 202366 | MedTxt-RR-JA dataset, with 135 de-identified CT radiology reports | GPT-3.5 Turbo, GPT-4 |
| GPT-4 outperformed GPT-3.5 Turbo in TNM staging accuracy, with GPT-4 scoring 52.2% vs 37.8% for the T category, 78.9% vs 68.9% for the N category, and 86.7% vs 67.8% for the M category. | GPT-4 outperformed GPT-3.5 Turbo, with improvements boosted by including the TNM rule. However, struggled with numerical reasoning, particularly in cases where tumour size determined the T category. |
Truhn et al, 202467 | Two sets of pathology reports: 100 colorectal cancer reports from TCGA. 1882 neuropathology reports of adult-type diffuse gliomas from the UCL | GPT-4 |
| GPT-4 demonstrated high accuracy in extracting data from colorectal cancer reports, achieving 99% accuracy for T-stage, 95% for N-stage, 94% for M-stage, and 98-99% for lymph node data. In neuropathology reports, it also performed exceptionally well, accurately extracting key variables such as the Ki-67 labeling index and ATRX expression with near-perfect precision. | GPT-4 demonstrated high accuracy compared to manual data extraction, significantly reducing time and costs. However, limitations arose with low-quality scans and handwritten annotations, leading to errors in the OCR step. |
Sushil et al, 202468 | 769 breast cancer pathology reports were retrieved from the UCSF clinical data warehouse, dated between January 1, 2012, and March 31, 2021 | GPT-4, GPT-3.5, Starling-7B-beta, and ClinicalCamel-70B |
| GPT-4 achieved the highest average macro F1 score of 0.86 across all tasks, surpassing the best supervised model (LSTM with attention), which scored 0.75. GPT-3.5 and other open-source models performed significantly worse, with GPT-3.5 scoring 0.55, Starling 0.36, and ClinicalCamel 0.34. | GPT-4 excelled in zero-shot setups, particularly in tasks with high label imbalance, like margin status inference. However, for tasks with sufficient training data, supervised models like LSTM performed comparably. Open-source models, including Starling and ClinicalCamel, struggled to match GPT-4’s performance. |
Liang et al, 202469 | 80 RCC-related clinical questions, provided by urology experts | ChatGPT-3.5 and ChatGPT-4.0, fine-tuned GPT-3.5 Turbo |
| ChatGPT-4.0 outperformed ChatGPT-3.5 with 77.5% accuracy compared to 67.08%. After iterative optimization, the fine-tuned GPT-3.5 Turbo model achieved 93.75% accuracy. | ChatGPT-4.0 showed a statistically significant improvement over ChatGPT-3.5 (P < 0.05) in answering clinical questions, though both exhibited occasional response inconsistencies. The fine-tuned model resolved these issues, achieving 100% accuracy after iterative training, underscoring the potential for optimization through domain-specific training. |
Marchi et al, 202470 | 68 hypothetical clinical cases covering various head and neck cancer stages and tumour sites, based on scenarios from the NCCN Guidelines Version 2.2024 | ChatGPT-3.5 |
|
| ChatGPT showed high sensitivity and accuracy in line with NCCN Guidelines across tumour sites and stages, though minor inaccuracies appeared in primary treatment. While promising as a cancer care tool, challenges remain in handling complex, patient-specific decisions. |
Gu et al, 202471 | The study used 160 fictitious liver MRI reports created by 3 radiologists and 72 de-identified real liver MRI reports from patients at Samsung Medical Center, Seoul. | GPT-4 (version gpt-4-0314) |
|
| GPT-4 performed slightly lower on external tests due to real report complexity, with higher error rates (4.5% vs 1.8% internal). Misinterpretation and index lesion selection errors were identified. Despite this, GPT-4 shows strong potential for automating radiology feature extraction, though improvements are needed in handling complex cases. |
Lee et al, 202372 | 84 thyroid cancer surgical pathology reports from patients who underwent thyroid surgery between 2010 and 2022 at the Icahn School of Medicine at Mount Sinai. | FastChat-T5 (3B-parameter LLM) |
| Concordance rates between the LLM and human reviewers were 88.86% with Reviewer 1 (SD: 7.02%) and 89.56% with Reviewer 2 (SD: 7.20%). The LLM processed all reports in 19.56 min, compared to 206.9 min for Reviewer 1 and 124.04 min for Reviewer 2. | The LLM achieved 100% concordance for simpler tasks like lymphatic invasion and tumour location but dropped to 75% for complex tasks like cervical lymph node presence. It reduced review time significantly, though further prompt engineering is needed for complex extractions. |
Kuşcu et al, 202373 | 154 head and neck cancer-related questions were compiled from various sources, including professional institutions (eg, American Head and Neck Society, National Cancer Institute), patient support groups, and social media. | ChatGPT Plus, based on GPT-4 (March 2023 version) |
| ChatGPT delivered “comprehensive/correct” answers for 86.4% of the questions, “incomplete/partially correct” for 11%, and “mixed” (both accurate and inaccurate) for 2.6%. No “completely inaccurate/irrelevant” answers were reported. | ChatGPT achieved 100% accuracy in cancer prevention responses and 92.6% for diagnostic questions. It also demonstrated strong reproducibility, with 94.1% consistency across repeated queries. While ChatGPT shows promise as a patient education tool and for clinical decision support, further validation and refinement are needed for medical applications. |
Gibson et al, 202474 | The dataset included 8 commonly asked prostate cancer questions, derived through literature review and Google Trends. | ChatGPT-4 |
| The PEMAT-AI understandability score was 79.44% (SD: 10.44%), and DISCERN-AI rated the responses as “good” with a mean score of 13.88 (SD: 0.93). Readability algorithm, Flesch Reading Ease score of 45.97, and a Gunning Fog Index of 14.55, indicating an 11th-grade reading level. The NLAT-AI assessment gave mean scores above 3.0 for accuracy, safety, appropriateness, actionability, and effectiveness, indicating general reliability in ChatGPT’s responses. | ChatGPT-4’s outputs aligned well with current prostate cancer guidelines and literature, offering higher quality than static web pages. However, limitations included readability challenges and minor hallucinations (2 incorrect references out of 30). The study concluded that while ChatGPT-4 could enhance patient education, improvements in clarity and global applicability are needed. |
Huang et al, 202475 | Data was sourced from 2 main datasets: 78 valid lung cancer pathology reports from the CDSA for training. 774 valid pathology reports from TCGA for testing, after excluding invalid or duplicate reports. | ChatGPT-3.5-turbo-16k, GPT-4-turbo |
| ChatGPT achieved 87% accuracy for pT, 91% for pN, 76% for overall tumour stage, and 99% for histological diagnosis, with an overall accuracy of 89%. | ChatGPT-3.5-turbo outperformed NER and keyword search algorithms, which had accuracies of 76% and 51%, respectively. A comparison with GPT-4-turbo showed a 5% performance improvement, though GPT-4-turbo was more expensive. The study also highlighted the challenge of “hallucination” in ChatGPT, especially with irregular or incomplete pathology reports. |
Huang et al, 202376 | The data included the 38th ACR radiation oncology in-training examination (TXIT) with 300 multiple-choice questions and 15 complex clinical cases from the 2022 Red Journal Gray Zone collection. | ChatGPT-3.5 and ChatGPT-4 |
|
| Compared to ChatGPT-3.5, ChatGPT-4 consistently outperformed in both the TXIT examination and clinical case evaluations. For complex Gray Zone cases, ChatGPT-4 offered novel treatment suggestions in 80% of cases, which human experts had not considered. However, 13.3% of its recommendations included hallucinations (plausible but incorrect responses), emphasizing the need for content verification in clinical settings. |
Dennstädt et al, 202377 | 70 radiation oncology multiple-choice questions (clinical, physics, biology) and 25 OE clinical questions, reviewed by 6 radiation oncologists. | GPT-3.5-turbo |
|
| ChatGPT performed reasonably well in answering radiation oncology-related questions but struggled with more complex, domain-specific tasks like fractionation calculations. While ChatGPT can generate correct and useful responses, its performance is inconsistent, particularly in specialized medical areas, due to potential “hallucinations” in answers not grounded in solid evidence. |
Choi et al, 202378 | Data from 2931 breast cancer patients were collected who underwent post-operative radiotherapy between 2020 and 2022 at Seoul National University Hospital. Clinical factors were extracted from surgical pathology and ultrasound reports. | ChatGPT (GPT-3.5-turbo) |
| The LLM method achieved an overall accuracy of 87.7%, with factors like lymphovascular invasion reaching 98.2% accuracy, while neoadjuvant chemotherapy status had lower accuracy at 47.6%. Prompt development took 3.5 h, with 15 min for execution, costing US$95.4, including API fees. | LLM was significantly more time- and cost-efficient than both the full manual and LLM-assisted manual methods. The full manual method took 122.6 h and cost US$909.3, while the LLM method required just 4 h and US$95.4 to complete the same task for 2931 patients. |
Rydzewski et al, 202479 | 2044 oncology multiple-choice questions from American College of Radiology examinations (2013-2021) and a separate validation set of 50 expert-created questions to prevent data leakage. | GPT-3.5, GPT-4, Claude-v1, PaLM 2, LLaMA 1 |
| GPT-4 achieved the highest accuracy at 68.7% across 3 replicates, outperforming other models. LLaMA 7B, with 25.6% accuracy, performed only slightly better than random guessing. GPT-4 was the only model to surpass the 50th percentile compared to human benchmarks, while other models lagged significantly. | Model performance varied significantly, with GPT-4 and Claude-v1 outperforming others. Accuracy was lower for female-predominant cancers (eg, breast and gynecologic) compared to other cancer types. GPT-4 and Claude-v1 achieved 81.1% and 81.7% accuracy, respectively, when combining high confidence and consistent responses. GPT-4 Turbo and Gemini 1.0 Ultra excelled in the novel validation set, showcasing improvements in newer models. |
Lee et al, 202472 | The dataset, totaling 1.17 million tokens, was created by integrating prostate cancer guidelines from sources such as the Korean Prostate Society, NCCN, ASCO, and EAU. | ChatGPT 3.5 |
| The AI-guide bot’s performance was evaluated using Likert scales in 3 categories: comprehensibility, content accuracy, and readability, with a total average score of 90.98 ± 4.02. Comprehensibility scored 28.28 ± 0.38, accuracy 34.17 ± 2.91, and readability 28.53 ± 1.24. | Compared to ChatGPT, the AI-guide bot demonstrated superior performance in comprehensibility and readability. In evaluations by non-medical experts, the AI-guide bot scored significantly higher than ChatGPT, with P-values <0.0001 in both categories. This indicates that the AI-guide bot provided clearer and more accurate medical information, while ChatGPT, drawing from a broader dataset, was less specialized. |
Mou et al, 202480 | Pathology reports from breast cancer patients at the University Hospital Aachen | GPT-4, Mixtral-8 × 7B |
| Across 27 examinations, GPT-4 achieved near-human correctness (0.95) and higher completeness (0.97). Mixtral-8 × 7B lagged in both (correctness: 0.90, completeness: 0.95), especially on complex features. | GPT-4 outperformed Mixtral-8 × 7B in both correctness and completeness, particularly in identifying complex features like localization and ICD-10 diagnosis. However, GPT-4 raised concerns about privacy and regulatory compliance, making open-source models like Mixtral more suitable for privacy-sensitive environments. The authors recommend using GPT-4 for performance-critical tasks and Mixtral for privacy-focused scenarios. |
Author and year . | Data source . | Chatbot with version . | Study objective and prompt formulation . | Model efficacy and accuracy . | Comparative performance and contextual evaluation . |
---|---|---|---|---|---|
Lyu et al, 202357 | Radiology reports: 62 chest CT lung cancer scans, 76 brain MRI metastases scans from the Atrium Health Wake Forest Baptist clinical database | GPT-3 and GPT-4 | Objective: To translate radiology reports into plain language for education of patients and healthcare providers. Prompts: translate the report into plain language, provide patient suggestions, provide healthcare provider suggestions. | Average score: 4.268/5. Instances of missing information: 0.080/report. Incorrect information: 0.065/report. | ChatGPT (original prompt): 55.2% accuracy. ChatGPT (optimized prompt): 77.2%. GPT-4 (original prompt): 73.6%. GPT-4 (optimized prompt): 96.8%. Human verification: Two radiologists’ evaluations included focusing on completeness, correctness, and overall score. |
Holmes et al, 202358 | Radiation oncology physics 100-question multiple-choice examination developed by an experienced medical physicist. | ChatGPT (GPT-3.5, GPT-4), Bard (LaMDA), BLOOMZ | Objective: Evaluate LLMs in answering specialized radiation oncology physics questions. Prompts: ChatGPT (GPT-4) was specifically tested with 2 approaches: explaining before answering and a novel approach to evaluate deductive reasoning by altering answer choices. Performance was compared individually and in a majority vote analysis. | ChatGPT GPT-4: Achieved a 67% accuracy rate in question responses with a stable 14% error rate in each trial. Displayed the highest accuracy among tested models, particularly effective when prompted for explanations before responding. Consistently high performance and notable deductive reasoning skills observed across multiple trials. | ChatGPT surpassed other LLMs as well as medical physicists. Yet, in a majority vote scenario, a collective of medical physicists demonstrated superior performance compared to ChatGPT GPT-4. |
Rao et al, 202359 | Breast cancer screening and breast pain cases (ACR Appropriateness Criteria); Data size not specified | ChatGPT (GPT-3.5 and GPT-4) |
|
| ChatGPT-4 significantly improved over ChatGPT-3.5 in decision-making accuracy for both clinical scenarios. |
Sorin et al, 202360 | 10 consecutive early breast cancer cases from MDT discussions, January 2023, at clinic. | ChatGPT 3.5 |
| ChatGPT’s recommendations achieved a 16.05% alignment with the MDT, scoring an average of 64.2 out of 400 with a congruence range from 0 to 400. | ChatGPT predominantly offered general treatment modalities and accurately identified risk factors for hereditary breast cancer. However, it sometimes provided incorrect therapy recommendations. Its responses were benchmarked against the MDT recommendations to calculate a clinical score of agreement for determining the level of concordance. |
Grünebaum et al, 202361 | 14 questions about obstetrics and gynaecology, conceived by 4 physicians | ChatGPT 3.5 |
| No numerical score; qualitative comments about ChatGPT’s answers, evaluating the accuracy and relevance of responses. | No direct performance comparison with other models or human experts. ChatGPT’s responses were nuanced and informed but showed potential limitations due to outdated data. |
Yeo et al, 202362 | 164 questions about cirrhosis and HCC | ChatGPT | | High accuracy in basic knowledge, lifestyle, and treatment. 76.9% of questions answered correctly. However, it failed to specify decision-making cut-offs and treatment durations. | ChatGPT lacked knowledge of regional guidelines, such as HCC screening criteria, compared to physicians or trainees. |
Zhu et al, 202363 | 22 prostate cancer questions based on CDC and UpToDate guidelines; clinical experience of authors | ChatGPT-3.5, ChatGPT-4, and other LLMs | | Most LLMs achieved over 90% accuracy, except NeevaAI and Chatsonic. The free version of ChatGPT slightly outperformed the paid version. LLMs were generally comprehensive and readable. | No direct comparison with human experts, but ChatGPT (with slightly better performance by ChatGPT 3.5) showed the highest accuracy among the LLMs. |
Sorin et al, 202364 | Clinical information of 10 consecutive patients from a breast tumour board | ChatGPT-3.5 | | 70% of ChatGPT’s recommendations aligned with tumour board decisions. Grading scores for summarization, recommendation, and explanation varied, with mean scores indicating moderate to high agreement. | No direct comparison with other models or human experts, but ChatGPT’s recommendations aligned closely with those of the tumour board in a majority of cases. |
Chen et al, 202365 | 104 prompts on breast, prostate, and lung cancer based on NCCN guidelines | ChatGPT (gpt-3.5-turbo-0301) | | ChatGPT provided at least 1 NCCN-concordant recommendation for 102 out of 104 prompts (98%). However, 34.3% of these prompts also included at least partially non-concordant recommendations. The responses varied based on prompt type. | No direct comparison with other models or human experts, but the study highlighted limitations in ChatGPT’s ability to provide consistently reliable and robust cancer treatment recommendations. |
Nakamura et al, 202366 | MedTxt-RR-JA dataset, with 135 de-identified CT radiology reports | GPT-3.5 Turbo, GPT-4 | | GPT-4 outperformed GPT-3.5 Turbo in TNM staging accuracy, with GPT-4 scoring 52.2% vs 37.8% for the T category, 78.9% vs 68.9% for the N category, and 86.7% vs 67.8% for the M category. | GPT-4 outperformed GPT-3.5 Turbo, with improvements boosted by including the TNM rule. However, it struggled with numerical reasoning, particularly in cases where tumour size determined the T category. |
Truhn et al, 202467 | Two sets of pathology reports: 100 colorectal cancer reports from TCGA and 1882 neuropathology reports of adult-type diffuse gliomas from UCL | GPT-4 | | GPT-4 demonstrated high accuracy in extracting data from colorectal cancer reports, achieving 99% accuracy for T-stage, 95% for N-stage, 94% for M-stage, and 98-99% for lymph node data. In neuropathology reports, it also performed exceptionally well, accurately extracting key variables such as the Ki-67 labeling index and ATRX expression with near-perfect precision. | GPT-4 demonstrated high accuracy compared to manual data extraction, significantly reducing time and costs. However, limitations arose with low-quality scans and handwritten annotations, leading to errors in the OCR step. |
Sushil et al, 202468 | 769 breast cancer pathology reports were retrieved from the UCSF clinical data warehouse, dated between January 1, 2012, and March 31, 2021 | GPT-4, GPT-3.5, Starling-7B-beta, and ClinicalCamel-70B | | GPT-4 achieved the highest average macro F1 score of 0.86 across all tasks, surpassing the best supervised model (LSTM with attention), which scored 0.75. GPT-3.5 and other open-source models performed significantly worse, with GPT-3.5 scoring 0.55, Starling 0.36, and ClinicalCamel 0.34. | GPT-4 excelled in zero-shot setups, particularly in tasks with high label imbalance, like margin status inference. However, for tasks with sufficient training data, supervised models like LSTM performed comparably. Open-source models, including Starling and ClinicalCamel, struggled to match GPT-4’s performance. |
Liang et al, 202469 | 80 RCC-related clinical questions, provided by urology experts | ChatGPT-3.5, ChatGPT-4.0, and a fine-tuned GPT-3.5 Turbo | | ChatGPT-4.0 outperformed ChatGPT-3.5 with 77.5% accuracy compared to 67.08%. After iterative optimization, the fine-tuned GPT-3.5 Turbo model achieved 93.75% accuracy. | ChatGPT-4.0 showed a statistically significant improvement over ChatGPT-3.5 (P < 0.05) in answering clinical questions, though both exhibited occasional response inconsistencies. The fine-tuned model resolved these issues, achieving 100% accuracy after iterative training, underscoring the potential for optimization through domain-specific training. |
Marchi et al, 202470 | 68 hypothetical clinical cases covering various head and neck cancer stages and tumour sites, based on scenarios from the NCCN Guidelines Version 2.2024 | ChatGPT-3.5 | | | ChatGPT showed high sensitivity and accuracy in line with NCCN Guidelines across tumour sites and stages, though minor inaccuracies appeared in primary treatment. While promising as a cancer care tool, challenges remain in handling complex, patient-specific decisions. |
Gu et al, 202471 | The study used 160 fictitious liver MRI reports created by 3 radiologists and 72 de-identified real liver MRI reports from patients at Samsung Medical Center, Seoul. | GPT-4 (version gpt-4-0314) | | | GPT-4 performed slightly lower on external tests due to real report complexity, with higher error rates (4.5% vs 1.8% internal). Misinterpretation and index lesion selection errors were identified. Despite this, GPT-4 shows strong potential for automating radiology feature extraction, though improvements are needed in handling complex cases. |
Lee et al, 202372 | 84 thyroid cancer surgical pathology reports from patients who underwent thyroid surgery between 2010 and 2022 at the Icahn School of Medicine at Mount Sinai. | FastChat-T5 (3B-parameter LLM) | | Concordance rates between the LLM and human reviewers were 88.86% with Reviewer 1 (SD: 7.02%) and 89.56% with Reviewer 2 (SD: 7.20%). The LLM processed all reports in 19.56 min, compared to 206.9 min for Reviewer 1 and 124.04 min for Reviewer 2. | The LLM achieved 100% concordance for simpler tasks like lymphatic invasion and tumour location but dropped to 75% for complex tasks like cervical lymph node presence. It reduced review time significantly, though further prompt engineering is needed for complex extractions. |
Kuşcu et al, 202373 | 154 head and neck cancer-related questions were compiled from various sources, including professional institutions (eg, American Head and Neck Society, National Cancer Institute), patient support groups, and social media. | ChatGPT Plus, based on GPT-4 (March 2023 version) | | ChatGPT delivered “comprehensive/correct” answers for 86.4% of the questions, “incomplete/partially correct” for 11%, and “mixed” (both accurate and inaccurate) for 2.6%. No “completely inaccurate/irrelevant” answers were reported. | ChatGPT achieved 100% accuracy in cancer prevention responses and 92.6% for diagnostic questions. It also demonstrated strong reproducibility, with 94.1% consistency across repeated queries. While ChatGPT shows promise as a patient education tool and for clinical decision support, further validation and refinement are needed for medical applications. |
Gibson et al, 202474 | The dataset included 8 commonly asked prostate cancer questions, derived through literature review and Google Trends. | ChatGPT-4 | | The PEMAT-AI understandability score was 79.44% (SD: 10.44%), and DISCERN-AI rated the responses as “good” with a mean score of 13.88 (SD: 0.93). Readability analysis gave a Flesch Reading Ease score of 45.97 and a Gunning Fog Index of 14.55, indicating an 11th-grade reading level. The NLAT-AI assessment gave mean scores above 3.0 for accuracy, safety, appropriateness, actionability, and effectiveness, indicating general reliability in ChatGPT’s responses. | ChatGPT-4’s outputs aligned well with current prostate cancer guidelines and literature, offering higher quality than static web pages. However, limitations included readability challenges and minor hallucinations (2 incorrect references out of 30). The study concluded that while ChatGPT-4 could enhance patient education, improvements in clarity and global applicability are needed. |
Huang et al, 202475 | Data were sourced from 2 main datasets: 78 valid lung cancer pathology reports from the CDSA for training and 774 valid pathology reports from TCGA for testing, after excluding invalid or duplicate reports. | ChatGPT-3.5-turbo-16k, GPT-4-turbo | | ChatGPT achieved 87% accuracy for pT, 91% for pN, 76% for overall tumour stage, and 99% for histological diagnosis, with an overall accuracy of 89%. | ChatGPT-3.5-turbo outperformed NER and keyword search algorithms, which had accuracies of 76% and 51%, respectively. A comparison with GPT-4-turbo showed a 5% performance improvement, though GPT-4-turbo was more expensive. The study also highlighted the challenge of “hallucination” in ChatGPT, especially with irregular or incomplete pathology reports. |
Huang et al, 202376 | The data included the 38th ACR radiation oncology in-training examination (TXIT) with 300 multiple-choice questions and 15 complex clinical cases from the 2022 Red Journal Gray Zone collection. | ChatGPT-3.5 and ChatGPT-4 | | | ChatGPT-4 consistently outperformed ChatGPT-3.5 in both the TXIT examination and clinical case evaluations. For complex Gray Zone cases, ChatGPT-4 offered novel treatment suggestions in 80% of cases, which human experts had not considered. However, 13.3% of its recommendations included hallucinations (plausible but incorrect responses), emphasizing the need for content verification in clinical settings. |
Dennstädt et al, 202377 | 70 radiation oncology multiple-choice questions (clinical, physics, biology) and 25 OE clinical questions, reviewed by 6 radiation oncologists. | GPT-3.5-turbo | | | ChatGPT performed reasonably well in answering radiation oncology-related questions but struggled with more complex, domain-specific tasks like fractionation calculations. While ChatGPT can generate correct and useful responses, its performance is inconsistent, particularly in specialized medical areas, due to potential “hallucinations” in answers not grounded in solid evidence. |
Choi et al, 202378 | Data were collected from 2931 breast cancer patients who underwent post-operative radiotherapy between 2020 and 2022 at Seoul National University Hospital. Clinical factors were extracted from surgical pathology and ultrasound reports. | ChatGPT (GPT-3.5-turbo) | | The LLM method achieved an overall accuracy of 87.7%, with factors like lymphovascular invasion reaching 98.2% accuracy, while neoadjuvant chemotherapy status had lower accuracy at 47.6%. Prompt development took 3.5 h, with 15 min for execution, costing US$95.4, including API fees. | The LLM method was significantly more time- and cost-efficient than both the full manual and LLM-assisted manual methods. The full manual method took 122.6 h and cost US$909.3, while the LLM method required just 4 h and US$95.4 to complete the same task for 2931 patients. |
Rydzewski et al, 202479 | 2044 oncology multiple-choice questions from American College of Radiology examinations (2013-2021) and a separate validation set of 50 expert-created questions to prevent data leakage. | GPT-3.5, GPT-4, Claude-v1, PaLM 2, LLaMA 1 | | GPT-4 achieved the highest accuracy at 68.7% across 3 replicates, outperforming other models. LLaMA 7B, with 25.6% accuracy, performed only slightly better than random guessing. GPT-4 was the only model to surpass the 50th percentile compared to human benchmarks, while other models lagged significantly. | Model performance varied significantly, with GPT-4 and Claude-v1 outperforming others. Accuracy was lower for female-predominant cancers (eg, breast and gynecologic) compared to other cancer types. GPT-4 and Claude-v1 achieved 81.1% and 81.7% accuracy, respectively, when combining high confidence and consistent responses. GPT-4 Turbo and Gemini 1.0 Ultra excelled in the novel validation set, showcasing improvements in newer models. |
Lee et al, 202472 | The dataset, totaling 1.17 million tokens, was created by integrating prostate cancer guidelines from sources such as the Korean Prostate Society, NCCN, ASCO, and EAU. | ChatGPT 3.5 | | The AI-guide bot’s performance was evaluated using Likert scales in 3 categories: comprehensibility, content accuracy, and readability, with a total average score of 90.98 ± 4.02. Comprehensibility scored 28.28 ± 0.38, accuracy 34.17 ± 2.91, and readability 28.53 ± 1.24. | Compared to ChatGPT, the AI-guide bot demonstrated superior performance in comprehensibility and readability. In evaluations by non-medical experts, the AI-guide bot scored significantly higher than ChatGPT, with P-values <0.0001 in both categories. This indicates that the AI-guide bot provided clearer and more accurate medical information, while ChatGPT, drawing from a broader dataset, was less specialized. |
Mou et al, 202480 | Pathology reports from breast cancer patients at the University Hospital Aachen | GPT-4, Mixtral-8x7B | | Across 27 examinations, GPT-4 achieved near-human correctness (0.95) and higher completeness (0.97). Mixtral-8x7B lagged in both (correctness: 0.90, completeness: 0.95), especially on complex features. | GPT-4 outperformed Mixtral-8x7B in both correctness and completeness, particularly in identifying complex features like localization and ICD-10 diagnosis. However, GPT-4 raised concerns about privacy and regulatory compliance, making open-source models like Mixtral more suitable for privacy-sensitive environments. The authors recommend using GPT-4 for performance-critical tasks and Mixtral for privacy-focused scenarios. |
This table summarizes how different chatbots have been applied across studies, detailing the data types and sources employed, each study’s objectives and prompt formulations, and the reported efficacy and accuracy of the models. It also provides comparative context by benchmarking the chatbots against other models or human experts, offering an overall view of the advances and challenges in applying chatbots in cancer research.
Abbreviations: LLM = large language model; GPT = generative pretrained transformer; ACR = American College of Radiology; OE = open-ended; SATA = select all that apply; MDT = multidisciplinary team (tumour board); TNM = tumour, node, metastasis; HCC = hepatocellular carcinoma; NCCN = National Comprehensive Cancer Network; TCGA = The Cancer Genome Atlas; UCL = University College London; UCSF = University of California, San Francisco; OCR = optical character recognition; LSTM = long short-term memory; RCC = renal cell carcinoma; HNC = head and neck cancer; CDSA = Cancer Digital Slide Archive; NER = named entity recognition; HER2 = human epidermal growth factor receptor 2; Ki67 = a marker of cell proliferation; CDC = Centers for Disease Control and Prevention; LI-RADS = Liver Imaging Reporting and Data System; AJCC = American Joint Committee on Cancer; PEMAT-AI = Patient Education Materials Assessment Tool for AI; DISCERN-AI = a tool to help healthcare consumers and practitioners evaluate the quality of healthcare treatment information; NLAT-AI = Natural Language Assessment Tool for AI; TXIT = in-training examination; CNS = central nervous system; ASCO = American Society of Clinical Oncology; EAU = European Association of Urology.