Author and year | Data source | Chatbot with version | Study objective and prompt formulation | Model efficacy and accuracy | Comparative performance and contextual evaluation |
---|---|---|---|---|---|
Lyu et al, 202357 | Radiology reports: 62 chest CT lung cancer scans, 76 brain MRI metastases scans from the Atrium Health Wake Forest Baptist clinical database | GPT-3 and GPT-4 | Objective: To translate radiology reports into plain language for education of patients and healthcare providers. Prompts: translate the report into plain language, provide patient suggestions, provide healthcare provider suggestions. | Average score: 4.268/5. Instances of missing information: 0.080/report. Incorrect information: 0.065/report. | ChatGPT (original prompt): 55.2% accuracy. ChatGPT (optimized prompt): 77.2%. GPT-4 (original prompt): 73.6%. GPT-4 (optimized prompt): 96.8%. Human verification: two radiologists evaluated the outputs for completeness, correctness, and overall score. |
Holmes et al, 202358 | Radiation oncology physics 100-question multiple-choice examination developed by an experienced medical physicist. | ChatGPT (GPT-3.5, GPT-4), Bard (LaMDA), BLOOMZ | Objective: Evaluate LLMs in answering specialized radiation oncology physics questions. Prompts: ChatGPT (GPT-4) was specifically tested with 2 approaches: explaining before answering and a novel approach to evaluate deductive reasoning by altering answer choices. Performance was compared individually and in a majority vote analysis. | ChatGPT GPT-4: Achieved a 67% accuracy rate in question responses with a stable 14% error rate in each trial. Displayed the highest accuracy among tested models, particularly effective when prompted for explanations before responding. Consistently high performance and notable deductive reasoning skills observed across multiple trials. | ChatGPT surpassed other LLMs as well as medical physicists. Yet, in a majority vote scenario, a collective of medical physicists demonstrated superior performance compared to ChatGPT GPT-4. |
Rao et al, 202359 | Breast cancer screening and breast pain cases (ACR Appropriateness Criteria); data size not specified | ChatGPT (GPT-3.5 and GPT-4) | | | ChatGPT-4 significantly improved over ChatGPT-3.5 in decision-making accuracy for both clinical scenarios. |
Sorin et al, 202360 | 10 consecutive early breast cancer cases from MDT discussions at a single clinic, January 2023 | ChatGPT 3.5 | | ChatGPT’s recommendations achieved 16.05% alignment with the MDT, scoring an average of 64.2 out of 400 (congruence range 0-400). | ChatGPT predominantly offered general treatment modalities and accurately identified risk factors for hereditary breast cancer. However, it sometimes provided incorrect therapy recommendations. Its responses were benchmarked against the MDT recommendations to calculate a clinical score of agreement and determine the level of concordance. |
Grünebaum et al, 202361 | 14 questions about obstetrics and gynaecology, conceived by 4 physicians | ChatGPT 3.5 | | No numerical score; qualitative comments about ChatGPT’s answers, evaluating the accuracy and relevance of responses. | No direct performance comparison with other models or human experts. ChatGPT’s responses were nuanced and informed but showed potential limitations due to outdated data. |
Yeo et al, 202362 | 164 questions about cirrhosis and HCC | ChatGPT | | High accuracy in basic knowledge, lifestyle, and treatment; 76.9% of questions answered correctly. However, it failed to specify decision-making cut-offs and treatment durations. | ChatGPT lacked knowledge of regional guidelines, such as HCC screening criteria, compared to physicians or trainees. |
Zhu et al, 202363 | 22 prostate cancer questions based on CDC and UpToDate guidelines and the authors’ clinical experience | ChatGPT-3.5, ChatGPT-4, and other LLMs | | Most LLMs achieved over 90% accuracy, except NeevaAI and Chatsonic. The free version of ChatGPT slightly outperformed the paid version. LLMs were generally comprehensive and readable. | No direct comparison with human experts, but ChatGPT (with slightly better performance by ChatGPT-3.5) showed the highest accuracy among the LLMs. |
Sorin et al, 202364 | Clinical information of 10 consecutive patients from a breast tumour board | ChatGPT-3.5 | | 70% of ChatGPT’s recommendations aligned with tumour board decisions. Grading scores for summarization, recommendation, and explanation varied, with mean scores indicating moderate to high agreement. | No direct comparison with other models or human experts, but ChatGPT’s recommendations aligned closely with those of the tumour board in a majority of cases. |
Chen et al, 202365 | 104 prompts on breast, prostate, and lung cancer based on NCCN guidelines | ChatGPT (gpt-3.5-turbo-0301) | | ChatGPT provided at least 1 NCCN-concordant recommendation for 102 out of 104 prompts (98%). However, 34.3% of these prompts also included at least partially non-concordant recommendations. The responses varied based on prompt type. | No direct comparison with other models or human experts, but the study highlighted limitations in ChatGPT’s ability to provide consistently reliable and robust cancer treatment recommendations. |
Nakamura et al, 202366 | MedTxt-RR-JA dataset, with 135 de-identified CT radiology reports | GPT-3.5 Turbo, GPT-4 | | GPT-4 outperformed GPT-3.5 Turbo in TNM staging accuracy, with GPT-4 scoring 52.2% vs 37.8% for the T category, 78.9% vs 68.9% for the N category, and 86.7% vs 67.8% for the M category. | GPT-4 outperformed GPT-3.5 Turbo, with improvements boosted by including the TNM rule. However, it struggled with numerical reasoning, particularly in cases where tumour size determined the T category. |
Truhn et al, 202467 | Two sets of pathology reports: 100 colorectal cancer reports from TCGA and 1882 neuropathology reports of adult-type diffuse gliomas from UCL | GPT-4 | | GPT-4 demonstrated high accuracy in extracting data from colorectal cancer reports, achieving 99% accuracy for T-stage, 95% for N-stage, 94% for M-stage, and 98-99% for lymph node data. In neuropathology reports, it also performed well, extracting key variables such as the Ki-67 labeling index and ATRX expression with near-perfect precision. | GPT-4 demonstrated high accuracy compared to manual data extraction, significantly reducing time and costs. However, limitations arose with low-quality scans and handwritten annotations, leading to errors in the OCR step. |
Sushil et al, 202468 | 769 breast cancer pathology reports retrieved from the UCSF clinical data warehouse, dated between January 1, 2012, and March 31, 2021 | GPT-4, GPT-3.5, Starling-7B-beta, and ClinicalCamel-70B | | GPT-4 achieved the highest average macro F1 score of 0.86 across all tasks, surpassing the best supervised model (LSTM with attention), which scored 0.75. GPT-3.5 and the open-source models performed significantly worse, with GPT-3.5 scoring 0.55, Starling 0.36, and ClinicalCamel 0.34. | GPT-4 excelled in zero-shot setups, particularly in tasks with high label imbalance, like margin status inference. However, for tasks with sufficient training data, supervised models like LSTM performed comparably. Open-source models, including Starling and ClinicalCamel, struggled to match GPT-4’s performance. |
Liang et al, 202469 | 80 RCC-related clinical questions, provided by urology experts | ChatGPT-3.5 and ChatGPT-4.0, fine-tuned GPT-3.5 Turbo | | ChatGPT-4.0 outperformed ChatGPT-3.5 with 77.5% accuracy compared to 67.08%. After iterative optimization, the fine-tuned GPT-3.5 Turbo model achieved 93.75% accuracy. | ChatGPT-4.0 showed a statistically significant improvement over ChatGPT-3.5 (P < 0.05) in answering clinical questions, though both exhibited occasional response inconsistencies. The fine-tuned model resolved these issues, achieving 100% accuracy after iterative training, underscoring the potential for optimization through domain-specific training. |
Marchi et al, 202470 | 68 hypothetical clinical cases covering various head and neck cancer stages and tumour sites, based on scenarios from the NCCN Guidelines Version 2.2024 | ChatGPT-3.5 | | | ChatGPT showed high sensitivity and accuracy in line with NCCN Guidelines across tumour sites and stages, though minor inaccuracies appeared in primary treatment recommendations. While promising as a cancer care tool, challenges remain in handling complex, patient-specific decisions. |
Gu et al, 202471 | 160 fictitious liver MRI reports created by 3 radiologists and 72 de-identified real liver MRI reports from patients at Samsung Medical Center, Seoul | GPT-4 (version gpt-4-0314) | | | GPT-4 performed slightly worse on the external test owing to the complexity of real reports, with higher error rates (4.5% vs 1.8% internal). Misinterpretation and index-lesion selection errors were identified. Despite this, GPT-4 shows strong potential for automating radiology feature extraction, though improvements are needed in handling complex cases. |
Lee et al, 202372 | 84 thyroid cancer surgical pathology reports from patients who underwent thyroid surgery between 2010 and 2022 at the Icahn School of Medicine at Mount Sinai | FastChat-T5 (3B-parameter LLM) | | Concordance rates between the LLM and human reviewers were 88.86% with Reviewer 1 (SD: 7.02%) and 89.56% with Reviewer 2 (SD: 7.20%). The LLM processed all reports in 19.56 min, compared to 206.9 min for Reviewer 1 and 124.04 min for Reviewer 2. | The LLM achieved 100% concordance for simpler tasks like lymphatic invasion and tumour location but dropped to 75% for complex tasks like cervical lymph node presence. It reduced review time significantly, though further prompt engineering is needed for complex extractions. |
Kuşcu et al, 202373 | 154 head and neck cancer-related questions compiled from various sources, including professional institutions (eg, American Head and Neck Society, National Cancer Institute), patient support groups, and social media | ChatGPT Plus, based on GPT-4 (March 2023 version) | | ChatGPT delivered “comprehensive/correct” answers for 86.4% of the questions, “incomplete/partially correct” for 11%, and “mixed” (both accurate and inaccurate) for 2.6%. No “completely inaccurate/irrelevant” answers were reported. | ChatGPT achieved 100% accuracy in cancer prevention responses and 92.6% for diagnostic questions. It also demonstrated strong reproducibility, with 94.1% consistency across repeated queries. While ChatGPT shows promise as a patient education tool and for clinical decision support, further validation and refinement are needed for medical applications. |
Gibson et al, 202474 | 8 commonly asked prostate cancer questions, derived through literature review and Google Trends | ChatGPT-4 | | The PEMAT-AI understandability score was 79.44% (SD: 10.44%), and DISCERN-AI rated the responses as “good” with a mean score of 13.88 (SD: 0.93). Readability metrics showed a Flesch Reading Ease score of 45.97 and a Gunning Fog Index of 14.55, indicating an 11th-grade reading level. The NLAT-AI assessment gave mean scores above 3.0 for accuracy, safety, appropriateness, actionability, and effectiveness, indicating general reliability in ChatGPT’s responses. | ChatGPT-4’s outputs aligned well with current prostate cancer guidelines and literature, offering higher quality than static web pages. However, limitations included readability challenges and minor hallucinations (2 incorrect references out of 30). The study concluded that while ChatGPT-4 could enhance patient education, improvements in clarity and global applicability are needed. |
Huang et al, 202475 | Two datasets: 78 valid lung cancer pathology reports from the CDSA for training and 774 valid pathology reports from TCGA for testing, after excluding invalid or duplicate reports | ChatGPT-3.5-turbo-16k, GPT-4-turbo | | ChatGPT achieved 87% accuracy for pT, 91% for pN, 76% for overall tumour stage, and 99% for histological diagnosis, with an overall accuracy of 89%. | ChatGPT-3.5-turbo outperformed NER and keyword search algorithms, which had accuracies of 76% and 51%, respectively. A comparison with GPT-4-turbo showed a 5% performance improvement, though GPT-4-turbo was more expensive. The study also highlighted the challenge of “hallucination” in ChatGPT, especially with irregular or incomplete pathology reports. |
Huang et al, 202376 | The 38th ACR radiation oncology in-training examination (TXIT) with 300 multiple-choice questions and 15 complex clinical cases from the 2022 Red Journal Gray Zone collection | ChatGPT-3.5 and ChatGPT-4 | | | ChatGPT-4 consistently outperformed ChatGPT-3.5 in both the TXIT examination and the clinical case evaluations. For complex Gray Zone cases, ChatGPT-4 offered novel treatment suggestions in 80% of cases, which human experts had not considered. However, 13.3% of its recommendations included hallucinations (plausible but incorrect responses), emphasizing the need for content verification in clinical settings. |
Dennstädt et al, 202377 | 70 radiation oncology multiple-choice questions (clinical, physics, biology) and 25 OE clinical questions, reviewed by 6 radiation oncologists | GPT-3.5-turbo | | | ChatGPT performed reasonably well in answering radiation oncology-related questions but struggled with more complex, domain-specific tasks like fractionation calculations. While ChatGPT can generate correct and useful responses, its performance is inconsistent, particularly in specialized medical areas, due to potential “hallucinations” in answers not grounded in solid evidence. |
Choi et al, 202378 | Data from 2931 breast cancer patients who underwent post-operative radiotherapy between 2020 and 2022 at Seoul National University Hospital; clinical factors were extracted from surgical pathology and ultrasound reports | ChatGPT (GPT-3.5-turbo) | | The LLM method achieved an overall accuracy of 87.7%, with factors like lymphovascular invasion reaching 98.2% accuracy, while neoadjuvant chemotherapy status had lower accuracy at 47.6%. Prompt development took 3.5 h, with 15 min for execution, costing US$95.4 including API fees. | The LLM method was significantly more time- and cost-efficient than both the full manual and LLM-assisted manual methods. The full manual method took 122.6 h and cost US$909.3, while the LLM method required just 4 h and US$95.4 to complete the same task for 2931 patients. |
Rydzewski et al, 202479 | 2044 oncology multiple-choice questions from American College of Radiology examinations (2013-2021) and a separate validation set of 50 expert-created questions to prevent data leakage | GPT-3.5, GPT-4, Claude-v1, PaLM 2, LLaMA 1 | | GPT-4 achieved the highest accuracy at 68.7% across 3 replicates, outperforming other models. LLaMA 7B, with 25.6% accuracy, performed only slightly better than random guessing. GPT-4 was the only model to surpass the 50th percentile compared to human benchmarks, while other models lagged significantly. | Model performance varied significantly, with GPT-4 and Claude-v1 outperforming others. Accuracy was lower for female-predominant cancers (eg, breast and gynecologic) compared to other cancer types. GPT-4 and Claude-v1 achieved 81.1% and 81.7% accuracy, respectively, when combining high confidence and consistent responses. GPT-4 Turbo and Gemini 1.0 Ultra excelled in the novel validation set, showcasing improvements in newer models. |
Lee et al, 202472 | The dataset, totaling 1.17 million tokens, was created by integrating prostate cancer guidelines from sources such as the Korean Prostate Society, NCCN, ASCO, and EAU | ChatGPT 3.5 | | The AI-guide bot’s performance was evaluated using Likert scales in 3 categories: comprehensibility, content accuracy, and readability, with a total average score of 90.98 ± 4.02. Comprehensibility scored 28.28 ± 0.38, accuracy 34.17 ± 2.91, and readability 28.53 ± 1.24. | Compared to ChatGPT, the AI-guide bot demonstrated superior performance in comprehensibility and readability. In evaluations by non-medical experts, the AI-guide bot scored significantly higher than ChatGPT, with P-values <0.0001 in both categories. This indicates that the AI-guide bot provided clearer and more accurate medical information, while ChatGPT, drawing from a broader dataset, was less specialized. |
Mou et al, 202480 | Pathology reports from breast cancer patients at the University Hospital Aachen | GPT-4, Mixtral-8x7B | | Across 27 examinations, GPT-4 achieved near-human correctness (0.95) and higher completeness (0.97). Mixtral-8x7B lagged in both (correctness: 0.90, completeness: 0.95), especially on complex features. | GPT-4 outperformed Mixtral-8x7B in both correctness and completeness, particularly in identifying complex features like localization and ICD-10 diagnosis. However, GPT-4 raised concerns about privacy and regulatory compliance, making open-source models like Mixtral more suitable for privacy-sensitive environments. The authors recommend using GPT-4 for performance-critical tasks and Mixtral for privacy-focused scenarios. |
This table summarizes the varying approaches to using different chatbots, highlighting the data types and sources employed, each study’s objectives and prompt formulations, and the reported efficacy and accuracy of the models. It also provides comparative context, benchmarking against other models or human experts, for a holistic view of the advances and challenges in applying chatbots to cancer research. Illustrative code sketches of the structured-extraction, evaluation, and readability-scoring workflows that recur across these studies follow the abbreviation list.
Abbreviations: LLM = large language model; GPT = generative pretrained transformer; ACR = American College of Radiology; OE = open-ended; SATA = select all that apply; MDT = multidisciplinary tumour board; TNM = tumour, node, metastasis; HCC = hepatocellular carcinoma; NCCN = National Comprehensive Cancer Network; TCGA = The Cancer Genome Atlas; UCL = University College Hospitals; UCSF = University of California, San Francisco; OCR = optical character recognition; LSTM = long short-term memory; RCC = renal cell carcinoma; HNC = head and neck cancer; CDSA = Cancer Digital Slide Archive; NER = named entity recognition; HER2 = human epidermal growth factor receptor 2; Ki67 = a marker of cell proliferation; CDC = Centers for Disease Control and Prevention; LI-RADS = Liver Imaging Reporting and Data System; AJCC = American Joint Committee on Cancer; PEMAT-AI = Patient Education Materials Assessment Tool for AI; DISCERN-AI = a tool to help healthcare consumers and practitioners evaluate the quality of healthcare treatment information; NLAT-AI = Natural Language Assessment Tool for AI; TXIT = in-training examination; CNS = central nervous system; ASCO = American Society of Clinical Oncology; EAU = European Association of Urology.
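Several of the tabulated studies (eg, Nakamura et al, Truhn et al, Choi et al, and Huang et al) extract structured fields such as TNM categories from free-text reports by prompting a chat-completion API. The snippet below is a rough, non-authoritative sketch of that general workflow, not the prompts, models, or settings used in any cited study; it assumes the OpenAI Python client (v1.x), and the model name, prompt wording, and output keys are illustrative.

```python
# Minimal sketch of LLM-based structured extraction from a free-text report.
# Assumptions: the `openai` Python package (v1.x) is installed with an API key configured;
# the model name, prompt wording, and output keys are illustrative only and do not
# reproduce the protocol of any study in the table.
import json

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You extract cancer staging information from radiology reports. "
    "Respond with only a JSON object containing the keys 'T', 'N', and 'M'. "
    "If a category cannot be determined from the report, use null."
)


def extract_tnm(report_text: str, model: str = "gpt-4") -> dict:
    """Ask the model to map a free-text report to TNM categories."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output is preferable for extraction
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": report_text},
        ],
    )
    # In practice the reply should be validated; models occasionally return
    # extra prose or malformed JSON ("hallucination" risk noted in the table).
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    demo_report = (
        "CT chest: spiculated mass in the right upper lobe measuring 4.2 cm. "
        "Enlarged ipsilateral mediastinal lymph nodes. No distant metastases identified."
    )
    print(extract_tnm(demo_report))
```

Setting a low temperature and constraining the output format mirrors the kind of prompt optimization several of the studies report as improving accuracy, but the exact effect depends on the model and report type.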
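Studies such as Sushil et al, Lee et al (2023), and Choi et al score model outputs against manual annotations using concordance or macro-averaged F1. The snippet below is a generic sketch of that kind of evaluation, assuming predictions and gold labels are available as parallel lists; the values are illustrative and it is not the evaluation code of any cited study.

```python
# Generic sketch of scoring LLM extractions against manual annotations.
# Assumptions: scikit-learn is installed, and predictions and gold labels are parallel
# lists of categorical values; the example values are illustrative, not study data.
from sklearn.metrics import accuracy_score, f1_score

gold_labels = ["T2", "T1", "T3", "T2", "T4", "T1"]  # manual (reference) annotations
llm_labels = ["T2", "T1", "T2", "T2", "T4", "T1"]   # model outputs for the same reports

concordance = accuracy_score(gold_labels, llm_labels)           # fraction of exact matches
macro_f1 = f1_score(gold_labels, llm_labels, average="macro")   # unweighted mean F1 over classes

print(f"Concordance: {concordance:.1%}, macro F1: {macro_f1:.2f}")
```

Macro averaging weights every class equally, which is why it is the metric of choice when labels are highly imbalanced (eg, the margin-status task highlighted by Sushil et al).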
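Gibson et al report Flesch Reading Ease and Gunning Fog scores for chatbot answers. These are standard published formulas, so a short worked example may help readers interpret the reported values; the word, sentence, syllable, and complex-word counts below are hypothetical, not taken from any study.

```python
# Worked example of the readability formulas referenced for Gibson et al.
# (PEMAT-AI and DISCERN-AI scores come from rater instruments and are not computed here.)
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog Index: approximate years of education needed to understand the text."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


if __name__ == "__main__":
    # Hypothetical counts for a chatbot answer to a prostate cancer question.
    words, sentences, syllables, complex_words = 220, 11, 370, 32
    print(f"Flesch Reading Ease: {flesch_reading_ease(words, sentences, syllables):.1f}")
    print(f"Gunning Fog Index:   {gunning_fog(words, sentences, complex_words):.1f}")
```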
Author and year . | Data source . | Chatbot with version . | Study objective and prompt formulation . | Model efficacy and accuracy . | Comparative performance and contextual evaluation . |
---|---|---|---|---|---|
Lyu et al, 202357 | Radiology reports: 62 chest CT lung cancer scans, 76 brain MRI metastases scans from the Atrium Health Wake Forest Baptist clinical database | GPT-3 and GPT-4 | Objective: To translate radiology reports into plain language for education of patients and healthcare providers. Prompts: translate the report into plain language, provide patient suggestions, provide healthcare provider suggestions. | Average score: 4.268/5. Instances of missing information: 0.080/report. Incorrect information: 0.065/report. | ChatGPT (original prompt): 55.2% accuracy. ChatGPT (optimized prompt): 77.2%. GPT-4 (original prompt): 73.6%. GPT-4 (optimized prompt): 96.8%. Human verification: Two radiologists’ evaluations included focusing on completeness, correctness, and overall score. |
Holmes et al, 202358 | Radiation oncology physics 100-question multiple-choice examination developed by an experienced medical physicist. | ChatGPT (GPT-3.5, GPT-4), Bard (LaMDA), BLOOMZ | Objective: Evaluate LLMs in answering specialized radiation oncology physics questions. Prompts: ChatGPT (GPT-4) was specifically tested with 2 approaches: explaining before answering and a novel approach to evaluate deductive reasoning by altering answer choices. Performance was compared individually and in a majority vote analysis. | ChatGPT GPT-4: Achieved a 67% accuracy rate in question responses with a stable 14% error rate in each trial. Displayed the highest accuracy among tested models, particularly effective when prompted for explanations before responding. Consistently high performance and notable deductive reasoning skills observed across multiple trials. | ChatGPT surpassed other LLMs as well as medical physicists. Yet, in a majority vote scenario, a collective of medical physicists demonstrated superior performance compared to ChatGPT GPT-4. |
Rao et al, 202359 | Breast cancer screening and breast pain cases (ACR Appropriateness Criteria); Data size not specified | ChatGPT (GPT-3.5 and GPT-4) |
|
| ChatGPT-4 significantly improved over ChatGPT-3.5 in decision-making accuracy for both clinical scenarios. |
Sorin et al, 202360 | 10 consecutive early breast cancer cases from MDT discussions, January 2023, at clinic. | ChatGPT 3.5 |
| ChatGPT’s recommendations achieved a 16.05% alignment with the MDT, scoring an average of 64.2 out of 400 with a congruence range from 0 to 400. | ChatGPT predominantly offered general treatment modalities and accurately identified risk factors for hereditary breast cancer. However, it sometimes provided incorrect therapy recommendations. Its responses were benchmarked against the MDT recommendations to calculate a clinical score of agreement for determining the level of concordance. |
Grünebaum et al, 202361 | 14 questions about obstetrics and gynaecology, conceived by 4 physicians | ChatGPT 3.5 |
| No numerical score; qualitative comments about ChatGPT’s answers, evaluating the accuracy and relevance of responses. | No direct performance comparison with other models or human experts. ChatGPT’s responses were nuanced and informed but showed potential limitations due to outdated data. |
Yeo et al, 202362 | 164 questions about cirrhosis and HCC | ChatGPT |
| High accuracy in basic knowledge, lifestyle, and treatment. 76.9% of questions answered correctly. However, it failed to specify decision-making cut-offs and treatment durations. | ChatGPT lacked knowledge of regional guidelines, such as HCC screening criteria, compared to physicians or trainees. |
Zhu et al, 202363 | 22 prostate cancer questions based on CDC and UpToDate guidelines; clinical experience of authors | ChatGPT-3.5, ChatGPT 4, and other LLMs |
| Most LLMs achieved over 90% accuracy, except NeevaAI and Chatsonic. The free version of ChatGPT slightly outperformed the paid version. LLMs were generally comprehensive and readable. | No direct comparison with human experts, but ChatGPT (with slightly better performance by ChatGPT 3.5) showed the highest accuracy among the LLMs |
Sorin et al, 202364 | Clinical information of 10 consecutive patients from a breast tumour board | ChatGPT-3.5 |
| 70% of ChatGPT’s recommendations aligned with tumour board decisions. Moderate to high agreement in grading scores. Grading scores for summarization, recommendation, and explanation varied, with mean scores indicating moderate to high agreement. | No direct comparison with other models or human experts, but ChatGPT’s recommendations aligned closely with those of the tumour board in a majority of cases. |
Chen et al, 202365 | 104 prompts on breast, prostate, and lung cancer based on NCCN guidelines | ChatGPT (gpt-3.5-turbo-0301) |
| ChatGPT provided at least 1 NCCN-concordant recommendation for 102 out of 104 prompts (98%). However, 34.3% of these prompts also included at least partially non-concordant recommendations. The responses varied based on prompt type. | No direct comparison with other models or human experts, but the study highlighted limitations in ChatGPT’s ability to provide consistently reliable and robust cancer treatment recommendations. |
Nakamura et al, 202366 | MedTxt-RR-JA dataset, with 135 de-identified CT radiology reports | GPT-3.5 Turbo, GPT-4 |
| GPT-4 outperformed GPT-3.5 Turbo in TNM staging accuracy, with GPT-4 scoring 52.2% vs 37.8% for the T category, 78.9% vs 68.9% for the N category, and 86.7% vs 67.8% for the M category. | GPT-4 outperformed GPT-3.5 Turbo, with improvements boosted by including the TNM rule. However, struggled with numerical reasoning, particularly in cases where tumour size determined the T category. |
Truhn et al, 202467 | Two sets of pathology reports: 100 colorectal cancer reports from TCGA. 1882 neuropathology reports of adult-type diffuse gliomas from the UCL | GPT-4 |
| GPT-4 demonstrated high accuracy in extracting data from colorectal cancer reports, achieving 99% accuracy for T-stage, 95% for N-stage, 94% for M-stage, and 98-99% for lymph node data. In neuropathology reports, it also performed exceptionally well, accurately extracting key variables such as the Ki-67 labeling index and ATRX expression with near-perfect precision. | GPT-4 demonstrated high accuracy compared to manual data extraction, significantly reducing time and costs. However, limitations arose with low-quality scans and handwritten annotations, leading to errors in the OCR step. |
Sushil et al, 202468 | 769 breast cancer pathology reports were retrieved from the UCSF clinical data warehouse, dated between January 1, 2012, and March 31, 2021 | GPT-4, GPT-3.5, Starling-7B-beta, and ClinicalCamel-70B |
| GPT-4 achieved the highest average macro F1 score of 0.86 across all tasks, surpassing the best supervised model (LSTM with attention), which scored 0.75. GPT-3.5 and other open-source models performed significantly worse, with GPT-3.5 scoring 0.55, Starling 0.36, and ClinicalCamel 0.34. | GPT-4 excelled in zero-shot setups, particularly in tasks with high label imbalance, like margin status inference. However, for tasks with sufficient training data, supervised models like LSTM performed comparably. Open-source models, including Starling and ClinicalCamel, struggled to match GPT-4’s performance. |
Liang et al, 202469 | 80 RCC-related clinical questions, provided by urology experts | ChatGPT-3.5 and ChatGPT-4.0, fine-tuned GPT-3.5 Turbo |
| ChatGPT-4.0 outperformed ChatGPT-3.5 with 77.5% accuracy compared to 67.08%. After iterative optimization, the fine-tuned GPT-3.5 Turbo model achieved 93.75% accuracy. | ChatGPT-4.0 showed a statistically significant improvement over ChatGPT-3.5 (P < 0.05) in answering clinical questions, though both exhibited occasional response inconsistencies. The fine-tuned model resolved these issues, achieving 100% accuracy after iterative training, underscoring the potential for optimization through domain-specific training. |
Marchi et al, 202470 | 68 hypothetical clinical cases covering various head and neck cancer stages and tumour sites, based on scenarios from the NCCN Guidelines Version 2.2024 | ChatGPT-3.5 |
|
| ChatGPT showed high sensitivity and accuracy in line with NCCN Guidelines across tumour sites and stages, though minor inaccuracies appeared in primary treatment. While promising as a cancer care tool, challenges remain in handling complex, patient-specific decisions. |
Gu et al, 202471 | The study used 160 fictitious liver MRI reports created by 3 radiologists and 72 de-identified real liver MRI reports from patients at Samsung Medical Center, Seoul. | GPT-4 (version gpt-4-0314) |
|
| GPT-4 performed slightly lower on external tests due to real report complexity, with higher error rates (4.5% vs 1.8% internal). Misinterpretation and index lesion selection errors were identified. Despite this, GPT-4 shows strong potential for automating radiology feature extraction, though improvements are needed in handling complex cases. |
Lee et al, 202372 | 84 thyroid cancer surgical pathology reports from patients who underwent thyroid surgery between 2010 and 2022 at the Icahn School of Medicine at Mount Sinai. | FastChat-T5 (3B-parameter LLM) |
| Concordance rates between the LLM and human reviewers were 88.86% with Reviewer 1 (SD: 7.02%) and 89.56% with Reviewer 2 (SD: 7.20%). The LLM processed all reports in 19.56 min, compared to 206.9 min for Reviewer 1 and 124.04 min for Reviewer 2. | The LLM achieved 100% concordance for simpler tasks like lymphatic invasion and tumour location but dropped to 75% for complex tasks like cervical lymph node presence. It reduced review time significantly, though further prompt engineering is needed for complex extractions. |
Kuşcu et al, 202373 | 154 head and neck cancer-related questions were compiled from various sources, including professional institutions (eg, American Head and Neck Society, National Cancer Institute), patient support groups, and social media. | ChatGPT Plus, based on GPT-4 (March 2023 version) |
| ChatGPT delivered “comprehensive/correct” answers for 86.4% of the questions, “incomplete/partially correct” for 11%, and “mixed” (both accurate and inaccurate) for 2.6%. No “completely inaccurate/irrelevant” answers were reported. | ChatGPT achieved 100% accuracy in cancer prevention responses and 92.6% for diagnostic questions. It also demonstrated strong reproducibility, with 94.1% consistency across repeated queries. While ChatGPT shows promise as a patient education tool and for clinical decision support, further validation and refinement are needed for medical applications. |
Gibson et al, 202474 | The dataset included 8 commonly asked prostate cancer questions, derived through literature review and Google Trends. | ChatGPT-4 |
| The PEMAT-AI understandability score was 79.44% (SD: 10.44%), and DISCERN-AI rated the responses as “good” with a mean score of 13.88 (SD: 0.93). Readability algorithm, Flesch Reading Ease score of 45.97, and a Gunning Fog Index of 14.55, indicating an 11th-grade reading level. The NLAT-AI assessment gave mean scores above 3.0 for accuracy, safety, appropriateness, actionability, and effectiveness, indicating general reliability in ChatGPT’s responses. | ChatGPT-4’s outputs aligned well with current prostate cancer guidelines and literature, offering higher quality than static web pages. However, limitations included readability challenges and minor hallucinations (2 incorrect references out of 30). The study concluded that while ChatGPT-4 could enhance patient education, improvements in clarity and global applicability are needed. |
Huang et al, 202475 | Data was sourced from 2 main datasets: 78 valid lung cancer pathology reports from the CDSA for training. 774 valid pathology reports from TCGA for testing, after excluding invalid or duplicate reports. | ChatGPT-3.5-turbo-16k, GPT-4-turbo |
| ChatGPT achieved 87% accuracy for pT, 91% for pN, 76% for overall tumour stage, and 99% for histological diagnosis, with an overall accuracy of 89%. | ChatGPT-3.5-turbo outperformed NER and keyword search algorithms, which had accuracies of 76% and 51%, respectively. A comparison with GPT-4-turbo showed a 5% performance improvement, though GPT-4-turbo was more expensive. The study also highlighted the challenge of “hallucination” in ChatGPT, especially with irregular or incomplete pathology reports. |
Huang et al, 202376 | The data included the 38th ACR radiation oncology in-training examination (TXIT) with 300 multiple-choice questions and 15 complex clinical cases from the 2022 Red Journal Gray Zone collection. | ChatGPT-3.5 and ChatGPT-4 |
|
| Compared to ChatGPT-3.5, ChatGPT-4 consistently outperformed in both the TXIT examination and clinical case evaluations. For complex Gray Zone cases, ChatGPT-4 offered novel treatment suggestions in 80% of cases, which human experts had not considered. However, 13.3% of its recommendations included hallucinations (plausible but incorrect responses), emphasizing the need for content verification in clinical settings. |
Dennstädt et al, 202377 | 70 radiation oncology multiple-choice questions (clinical, physics, biology) and 25 OE clinical questions, reviewed by 6 radiation oncologists. | GPT-3.5-turbo |
|
| ChatGPT performed reasonably well in answering radiation oncology-related questions but struggled with more complex, domain-specific tasks like fractionation calculations. While ChatGPT can generate correct and useful responses, its performance is inconsistent, particularly in specialized medical areas, due to potential “hallucinations” in answers not grounded in solid evidence. |
Choi et al, 202378 | Data from 2931 breast cancer patients were collected who underwent post-operative radiotherapy between 2020 and 2022 at Seoul National University Hospital. Clinical factors were extracted from surgical pathology and ultrasound reports. | ChatGPT (GPT-3.5-turbo) |
| The LLM method achieved an overall accuracy of 87.7%, with factors like lymphovascular invasion reaching 98.2% accuracy, while neoadjuvant chemotherapy status had lower accuracy at 47.6%. Prompt development took 3.5 h, with 15 min for execution, costing US$95.4, including API fees. | LLM was significantly more time- and cost-efficient than both the full manual and LLM-assisted manual methods. The full manual method took 122.6 h and cost US$909.3, while the LLM method required just 4 h and US$95.4 to complete the same task for 2931 patients. |
Rydzewski et al, 202479 | 2044 oncology multiple-choice questions from American College of Radiology examinations (2013-2021) and a separate validation set of 50 expert-created questions to prevent data leakage. | GPT-3.5, GPT-4, Claude-v1, PaLM 2, LLaMA 1 |
| GPT-4 achieved the highest accuracy at 68.7% across 3 replicates, outperforming other models. LLaMA 7B, with 25.6% accuracy, performed only slightly better than random guessing. GPT-4 was the only model to surpass the 50th percentile compared to human benchmarks, while other models lagged significantly. | Model performance varied significantly, with GPT-4 and Claude-v1 outperforming others. Accuracy was lower for female-predominant cancers (eg, breast and gynecologic) compared to other cancer types. GPT-4 and Claude-v1 achieved 81.1% and 81.7% accuracy, respectively, when combining high confidence and consistent responses. GPT-4 Turbo and Gemini 1.0 Ultra excelled in the novel validation set, showcasing improvements in newer models. |
Lee et al, 202472 | The dataset, totaling 1.17 million tokens, was created by integrating prostate cancer guidelines from sources such as the Korean Prostate Society, NCCN, ASCO, and EAU. | ChatGPT 3.5 |
| The AI-guide bot’s performance was evaluated using Likert scales in 3 categories: comprehensibility, content accuracy, and readability, with a total average score of 90.98 ± 4.02. Comprehensibility scored 28.28 ± 0.38, accuracy 34.17 ± 2.91, and readability 28.53 ± 1.24. | Compared to ChatGPT, the AI-guide bot demonstrated superior performance in comprehensibility and readability. In evaluations by non-medical experts, the AI-guide bot scored significantly higher than ChatGPT, with P-values <0.0001 in both categories. This indicates that the AI-guide bot provided clearer and more accurate medical information, while ChatGPT, drawing from a broader dataset, was less specialized. |
Mou et al, 202480 | Pathology reports from breast cancer patients at the University Hospital Aachen | GPT-4, Mixtral-8 × 7B |
| Across 27 examinations, GPT-4 achieved near-human correctness (0.95) and higher completeness (0.97). Mixtral-8 × 7B lagged in both (correctness: 0.90, completeness: 0.95), especially on complex features. | GPT-4 outperformed Mixtral-8 × 7B in both correctness and completeness, particularly in identifying complex features like localization and ICD-10 diagnosis. However, GPT-4 raised concerns about privacy and regulatory compliance, making open-source models like Mixtral more suitable for privacy-sensitive environments. The authors recommend using GPT-4 for performance-critical tasks and Mixtral for privacy-focused scenarios. |
Author and year . | Data source . | Chatbot with version . | Study objective and prompt formulation . | Model efficacy and accuracy . | Comparative performance and contextual evaluation . |
---|---|---|---|---|---|
Lyu et al, 202357 | Radiology reports: 62 chest CT lung cancer scans, 76 brain MRI metastases scans from the Atrium Health Wake Forest Baptist clinical database | GPT-3 and GPT-4 | Objective: To translate radiology reports into plain language for education of patients and healthcare providers. Prompts: translate the report into plain language, provide patient suggestions, provide healthcare provider suggestions. | Average score: 4.268/5. Instances of missing information: 0.080/report. Incorrect information: 0.065/report. | ChatGPT (original prompt): 55.2% accuracy. ChatGPT (optimized prompt): 77.2%. GPT-4 (original prompt): 73.6%. GPT-4 (optimized prompt): 96.8%. Human verification: Two radiologists’ evaluations included focusing on completeness, correctness, and overall score. |
Holmes et al, 202358 | Radiation oncology physics 100-question multiple-choice examination developed by an experienced medical physicist. | ChatGPT (GPT-3.5, GPT-4), Bard (LaMDA), BLOOMZ | Objective: Evaluate LLMs in answering specialized radiation oncology physics questions. Prompts: ChatGPT (GPT-4) was specifically tested with 2 approaches: explaining before answering and a novel approach to evaluate deductive reasoning by altering answer choices. Performance was compared individually and in a majority vote analysis. | ChatGPT GPT-4: Achieved a 67% accuracy rate in question responses with a stable 14% error rate in each trial. Displayed the highest accuracy among tested models, particularly effective when prompted for explanations before responding. Consistently high performance and notable deductive reasoning skills observed across multiple trials. | ChatGPT surpassed other LLMs as well as medical physicists. Yet, in a majority vote scenario, a collective of medical physicists demonstrated superior performance compared to ChatGPT GPT-4. |
Rao et al, 202359 | Breast cancer screening and breast pain cases (ACR Appropriateness Criteria); Data size not specified | ChatGPT (GPT-3.5 and GPT-4) |
|
| ChatGPT-4 significantly improved over ChatGPT-3.5 in decision-making accuracy for both clinical scenarios. |
Sorin et al, 202360 | 10 consecutive early breast cancer cases from MDT discussions, January 2023, at clinic. | ChatGPT 3.5 |
| ChatGPT’s recommendations achieved a 16.05% alignment with the MDT, scoring an average of 64.2 out of 400 with a congruence range from 0 to 400. | ChatGPT predominantly offered general treatment modalities and accurately identified risk factors for hereditary breast cancer. However, it sometimes provided incorrect therapy recommendations. Its responses were benchmarked against the MDT recommendations to calculate a clinical score of agreement for determining the level of concordance. |
Grünebaum et al, 202361 | 14 questions about obstetrics and gynaecology, conceived by 4 physicians | ChatGPT 3.5 |
| No numerical score; qualitative comments about ChatGPT’s answers, evaluating the accuracy and relevance of responses. | No direct performance comparison with other models or human experts. ChatGPT’s responses were nuanced and informed but showed potential limitations due to outdated data. |
Yeo et al, 202362 | 164 questions about cirrhosis and HCC | ChatGPT | | High accuracy in basic knowledge, lifestyle, and treatment. 76.9% of questions answered correctly. However, it failed to specify decision-making cut-offs and treatment durations. | ChatGPT lacked knowledge of regional guidelines, such as HCC screening criteria, compared to physicians or trainees. |
Zhu et al, 202363 | 22 prostate cancer questions based on CDC and UpToDate guidelines; clinical experience of authors | ChatGPT-3.5, ChatGPT-4, and other LLMs | | Most LLMs achieved over 90% accuracy, except NeevaAI and Chatsonic. The free version of ChatGPT slightly outperformed the paid version. LLMs were generally comprehensive and readable. | No direct comparison with human experts, but ChatGPT (with slightly better performance by ChatGPT 3.5) showed the highest accuracy among the LLMs. |
Sorin et al, 202364 | Clinical information of 10 consecutive patients from a breast tumour board | ChatGPT-3.5 | | 70% of ChatGPT’s recommendations aligned with tumour board decisions. Grading scores for summarization, recommendation, and explanation varied, with mean scores indicating moderate to high agreement. | No direct comparison with other models or human experts, but ChatGPT’s recommendations aligned closely with those of the tumour board in a majority of cases. |
Chen et al, 202365 | 104 prompts on breast, prostate, and lung cancer based on NCCN guidelines | ChatGPT (gpt-3.5-turbo-0301) | | ChatGPT provided at least 1 NCCN-concordant recommendation for 102 out of 104 prompts (98%). However, 34.3% of these prompts also included at least partially non-concordant recommendations. The responses varied based on prompt type. | No direct comparison with other models or human experts, but the study highlighted limitations in ChatGPT’s ability to provide consistently reliable and robust cancer treatment recommendations. |
Nakamura et al, 202366 | MedTxt-RR-JA dataset, with 135 de-identified CT radiology reports | GPT-3.5 Turbo, GPT-4 | | GPT-4 outperformed GPT-3.5 Turbo in TNM staging accuracy, with GPT-4 scoring 52.2% vs 37.8% for the T category, 78.9% vs 68.9% for the N category, and 86.7% vs 67.8% for the M category. | GPT-4 outperformed GPT-3.5 Turbo, with improvements boosted by including the TNM rule. However, it struggled with numerical reasoning, particularly in cases where tumour size determined the T category. |
Truhn et al, 202467 | Two sets of pathology reports: 100 colorectal cancer reports from TCGA and 1882 neuropathology reports of adult-type diffuse gliomas from UCL | GPT-4 | | GPT-4 demonstrated high accuracy in extracting data from colorectal cancer reports, achieving 99% accuracy for T-stage, 95% for N-stage, 94% for M-stage, and 98-99% for lymph node data. In neuropathology reports, it also performed exceptionally well, accurately extracting key variables such as the Ki-67 labeling index and ATRX expression with near-perfect precision. | GPT-4 demonstrated high accuracy compared to manual data extraction, significantly reducing time and costs. However, limitations arose with low-quality scans and handwritten annotations, leading to errors in the OCR step. |
Sushil et al, 202468 | 769 breast cancer pathology reports were retrieved from the UCSF clinical data warehouse, dated between January 1, 2012, and March 31, 2021 | GPT-4, GPT-3.5, Starling-7B-beta, and ClinicalCamel-70B | | GPT-4 achieved the highest average macro F1 score of 0.86 across all tasks, surpassing the best supervised model (LSTM with attention), which scored 0.75. GPT-3.5 and other open-source models performed significantly worse, with GPT-3.5 scoring 0.55, Starling 0.36, and ClinicalCamel 0.34. | GPT-4 excelled in zero-shot setups, particularly in tasks with high label imbalance, like margin status inference. However, for tasks with sufficient training data, supervised models like LSTM performed comparably. Open-source models, including Starling and ClinicalCamel, struggled to match GPT-4’s performance. |
Liang et al, 202469 | 80 RCC-related clinical questions, provided by urology experts | ChatGPT-3.5, ChatGPT-4.0, and a fine-tuned GPT-3.5 Turbo | | ChatGPT-4.0 outperformed ChatGPT-3.5 with 77.5% accuracy compared to 67.08%. After iterative optimization, the fine-tuned GPT-3.5 Turbo model achieved 93.75% accuracy. | ChatGPT-4.0 showed a statistically significant improvement over ChatGPT-3.5 (P < 0.05) in answering clinical questions, though both exhibited occasional response inconsistencies. The fine-tuned model resolved these issues, achieving 100% accuracy after iterative training, underscoring the potential for optimization through domain-specific training. |
Marchi et al, 202470 | 68 hypothetical clinical cases covering various head and neck cancer stages and tumour sites, based on scenarios from the NCCN Guidelines Version 2.2024 | ChatGPT-3.5 | | | ChatGPT showed high sensitivity and accuracy in line with NCCN Guidelines across tumour sites and stages, though minor inaccuracies appeared in primary treatment. While promising as a cancer care tool, challenges remain in handling complex, patient-specific decisions. |
Gu et al, 202471 | The study used 160 fictitious liver MRI reports created by 3 radiologists and 72 de-identified real liver MRI reports from patients at Samsung Medical Center, Seoul. | GPT-4 (version gpt-4-0314) | | | GPT-4 performed slightly lower on external tests due to real report complexity, with higher error rates (4.5% vs 1.8% internal). Misinterpretation and index lesion selection errors were identified. Despite this, GPT-4 shows strong potential for automating radiology feature extraction, though improvements are needed in handling complex cases. |
Lee et al, 202372 | 84 thyroid cancer surgical pathology reports from patients who underwent thyroid surgery between 2010 and 2022 at the Icahn School of Medicine at Mount Sinai. | FastChat-T5 (3B-parameter LLM) | | Concordance rates between the LLM and human reviewers were 88.86% with Reviewer 1 (SD: 7.02%) and 89.56% with Reviewer 2 (SD: 7.20%). The LLM processed all reports in 19.56 min, compared to 206.9 min for Reviewer 1 and 124.04 min for Reviewer 2. | The LLM achieved 100% concordance for simpler tasks like lymphatic invasion and tumour location but dropped to 75% for complex tasks like cervical lymph node presence. It reduced review time significantly, though further prompt engineering is needed for complex extractions. |
Kuşcu et al, 202373 | 154 head and neck cancer-related questions were compiled from various sources, including professional institutions (eg, American Head and Neck Society, National Cancer Institute), patient support groups, and social media. | ChatGPT Plus, based on GPT-4 (March 2023 version) | | ChatGPT delivered “comprehensive/correct” answers for 86.4% of the questions, “incomplete/partially correct” for 11%, and “mixed” (both accurate and inaccurate) for 2.6%. No “completely inaccurate/irrelevant” answers were reported. | ChatGPT achieved 100% accuracy in cancer prevention responses and 92.6% for diagnostic questions. It also demonstrated strong reproducibility, with 94.1% consistency across repeated queries. While ChatGPT shows promise as a patient education tool and for clinical decision support, further validation and refinement are needed for medical applications. |
Gibson et al, 202474 | The dataset included 8 commonly asked prostate cancer questions, derived through literature review and Google Trends. | ChatGPT-4 | | The PEMAT-AI understandability score was 79.44% (SD: 10.44%), and DISCERN-AI rated the responses as “good” with a mean score of 13.88 (SD: 0.93). Readability analysis gave a Flesch Reading Ease score of 45.97 and a Gunning Fog Index of 14.55, indicating an 11th-grade reading level. The NLAT-AI assessment gave mean scores above 3.0 for accuracy, safety, appropriateness, actionability, and effectiveness, indicating general reliability in ChatGPT’s responses. | ChatGPT-4’s outputs aligned well with current prostate cancer guidelines and literature, offering higher quality than static web pages. However, limitations included readability challenges and minor hallucinations (2 incorrect references out of 30). The study concluded that while ChatGPT-4 could enhance patient education, improvements in clarity and global applicability are needed. |
Huang et al, 202475 | Data were sourced from 2 main datasets: 78 valid lung cancer pathology reports from the CDSA for training and 774 valid pathology reports from TCGA for testing, after excluding invalid or duplicate reports. | ChatGPT-3.5-turbo-16k, GPT-4-turbo | | ChatGPT achieved 87% accuracy for pT, 91% for pN, 76% for overall tumour stage, and 99% for histological diagnosis, with an overall accuracy of 89%. | ChatGPT-3.5-turbo outperformed NER and keyword search algorithms, which had accuracies of 76% and 51%, respectively. A comparison with GPT-4-turbo showed a 5% performance improvement, though GPT-4-turbo was more expensive. The study also highlighted the challenge of “hallucination” in ChatGPT, especially with irregular or incomplete pathology reports. |
Huang et al, 202376 | The data included the 38th ACR radiation oncology in-training examination (TXIT) with 300 multiple-choice questions and 15 complex clinical cases from the 2022 Red Journal Gray Zone collection. | ChatGPT-3.5 and ChatGPT-4 | | | ChatGPT-4 consistently outperformed ChatGPT-3.5 in both the TXIT examination and clinical case evaluations. For complex Gray Zone cases, ChatGPT-4 offered novel treatment suggestions in 80% of cases, which human experts had not considered. However, 13.3% of its recommendations included hallucinations (plausible but incorrect responses), emphasizing the need for content verification in clinical settings. |
Dennstädt et al, 202377 | 70 radiation oncology multiple-choice questions (clinical, physics, biology) and 25 OE clinical questions, reviewed by 6 radiation oncologists. | GPT-3.5-turbo | | | ChatGPT performed reasonably well in answering radiation oncology-related questions but struggled with more complex, domain-specific tasks like fractionation calculations. While ChatGPT can generate correct and useful responses, its performance is inconsistent, particularly in specialized medical areas, due to potential “hallucinations” in answers not grounded in solid evidence. |
Choi et al, 202378 | Data were collected from 2931 breast cancer patients who underwent post-operative radiotherapy between 2020 and 2022 at Seoul National University Hospital. Clinical factors were extracted from surgical pathology and ultrasound reports. | ChatGPT (GPT-3.5-turbo) | | The LLM method achieved an overall accuracy of 87.7%, with factors like lymphovascular invasion reaching 98.2% accuracy, while neoadjuvant chemotherapy status had lower accuracy at 47.6%. Prompt development took 3.5 h, with 15 min for execution, costing US$95.4, including API fees. | The LLM method was significantly more time- and cost-efficient than both the full manual and LLM-assisted manual methods. The full manual method took 122.6 h and cost US$909.3, while the LLM method required just 4 h and US$95.4 to complete the same task for 2931 patients. |
Rydzewski et al, 202479 | 2044 oncology multiple-choice questions from American College of Radiology examinations (2013-2021) and a separate validation set of 50 expert-created questions to prevent data leakage. | GPT-3.5, GPT-4, Claude-v1, PaLM 2, LLaMA 1 | | GPT-4 achieved the highest accuracy at 68.7% across 3 replicates, outperforming other models. LLaMA 7B, with 25.6% accuracy, performed only slightly better than random guessing. GPT-4 was the only model to surpass the 50th percentile compared to human benchmarks, while other models lagged significantly. | Model performance varied significantly, with GPT-4 and Claude-v1 outperforming others. Accuracy was lower for female-predominant cancers (eg, breast and gynecologic) compared to other cancer types. GPT-4 and Claude-v1 achieved 81.1% and 81.7% accuracy, respectively, when combining high confidence and consistent responses. GPT-4 Turbo and Gemini 1.0 Ultra excelled in the novel validation set, showcasing improvements in newer models. |
Lee et al, 202472 | The dataset, totaling 1.17 million tokens, was created by integrating prostate cancer guidelines from sources such as the Korean Prostate Society, NCCN, ASCO, and EAU. | ChatGPT 3.5 | | The AI-guide bot’s performance was evaluated using Likert scales in 3 categories: comprehensibility, content accuracy, and readability, with a total average score of 90.98 ± 4.02. Comprehensibility scored 28.28 ± 0.38, accuracy 34.17 ± 2.91, and readability 28.53 ± 1.24. | Compared to ChatGPT, the AI-guide bot demonstrated superior performance in comprehensibility and readability. In evaluations by non-medical experts, the AI-guide bot scored significantly higher than ChatGPT, with P-values <0.0001 in both categories. This indicates that the AI-guide bot provided clearer and more accurate medical information, while ChatGPT, drawing from a broader dataset, was less specialized. |
Mou et al, 202480 | Pathology reports from breast cancer patients at the University Hospital Aachen | GPT-4, Mixtral-8x7B | | Across 27 examinations, GPT-4 achieved near-human correctness (0.95) and higher completeness (0.97). Mixtral-8x7B lagged in both (correctness: 0.90, completeness: 0.95), especially on complex features. | GPT-4 outperformed Mixtral-8x7B in both correctness and completeness, particularly in identifying complex features like localization and ICD-10 diagnosis. However, GPT-4 raised concerns about privacy and regulatory compliance, making open-source models like Mixtral more suitable for privacy-sensitive environments. The authors recommend using GPT-4 for performance-critical tasks and Mixtral for privacy-focused scenarios. |
This table summarizes how different chatbots have been applied across studies, detailing the data types and sources employed, each study’s objectives and prompt formulations, and the reported efficacy and accuracy of the models. It also provides comparative context by benchmarking the chatbots against other models or human experts, offering an overall view of the advances and challenges in applying chatbots in cancer research.
Abbreviations: LLM = large language model; GPT = generative pretrained transformer; ACR = American College of Radiology; OE = open-ended; SATA = select all that apply; MDT = multidisciplinary team (tumour board); TNM = tumour, node, metastasis; HCC = hepatocellular carcinoma; NCCN = National Comprehensive Cancer Network; TCGA = The Cancer Genome Atlas; UCL = University College London; UCSF = University of California, San Francisco; OCR = optical character recognition; LSTM = long short-term memory; RCC = renal cell carcinoma; HNC = head and neck cancer; CDSA = Cancer Digital Slide Archive; NER = named entity recognition; HER2 = human epidermal growth factor receptor 2; Ki67 = a marker of cell proliferation; CDC = Centers for Disease Control and Prevention; LI-RADS = Liver Imaging Reporting and Data System; AJCC = American Joint Committee on Cancer; PEMAT-AI = Patient Education Materials Assessment Tool for AI; DISCERN-AI = a tool to help healthcare consumers and practitioners evaluate the quality of healthcare treatment information; NLAT-AI = Natural Language Assessment Tool for AI; TXIT = in-training examination; CNS = central nervous system; ASCO = American Society of Clinical Oncology; EAU = European Association of Urology.