Machine learning models to predict surgical case duration compared to current industry standards: scoping review

Christopher Spence, Owais A Shah, Anna Cebula, Keith Tucker, David Sochart, Deiary Kader, Vipin Asopa

BJS Open, Volume 7, Issue 6, December 2023, zrad113, https://doi.org/10.1093/bjsopen/zrad113
Abstract
Surgical waiting lists have risen dramatically across the UK as a result of the COVID-19 pandemic. The effective use of operating theatres by optimal scheduling could help mitigate this, but this requires accurate case duration predictions. Current standards for predicting the duration of surgery are inaccurate. Artificial intelligence (AI) offers the potential for greater accuracy in predicting surgical case duration. This study aimed to investigate whether there is evidence that AI is more accurate than current industry standards at predicting surgical case duration, with a secondary aim of analysing whether implementation of the models produced efficiency savings.
PubMed, Embase, and MEDLINE libraries were searched through to July 2023 to identify appropriate articles. PRISMA extension for scoping reviews and the Arksey and O’Malley framework were followed. Study quality was assessed using a modified version of the reporting guidelines for surgical AI papers by Farrow et al. Algorithm performance was reported using evaluation metrics.
The search identified 2593 articles: 14 were suitable for inclusion and 13 reported on the accuracy of AI algorithms against industry standards, with seven demonstrating a statistically significant improvement in prediction accuracy (P < 0.05). The larger studies demonstrated the superiority of neural networks over other machine learning techniques. Efficiency savings were identified in an RCT. Significant methodological limitations were identified across most studies.
The studies suggest that machine learning and deep learning models are more accurate than current industry standards at predicting the duration of surgery; however, further research is required to determine the best way to implement this technology.
Introduction
Waiting lists for NHS hospital treatment have risen, as they have globally, since the start of the COVID-19 pandemic, as a result of the postponement of elective surgical procedures1–3. In May 2023, 7.5 million patients were waiting for NHS hospital treatment, up from 4.5 million in 20204. The National Audit Office (NAO) predicts there could be up to 12 million patients on elective care waiting lists by March 20255.
NHS Improvement reported that optimal theatre utilization could deliver 291 327 further elective operations a year6, a 16.8 per cent increase on current levels, achieved by reducing or eliminating inefficiencies such as late starts, early finishes and delays between operations7. Additionally, knowledge of the likely duration of a procedure and its associated variance would yield further efficiency by improving theatre case scheduling and patient flow, and by increasing the probability that an operating list is completed on time8–10.
The current industry standards of predicting surgical case duration are based upon a surgeon’s estimate of duration or the mean length of the surgeon’s last ten cases11. Both are considered inaccurate because they are limited by a lack of consideration of patient, surgical, nursing, anaesthetic and system factors that may affect how long a surgery lasts9–13. Inaccurate predictions lead to both under- and overutilization of theatre time and, as a result, lead to greater costs, fewer patients receiving surgery, reduced patient satisfaction and worsening staff morale12. Therefore, being able to better predict surgical case duration will likely improve the efficiency of operating theatre (OT) utilization.
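To make this industry-standard baseline concrete, the following is a minimal sketch (not taken from any of the cited studies) of the 'mean of the surgeon's last ten cases' predictor, assuming a hypothetical pandas DataFrame with columns surgeon, case_date and duration_min:

```python
import pandas as pd

def last_ten_mean_baseline(cases: pd.DataFrame) -> pd.Series:
    # Predict each case's duration as the mean of the same surgeon's previous
    # ten cases; shift(1) keeps the current case out of its own prediction.
    # min_periods=1 is a simplification for a surgeon's first few cases.
    return (
        cases.sort_values("case_date")
        .groupby("surgeon")["duration_min"]
        .transform(lambda s: s.shift(1).rolling(window=10, min_periods=1).mean())
    )
```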
Many studies have used multivariate regression to show improved prediction of surgical duration compared to industry standards11–14; however, no single method has achieved widespread acceptance9,11,15–17.
The emergence of artificial intelligence (AI) has given rise to a number of studies investigating whether machine learning (ML) and deep learning (DL) algorithms can provide improved predictions compared to multivariate regression and industry-standard methodology11,16,18–25. ML algorithms work by extracting knowledge from tabulated data sets, processing them, adjusting their internal parameters (weights) and strengthening associations to increase the model’s accuracy26. ‘Learning’ refers to the incremental function optimization that occurs to the weights within the model as it is trained on the data27.
DL algorithms, a subset of ML algorithms, are those that are composed of an artificial neural network with three or more layers (Fig. 1)26. The networks imitate how the human brain functions, which allows these algorithms to ‘learn’ higher-level features of the data that were previously unattainable via traditional ML methods28. This ability means that DL algorithms usually outperform other ML techniques18.

Fig. 1 Simplified version of an artificial neural network, demonstrating the principles of connections and weights
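To illustrate the kind of model described above, here is a minimal sketch of a multilayer perceptron fitted to synthetic tabular data. It stands in for no particular study's model; the data, layer sizes and feature count are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))  # eight hypothetical preoperative variables
y = 60 + 10 * (X @ rng.normal(size=8)) + rng.normal(scale=15, size=1000)  # duration, min

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Two hidden layers of weighted connections; training incrementally adjusts
# the weights to reduce prediction error (the 'learning' referred to above).
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
model.fit(X_train, y_train)
print(f"test R2: {model.score(X_test, y_test):.2f}")
```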
The increasing availability of larger data sets containing more preoperative variables18, combined with ML/DL data analysis, could offer greater accuracy than traditional statistical techniques such as linear regression29. Large data sets combined with modern computing methods have already achieved major successes in other clinical scenarios, such as the detection of intracranial haemorrhages from head CT scan images and the development of clinical decision aids30,31.
There are many barriers to the implementation of ML in clinical practice. For example, poor data labelling and categorization, secondary to heterogeneous data sources and poorly designed data structures, hinder the progress of ML in healthcare32. Accurate and abundant data are required for these models to be trained on and to allow them to develop accurate predictions, and accurate data are not always available from electronic health records (EHR). Labelling medical data requires knowledge of the field of study, which is time-consuming and expensive when vast quantities of data are required32.

The aim of this scoping review is to examine the body of work on the utilization of ML/DL algorithms to predict how long a surgical case will last and, as a secondary aim, to establish whether there is evidence for improved efficiency using such methods.
Methods
The protocol was developed utilizing the PRISMA extension for scoping reviews (Table S1)33, and Arksey and O’Malley’s five-stage scoping review process34 with the adaptations from the Joanna Briggs Institute35. The protocol is available upon request.
Stage 1: identifying the research question
Using the population, intervention, comparator and outcomes of interest framework (Table 1), a broad research question was developed: How is AI being utilized to predict surgical case duration and is there a relative improvement in the accuracy of these AI-generated predictions?
Table 1 Inclusion, exclusion and population, intervention, comparison and outcome (PICO) criteria for this review

| Inclusion criteria | Exclusion criteria |
|---|---|
|  |  |

| PICO criteria | |
|---|---|
| Population | Patients undergoing an operation in any surgical speciality |
| Intervention/exposure | Use of an AI-based model to predict case-time duration |
| Control/comparator | Surgeon estimate/mean of last 10 cases used to predict case-time durations |
| Outcome(s) | |
| Primary | To analyse the data from different AI models to understand whether greater surgical case-time duration prediction is possible with AI models versus the current industry standards |
| Secondary | To establish whether there are efficiency benefits associated with the utilization of ML models in surgical block booking |
| Tertiary | To understand which models, and with which variables, provide the greatest improvement in case-time prediction |

ML, machine learning; ENT, ear, nose and throat; AI, artificial intelligence.
Stage 2: identifying relevant studies
A systematic search of the literature was performed on 15 November 2021 using both the Healthcare Database Advanced Search (HDAS), covering the MEDLINE and Embase databases, and the PubMed native search tool, from the start of each respective database to November 2021. The search was most recently updated on 28 July 2023. The search terms and Medical Subject Heading (MeSH) terms (Table S2) were developed in conjunction with a medical librarian, using both keywords and MeSH terms. A grey literature search was conducted by manually searching the reference lists of the included studies and by further searches through the Google search engine, Google Scholar, ClinicalTrials.gov and the Cochrane Central Register of Controlled Trials (CENTRAL).
Stage 3: study selection
Duplicate citations were removed initially; following this, both reviewers (C.S. and O.A.S.) screened the titles and abstracts independently using the inclusion and exclusion criteria outlined in Table 1. Full texts of articles that met the criteria were retrieved and reviewed by both C.S. and O.A.S. for inclusion in the study. Disputes were settled upon consultation with a senior author (V.A.) and resolved by group consensus.
Stage 4: charting the data
All studies that were agreed upon for the final assessment were included in a database. Information was extracted from the articles on study quality, study characteristics, AI model characteristics, the predictive capacity of models, as well as study limitations, conclusions, and recommendations. Data points were extracted and recorded on standardized forms using Microsoft Excel v14.0 (Microsoft Corp., WA, USA).
Stage 5: collating, summarizing and reporting the results
An assessment of the level of the evidence included was based on the Oxford Centre for Evidence-Based Medicine (OCEBM) criteria36. A methodological quality assessment was also performed. Because the included papers differ from traditional surgical research papers, the reviewers created their own methodological assessment tool based on the work of Farrow et al.37 (Table S3). Due to the heterogeneous nature of the data, a meta-analysis of the included studies was not possible; a narrative analysis of the different AI models and their capacity to predict surgical case duration was therefore produced.
All studies used an evaluation metric to assess the model's capacity to correctly predict surgical case duration when compared with test data. The evaluation metrics varied between papers but mostly comprised R2, mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), continuous ranked probability score (CRPS) and mean square error (MSE). One study (Bartek et al.18) also used percentage overage/underage/within threshold. For further details on these metrics see Table S4. The training–validation–test splits of the data sets identified within the included studies were reported both as ratios and as absolute numbers.
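For concreteness, a short sketch of these point-estimate metrics, assuming NumPy arrays of true and predicted durations in minutes (CRPS is omitted because it scores a full predictive distribution rather than a point estimate):

```python
import numpy as np

def evaluation_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    err = y_pred - y_true
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "MAE": np.mean(np.abs(err)),                  # mean absolute error
        "MSE": np.mean(err ** 2),                     # mean square error
        "RMSE": np.sqrt(np.mean(err ** 2)),           # root mean square error
        "MAPE": 100 * np.mean(np.abs(err) / y_true),  # durations assumed > 0 min
        "R2": 1 - ss_res / ss_tot,                    # coefficient of determination
    }
```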
Results
Search results and study inclusion
The systematic literature search described above yielded 2593 articles, of which 132 were duplicates; 2433 articles were excluded after title/abstract review and a further 18 after full-text review. The grey literature search and citation checking identified 11 further articles for possible inclusion, of which seven were excluded after full-text review. Fourteen articles therefore met the full eligibility criteria. The PRISMA diagram (Fig. 2) was created using the online application of Haddaway et al.38.

Fig. 2 PRISMA diagram demonstrating the process of study selection, from screening to inclusion and the grey literature search (created using the online tool of Haddaway et al.38)
Quality assessment (Supplementary Table S5)
Most of the included studies stated their study aims and declared conflicts of interest; however, Ng et al. did not disclose any conflicts of interest, and the second author of Tuwatananurak et al. is the CEO of the company that developed the ML algorithm16,25.
All papers included clear documentation as to the source of their data11,16,18–25,39–42. Only one study performed any external validation on their data set; however, Lai et al. did discuss their reasons for not performing external validation, and Abbas et al. internally validated using a national database19,20,40. All studies provided information on data extraction and pre-processing11,16,18–25,39–42. Three studies did not disclose their management of missing data22,25,39. Only one study did not provide clear documentation concerning its approach to model training, testing and validation with clearly labelled numbers of samples at each stage25. 'Ground truth' was outlined in most of the included studies, the exception being Abbas et al.40. One study did not provide clear information on the type of models used, beyond explaining that it used supervised learning techniques24,25. Only one study provided a power calculation and determined an appropriate sample size24.
Only ten of the studies provided clear documentation of the sample's demographics within their results11,19,20,22–24,39–42. All studies included in the paper demonstrated some method of model evaluation11,16,18–25,39–42. Thirteen studies provided an explanation of the model, with graphs/tables showing which variables had the greatest effect on the model11,16,18–24,39–42. Only one study did not discuss limitations16. Finally, one study did not discuss barriers to implementation and/or future work41.
Study characteristics
All studies included in the review were published in or after 201711,16,18–25,39–42. Eleven of the 14 studies were conducted in the USA11,16,18,19,22–25,39, with one each from Canada40, Colombia21 and Taiwan20 (Table S6). Most studies analysed surgeries from multiple specialities; however, six studies focused on individual specialities or surgeries11,22,39–42. The size of the original data sets varied from 500 up to 302 30022,40. Most studies sourced their data from an EHR11,16,18–25,39,41,42; Abbas et al. utilized the American College of Surgeons' national surgical quality improvement database40. Most studies were retrospective observational studies; only Strömblad et al. conducted a randomized controlled trial24.
AI characteristics
The majority of studies included in the review reported purely on how accurately an ML model predicted surgical case duration. One study directly assessed the efficiency benefits of such a system24, while a further two explored efficiency savings in their discussion. The number of preoperative variables varied from seven up to more than 150021,25 (Table 2). All studies that reported 'ground truth' used the EHR's recorded 'wheels in/wheels out' time for case duration11,16,18–25,39–42. Jiao et al.19 conducted the only study to perform external validation on their data set; Lai et al.20 discuss that a suitable set for external validation does not currently exist and that only internal validation is therefore appropriate.
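As an illustration only, deriving such a 'wheels in/wheels out' label from EHR timestamps might look like the following sketch; the column names are hypothetical, as none of the included studies published their extraction code:

```python
import pandas as pd

def case_duration_minutes(ehr: pd.DataFrame) -> pd.Series:
    # Ground-truth case duration: time between the patient entering and
    # leaving the operating room, expressed in minutes.
    wheels_in = pd.to_datetime(ehr["wheels_in_time"])
    wheels_out = pd.to_datetime(ehr["wheels_out_time"])
    return (wheels_out - wheels_in).dt.total_seconds() / 60.0
```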
Table 2 AI model characteristics of the included studies

| Author/year | Intended use of AI algorithm | No. of preoperative variables used as model inputs | Ground truth label | External validation of data source | Data cleaned before analysis | Total no. of patient records used | No. of patients used for training/testing | AI/predictive algorithms studied |
|---|---|---|---|---|---|---|---|---|
| Ng et al., 201716 | Prediction of surgical case-time duration/exploration of homoscedastic versus heteroscedastic modelling of data | 27 | EHR | No | Yes | 86 796 | ∼69 400/∼10 400 (80/8/12 train/validation/test split) | Linear regression, MLP (with Gaussian, Laplace, gamma distributions) |
| Master et al., 201711 | Prediction of surgical case-time duration | 8 (+1 with surgeon’s prediction in some models) | EHR | No | Not described | 1713 | 820/893 | Single decision tree regressor, random forest regressor, gradient-boosted regression trees |
| Zhao et al., 201822 | Prediction of surgical case-time duration | 28 | EHR | No | Not described | 424 | 424/500 (used entire data set as test) | Multivariable linear regression, ridge regression, lasso regression, random forest, boosted regression tree, neural network |
| Tuwatananurak et al., 201925 | Prediction of surgical case-time duration | >1500 | EHR | No | Not described | ∼15 000 | ∼15 000/∼990 | The proprietary Leap Rail® engine, which uses a combination of supervised learning algorithms |
| Bartek et al., 201918 | Prediction of surgical case-time duration (comparing procedure-specific and surgeon-specific models) | 13 | EHR | No | Yes | 46 986 | 37 588/9398 | Linear regression, extreme gradient boosting (other models used but not reported in paper) |
| Jiao et al., 202023 | Prediction of surgical case-time duration | 11 | EHR | No | Yes | 52 735 | 42 377/10 358 | Decision tree, random forest, gradient-boosted decision tree, mixture density model |
| Martinez et al., 202121 | Prediction of surgical case-time duration | 7 | EHR | No | Yes | 81 248 | ∼65 000/∼16 000 (80/20 train/test split) | Linear regression, support vector machines, regression trees, bagged trees |
| Strömblad et al., 202124 | Assessment of benefit of more accurate predictions | Up to 300 | EHR | No | Yes | 756 | 605/151 (80/20 train/test split) | Random forest |
| Lai et al., 202120 | Prediction of surgical case-time duration | 20 | EHR | No | Yes | 86 621 | ∼82 300/4300 (95/5 train/test split) | Linear regression, random forest, extreme gradient boosting |
| Jiao et al., 202219 | Prediction of surgical case-time duration | >16 (not stated clearly) | EHR | Yes | Yes | 69 018 | 59 926/9092 (92.5/7.5 train/test split) | Bayesian approach, modular artificial neural network |
| Abbas et al., 202240 | Prediction of surgical case-time duration and length of stay | 32 | EHR | No* | Yes | 302 300 | 182 000/57 841/62 459 (training/validation/test, split by year) | Linear regression, SGD regression, elastic net, linear SVM, KNN, decision tree, random forest, AdaBoost, XGBoost, scikit-learn MLP, PyTorch MLP |
| Miller et al., 202239 | Prediction of surgical case-time duration | 20 | EHR | No | Yes | 50 888 | 40 710/10 178 (80/20 train/test split) | CatBoost, XGBoost |
| Witvoet et al., 202341 | Determination of how certain variables affected operative time and generation of ML predictions for robotic-assisted primary total knee arthroplasty | 30 | EHR | No | Yes | 18 465 | 14 772/3693 (80/20 train/test split) | CatBoost, TabNet |
| Gabriel et al., 202342 | Prediction of surgical case-time duration | 9 | EHR | No | Yes | 3189 | 2551/638 (80/20 train/test split) | Linear regression, bagged trees, random forest, XGBoost |

Bold indicates DL algorithms. AI, artificial intelligence; EHR, electronic health record; ∼, approximately; SGD, stochastic gradient descent; SVM, support vector machine; KNN, k-nearest neighbour; AdaBoost, adaptive boosting; XGBoost, extreme gradient boosting; MLP, multilayer perceptron; CatBoost, categorical gradient boosting; TabNet, DL algorithm for tabular data. *Abbas et al. internally validated using a national database but did not test on an external data set.
After processing the data using appropriate inclusion/exclusion criteria and removing missing data (not all studies did this), the total number of records used for training and testing varied from 424 up to 302 30022,40. A wide variety of ML algorithms was utilized, all following supervised learning methodology: linear regression, stochastic gradient descent, k-nearest neighbours, single decision tree regressors, random forest regressors, gradient-boosted regression trees, extreme gradient boosting, categorical gradient boosting, neural networks, support vector machines, bagged trees, TabNet, multilayer perceptrons and mixture density models.
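The experimental design shared by most of these studies (a train/test split, several supervised regressors, and comparison against a historical-average control) can be sketched as follows; the data, split ratio and model choices here are illustrative assumptions, not a reproduction of any included study:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 10))  # hypothetical preoperative variables
y = 90 + 25 * X[:, 0] - 15 * X[:, 1] + rng.normal(scale=20, size=5000)  # minutes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Crude stand-in for the historical-averaging control.
baseline = np.full_like(y_te, y_tr.mean())
print(f"historical mean        MAE: {mean_absolute_error(y_te, baseline):5.1f} min")

for model in (LinearRegression(),
              RandomForestRegressor(n_estimators=200, random_state=1),
              GradientBoostingRegressor(random_state=1)):
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{type(model).__name__:<22} MAE: {mae:5.1f} min")
```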
Predictive capacity of models
Table 3 demonstrates the results of each study. Ten of the studies included 'feature importance' information (which variable has the greatest impact on the model(s))11,16,18,20,23,25,39–42; a brief sketch of how such values are typically obtained follows Table 3. Three studies reported that the type of procedure was the most important variable affecting duration16,20,39; other studies that reported feature importance commonly demonstrated that expert prediction/scheduled duration, primary surgeon, patient weight and the average case-time duration of the latest ten surgeries at the procedure level all had significant impacts on the models they designed11,18,23,25,41. Abbas et al.40, whose study focused on one specific operation, demonstrated that renal failure and transfusions given within 72 h preoperatively were the most important variables.
Table 3 Predictive capacity of the models in the included studies

| Author/year | Control | Variable(s) with the greatest influence on prediction | Evaluation/comparison metric | AI/predictive algorithms utilized | Performance of AI model(s) versus control | AI model more accurate than control (statistically significant)? | Conclusion |
|---|---|---|---|---|---|---|---|
| Ng et al., 201716 | Historical averaging, procedure mean | Procedure type | RMSE, MAE, NLL | Linear regression | 45.23 versus 49.8 | Not stated | The study demonstrates the efficacy of machine learning and the heteroscedastic nature of surgical duration data |
| | | | | MLP (gamma distribution)* | 43.38 versus 49.8 | Yes, P = 0.01 | |
| Master et al., 201711 | Historical averaging | Expert prediction/primary surgeon/patient weight | R2, average prediction accuracy by study-derived performance metric | Single decision tree regressor (with EP) | 0.42 versus 0.34 | Not stated | New prediction models outclass old models and, if used in conjunction with expert opinion, outperform expert opinion. Potential for decision support tools to automate OT scheduling |
| | | | | Single decision tree regressor (without EP) | 0.28 versus 0.34 | Not stated | |
| | | | | Random forest regressor (with EP) | 0.57 versus 0.34 | Yes (not stated) | |
| | | | | Random forest regressor (without EP) | 0.38 versus 0.34 | Not stated | |
| | | | | Gradient-boosted regression trees (with EP)* | 0.61 versus 0.34 | Yes (not stated) | |
| | | | | Gradient-boosted regression trees (without EP) | 0.44 versus 0.34 | Not stated | |
| Zhao et al., 201822 | Historical averaging | Not reported | RMSE | Multivariable linear regression | 86.8 versus 100.4 | No (95% confidence intervals) | ML-based predictive models are more accurate than current methods. This will increase the number of accurately booked case durations, which may reduce under- and overutilization of OTs |
| | | | | Ridge regression | 82.4 versus 100.4 | No (95% confidence intervals) | |
| | | | | Lasso regression | 81.3 versus 100.4 | No (95% confidence intervals) | |
| | | | | Random forest | 81.9 versus 100.4 | Yes (95% confidence intervals) | |
| | | | | Boosted regression tree* | 80.2 versus 100.4 | Yes (95% confidence intervals) | |
| | | | | Neural network | 89.6 versus 100.4 | No (95% confidence intervals) | |
| Tuwatananurak et al., 201925 | Historical averaging | Historical averaging | Mean absolute difference with interquartile ranges | The proprietary Leap Rail® engine, which uses a combination of supervised learning algorithms* | 20.0 versus 27.0 (for all cases) | Yes, P = 0.0001 (for all cases) | Statistically significant improvement of an average of 7 min with the Leap Rail engine. Post hoc modelling suggests this could represent a 70% reduction in scheduling inaccuracy |
| Bartek et al., 201918 | Historical averaging and expert predictions | Average case-time duration of latest 10 surgeries at the procedure level | MAPE, R2, percentage overage, percentage underage, percentage within 10% | Linear regression | 36% versus 30% | Not stated | XGBoost ML models demonstrated the best results versus other ML models/current standards. Potential for surgeon-specific ML models to improve scheduling |
| | | | | Extreme gradient boosting* | 26% versus 30% | Not stated | |
| Jiao et al., 202023 | Expert prediction primarily (and historical averaging) | Scheduled duration | CRPS | Bayesian statistical method | 21.2 min versus 32.1 min | Not stated | Demonstrated that unstructured hospital data can be used for prediction. Advanced application of ML in this field to potentially inform an intelligent scheduling system |
| | | | | Decision tree | 22.1 min versus 32.1 min | Not stated | |
| | | | | Random forest | 19.6 min versus 32.1 min | Not stated | |
| | | | | Gradient-boosted decision tree | 19.5 min versus 32.1 min | Not stated | |
| | | | | Mixture density network* | 18.1 min versus 32.1 min | Not stated | |
| Martinez et al., 202121 | Historical average and expert prediction | Not reported | RMSE | Linear regression | 30.84 versus 26.09 (versus bagged trees) | Not stated | Bagged tree algorithms show an improved overall error rate compared with traditional methods. They recommend research on complementary periods such as anaesthesia/cleaning/recovery |
| | | | | Support vector machine | 30.27 versus 26.09 (versus bagged trees) | Not stated | |
| | | | | Regression trees | 27.94 versus 26.09 (versus bagged trees) | Not stated | |
| | | | | Bagged trees* | 27.98 versus 64.34 (versus current standards) | Not stated | |
| Strömblad et al., 202124 | Historical averaging and expert predictions | Not reported | MAE | Random forest* | 49.5 min versus 59.3 min (16.5% improvement) | Yes, P = 0.03 | Implementation of an ML model significantly improved accuracy in predicting case duration and led to reduced patient wait time and reduced presurgical length of stay |
| Lai et al., 202120 | Historical average | Procedure type | R2, MAE, percentage overage/underage and within threshold | Linear regression | 0.72 versus 0.68 | Not stated | The XGB model was superior in predictive performance to the average, regression and random forest models |
| | | | | Random forest | 0.74 versus 0.68 | Not stated | |
| | | | | Extreme gradient boosting* | 0.77 versus 0.68 | Not stated | |
| Jiao et al., 202219 | Scheduled duration | Not reported | CRPS | Bayesian approach | 20.3 min versus 37.0 min | Not stated | They suggest that ML models have a role in informing operational decisions and are superior to classical scheduling and traditional statistical alternatives. ML may reduce the costs of surgery |
| | | | | MANN* | 13.8 min versus 37.0 min | P < 0.001 (for MANN versus Bayesian approach) | |
| Abbas et al., 202240 | Mean regressor (historical average) | Renal failure, transfusion given within 72 h preoperatively, in/outpatient status, CHF status, presence of disseminated cancer (study performed on only one type of surgery) | MSE | Linear regression | 0.989 versus 1.031 | Not stated | The study demonstrated that both deep ML models and conventional ML models were superior to mean regression. However, there was no significant difference in prediction accuracy between deep and conventional ML models |
| | | | | SGD regression | 1.013 versus 1.031 | Not stated | |
| | | | | Elastic net | 0.999 versus 1.031 | Not stated | |
| | | | | Linear support vector machine | 0.994 versus 1.031 | Not stated | |
| | | | | K-nearest neighbours | 1.156 versus 1.031 | Not stated | |
| | | | | Decision tree | 1.032 versus 1.031 | Not stated | |
| | | | | Random forest | 1.009 versus 1.031 | Not stated | |
| | | | | AdaBoost | 1.031 versus 1.031 | Not stated | |
| | | | | XGBoost | 1.003 versus 1.031 | Not stated | |
| | | | | Scikit-learn MLP | 0.978 versus 1.031 | Not stated | |
| | | | | PyTorch MLP* | 0.893 versus 1.031 | Not stated | |
| Miller et al., 202239 | Historical average | Procedure performed, surgeon, type of case by subspecialty, and surgery case status (day surgery versus inpatient) | RMSE, MAE | XGBoost | 39.3 versus 46.3 | Yes, P < 0.001 | Application of a machine learning algorithm to historical otolaryngology case data enabled significant improvement in the prediction of OR case duration. Such methods have the potential to increase case duration accuracy and may result in significant cost savings |
| | | | | CatBoost* | 38.2 versus 46.3 | Yes, P < 0.001 | |
| Witvoet et al., 202341 | Historical average | Average surgical time, gender, number of surgeries executed by the surgeon until case date | R2, RMSE, within 5 min %, within 10 min %, within 15 min % | CatBoost* | 0.53 versus 0.49 | Yes, P = 0.003 | The ML model developed demonstrated superior accuracy for predicting operative time using hospital, surgeon and patient data compared with historical averages. Hospitals should consider predicting operative times by means of ML algorithms; by doing so they may be able to optimize resource utilization |
| | | | | TabNet | 0.51 versus 0.49 | Yes, P = 0.013 | |
| Gabriel et al., 202342 | Historical average and expert prediction | BMI, spine fusion | R2, MAE, RMSE | Linear regression | 0.34 versus −0.57 | Not stated | The use of ensemble learning with patient- and procedure-specific features (available preoperatively) outperformed standard predictions of spine surgery case duration. The implementation of ML models presents an alternative pathway to increase efficiency and enrich patient outcomes |
| | | | | Random forest | 0.76 versus −0.57 | Not stated | |
| | | | | Bagged trees | 0.76 versus −0.57 | Not stated | |
| | | | | XGBoost* | 0.77 versus −0.57 | Not stated | |
Bold indicates DL algorithms and the evaluation metric whose values are presented in the sixth column. MANN, modular artificial neural network; RMSE, root mean square error; MAE, mean absolute error; NLL, negative log-likelihood; MAPE, mean absolute percentage error; CRPS, continuous ranked probability score; GBT, gradient-boosted decision tree; MDN, mixture density network; OT, operating theatre; ML, machine learning; EP, expert prediction; MSE, mean square error; CHF, congestive heart failure; SGD, stochastic gradient descent; SVM, support vector machine; MLP, multilayer perceptron; XGBoost, extreme gradient boosting; CatBoost, categorical gradient boosting; TabNet, deep neural network for tabular data; AI, artificial intelligence. *Indicates the most accurate model.
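As noted above, a brief sketch of how 'feature importance' values of the kind reported in Table 3 are commonly obtained; permutation importance is one widely used, model-agnostic approach, and the data and feature names below are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
features = ["procedure_type", "primary_surgeon", "patient_weight", "asa_class"]
X = rng.normal(size=(2000, len(features)))
y = 80 + 30 * X[:, 0] + 10 * X[:, 1] + rng.normal(scale=15, size=2000)

model = RandomForestRegressor(n_estimators=200, random_state=2).fit(X, y)

# Importance = mean drop in model score when a feature's values are shuffled.
result = permutation_importance(model, X, y, n_repeats=10, random_state=2)
for name, score in sorted(zip(features, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name:<16}{score:.3f}")
```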
Author/Year . | Control . | Variable(s) with the greatest influence on prediction . | Evaluation/comparison metric . | AI/predictive algorithms utilized . | Performance of AI model(s) versus control . | AI model more accurate than control (statistically significant) . | Conclusion . | |
---|---|---|---|---|---|---|---|---|
Ng et al., 201716 | Historical averaging, procedure mean | Procedure type | RMSE, MAE, NLL | Linear regression, | 45.23 versus 49.8 | Not stated | The study demonstrates the efficacy of machine learning and the heteroscedastic nature of surgical duration data | |
MLP (gamma distribution)* | 43.38 versus 49.8 | Yes, P = 0.01 | ||||||
Master et al., 201711 | Historical averaging | Expert prediction/primary surgeon/patient weight | R2, average prediction accuracy by study-derived performance metric | Single decision tree regressor | With EP | 0.42 versus 0.34 | Not stated | New prediction models outclass old models and if used in conjunction with expert opinion outperform expert opinion. Potential for decision support tools to automate OT scheduling |
Without EP | 0.28 versus 0.34 | Not stated | ||||||
Random forest regressor | With EP | 0.57 versus 0.34 | Yes (not stated) | |||||
Without EP | 0.38 versus 0.34 | Not stated | ||||||
Gradient-boosted regression trees* | With EP | 0.61 versus 0.34 | Yes (Not stated) | |||||
Without EP | 0.44 versus 0.34 | Not stated | ||||||
Zhao et al., 201822 | Historical averaging | Not reported | RMSE | Multivariable linear regression, | 86.8 versus 100.4 | No (95% confidence intervals) | ML-based predictive models are more accurate than current methods. This will increase the number of accurately booked case durations which may reduce under- and overutilization of OTs | |
Ridge regression | 82.4 versus 100.4 | No (95% confidence intervals) | ||||||
Lasso regression | 81.3 versus 100.4 | No (95% confidence intervals) | ||||||
Random forest | 81.9 versus 100.4 | Yes (95% confidence intervals) | ||||||
Boosted regression tree* | 80.2 versus 100.4 | Yes (95% confidence intervals) | ||||||
Neural network | 89.6 versus 100.4 | No (95% confidence intervals) | ||||||
Tuwatananurak et al., 201925 | Historical averaging | Historical averaging | Mean absolute difference with interquartile ranges | The proprietary Leap Rail® engine uses a combination of supervised learning algorithms* | 20.0 versus 27.0 (for all cases) | Yes, P = 0.0001 (for all cases) | Statistically significant improvement of an average of 7 minutes with the LeapRail engine. Post-hoc modelling suggests this could represent a 70% reduction in scheduling inaccuracy | |
Bartek et al., 201918 | Historical averaging and expert predictions | Average case-time duration of latest 10 surgeries at the procedure level | MAPE, R2, percentage overage, percentage underage, percentage within 10% | Linear regression | 36% versus 30% | Not stated | XGBoost ML models demonstrated the best results versus other ML models/current standards. Potential for surgeon-specific ML models to improve scheduling | |
Extreme gradient boosting* | 26% versus 30% | Not stated | ||||||
Jiao et al., 202023 | Expert prediction primarily (and historical averaging) | Scheduled duration | CRPS | Bayesian statistical method | 21.2 (min) versus 32.1 (min) | Not stated | Demonstrated unstructured hospital data can be used for prediction. Advanced application of ML in this field to potentially inform an intelligent scheduling system | |
Decision tree | 22.1 (min) versus 32.1 (min) | Not stated | ||||||
Random forest | 19.6 (min) versus 32.1 (mins) | Not stated | ||||||
Gradient boosted decision tree | 19.5 (min) versus versus 32.1 (min) | Not stated | ||||||
Mixture density network* | 18.1 (min) versus 32.1 (min) | Not stated | ||||||
Martinez et al., 202121 | Historical average and expert prediction | Not reported | RMSE | Linear regression | 30.84 versus 26.09 (versus bagged trees) | Not stated | Bagged tree algorithms show an improved overall error rate compared with traditional methods. They recommend research on complementary periods like anaesthesia/cleaning/recovery | |
Support vector machine | 30.27 versus 26.09 (versus bagged trees) | Not stated | ||||||
Regression trees | 27.94 versus 26.09 (versus bagged trees) | Not stated | ||||||
Bagged trees* | 27.98 versus 64.34 (versus current standards) | Not stated | ||||||
Strömblad et al., 202124 | Historical averaging and expert predictions | Not reported | MAE | Random forest* | 49.5 (mins) versus 59.3 (mins) (16.5% improvement) | Yes, P = 0.03 | Implementation of an ML model significantly improved accuracy in predicting case duration and led to reduced patient wait-time, and reduced pre-surgical length of stay | |
Lai et al., 202119 | Historical average | Procedure type | R2, MAE, percentage overage/underage and within the threshold | Linear regression | 0.72 versus 0.68 | Not stated | The XGB model was superior in predictive performance to the average, regression and random forest models | |
Random forest | 0.74 versus 0.68 | Not stated | ||||||
Extreme gradient boosting* | 0.77 versus 0.68 | Not stated | ||||||
Jiao et al., 202219 | Scheduled duration | Not reported | CRPS | Bayesian approach | 20.3 min versus 37.0 min | Not stated | They suggest that ML models have a role in informing operational decisions, they are superior to classical scheduling and traditional statistical alternatives. ML may reduce the costs of surgery | |
MANN* | 13.8 min versus 37.0 min | P < 0.001 (for MANN versus Bayesian approach) | ||||||
Abbas et al., 202240 | Mean regressor (historical average) | Renal failure, transfusion given within 72 h preoperatively, in/outpatient status, CHF status, presence of disseminated cancer (study performed on only one type of surgery) | MSE | Linear regression | 0.989 versus 1.031 | Not stated | The study demonstrated that both deep ML models and conventional ML models were superior to mean regression. However, there was not a significant difference between deep and conventional ML models when looking at accuracy predictions | |
SGD regression | 1.013 versus 1.031 | Not stated | ||||||
Elastic Net | 0.999 versus 1.031 | Not stated | ||||||
Linear support vector machine | 0.994 versus 1.031 | Not stated | ||||||
K-nearest neighbours | 1.156 versus 1.031 | Not stated | ||||||
Decision tree | 1.032 versus 1.031 | Not stated | ||||||
Random forest | 1.009 versus 1.031 | Not stated | ||||||
AdaBoost | 1.031 versus 1.031 | Not stated | ||||||
XGBoost | 1.003 versus 1.031 | Not stated | ||||||
Scikit-learn MLP | 0.978 versus 1.031 | Not stated | ||||||
PyTorch MLP* | 0.893 versus 1.031 | Not stated | ||||||
Miller et al., 202239 | Historical average | Procedure performed, surgeon, type of case by subspecialty, and surgery case status (day surgery versus inpatient) | RMSE, MAE | XGBoost | 39.3 versus 46.3 | Yes, P < 0.001 | Application of a machine learning algorithm to historical otolaryngology case data enabled significant improvement in the prediction of OR case duration. Such methods have the potential to increase case duration accuracy and may result in significant cost savings | |
CatBoost* | 38.2 versus 46.3 | Yes, P < 0.001 | ||||||
Witvoet et al., 202341 | Historical average | Average surgical time, gender, number of surgeries executed by the surgeon until case date | R2, RMSE, within 5 min %, within 10 min %, within 15 min % | CatBoost* | 0.53 versus 0.49 | Yes, P = 0.003 | The ML model developed demonstrated superior accuracy for predicting operative time using hospital, surgeon and patient data compared to historical averages. Hospitals should consider predicting operative times by means of ML algorithms. By doing so they may be able to optimize resource utilization |
TabNet | 0.51 versus 0.49 | Yes, P = 0.013 | ||||||
Gabriel et al., 202342 | Historical average and expert prediction | BMI, spine fusion | R2, MAE, RMSE | Linear regression | 0.34 versus −0.57 | Not stated | The use of ensemble learning with patient and procedure specific features (available preoperatively) outperformed the prediction of spine surgery case duration compared to standard predictions. The implementation of ML models presents an alternative pathway to increasing efficiency and enrich patient outcomes | |
Random forest | 0.76 versus −0.57 | Not stated | ||||||
Bagged trees | 0.76 versus −0.57 | Not stated | ||||||
XGBoost* | 0.77 versus −0.57 | Not stated |
Bold indicates DL algorithms and which evaluation metric is presented in the sixth column. MANN, modular artificial neural network; RMSE, root mean square error; MAE, mean absolute error; NLL, negative log-likelihood; MAPE, mean absolute percentage error; CRPS, continuous ranked probability score; GBT, gradient boosted decision tree; MDN, mixture density network; OT, operating theatre; ML, machine learning; EP, expert prediction; MSE, mean square error; CHF, congestive heart failure; SGD, stochastic gradient descent; SVM, support vector machine; MLP, multilayer perceptron; XGBoost, extreme gradient boosting; CatBoost, categorical gradient boosting; TabNet, deep neural network for tabular data; AI, artificial intelligence. *Indicates the most accurate model.
All studies used several different algorithms to predict the case-time duration from the data provided. Except for the article by Tuwatananurak et al.25, all studies state which algorithms were the most accurate through their testing. Tree-based ML models (Fig. 3) account for nine of the 14 best models11,18,20–22,24,39,41,42, five of which used some form of gradient boosting11,18,20,22,39,42, and one used a bagging method21. Six studies also demonstrated improvements in predictive power with DL models: a multilayer perceptron (MLP)16,40, a neural network22, a mixture density network (MDN)23, TabNet41 and a modular artificial neural network (MANN)19.

Demonstrating the simple structure of a tree-based algorithm with tree terminology
Efficiency savings
Three studies discussed efficiency savings; however, the findings were too heterogeneous to present in a table and are therefore summarized in the discussion.
Discussion
Among the 14 studies identified, all developed ML algorithms and compared their accuracy with the current industry standards, but only one reported the time efficiency savings that could potentially be achieved by implementing such a system24. Eleven of the 14 studies were conducted in the USA11,16,18,19,22–25,39,41,42, with the only overlap in authorship being the two studies by Jiao et al.19,23.
All studies reported the development of an ML model that was more accurate than the current industry standards of historical averaging or surgeon estimation11,16,18–25,39–42. Seven of the studies demonstrated that the differences in predictive capability were statistically significant (P < 0.05)16,19,22,24,25,39,41, suggesting that ML models offer superior accuracy in the prediction of surgical case duration. Nine of the studies found tree-based ML models to be the most accurate, outperforming both standard prediction methods and other ML algorithms11,18,20–22,24,39,41,42. Tree-based methods are supervised ML models that are popular for their capacity to handle multifaceted data and to capture complex, non-linear relationships; they work by segmenting the predictor space into several simple subsets43. The technique has been advanced by 'ensemble learning' methods, including random forests (RF)44, gradient boosting (GBM)45, eXtreme gradient boosting (XGB)46 and CatBoost47, which build on a single decision tree by training many trees and aggregating their outputs, providing greater predictive accuracy and more robust models.
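The contrast between a historical-averaging baseline and a gradient-boosted ensemble can be illustrated with a minimal sketch. This is not the pipeline of any included study: the data are synthetic and the feature set (surgeon, procedure code, BMI, ASA score) is illustrative only.

```python
# Minimal sketch, assuming synthetic data: comparing a 'historical average'
# style baseline against a gradient-boosted tree ensemble for case duration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5000
X = np.column_stack([
    rng.integers(0, 20, n),   # surgeon identifier (encoded)
    rng.integers(0, 50, n),   # procedure code (encoded)
    rng.normal(28, 5, n),     # patient BMI
    rng.integers(1, 5, n),    # ASA score
])
# Synthetic 'true' duration: non-linear in procedure and BMI, plus noise
y = 60 + 2.5 * X[:, 1] + 0.8 * (X[:, 2] - 28) ** 2 + rng.normal(0, 15, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline analogous to historical averaging: predict the mean duration
baseline_pred = np.full(len(y_test), y_train.mean())
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

print(f"Baseline MAE: {mean_absolute_error(y_test, baseline_pred):.1f} min")
print(f"GBM MAE:      {mean_absolute_error(y_test, model.predict(X_test)):.1f} min")
```

On such non-linear data the ensemble typically cuts the error substantially, which mirrors the direction of the results reported above, although the magnitudes here are artefacts of the simulation.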
Of the studies in this review, only six16,19,22,23,40,41 produced models using 'deep learning' techniques. Zhao et al.22 demonstrated that DL models are not always superior to their ML predecessors; their neural network was the worst-performing model in the study, with an RMSE of 89.6 versus 80.2 for boosted regression trees (even linear regression outperformed the neural network, with an RMSE of 86.8)22. However, those models were trained on only 424 records. Jiao et al.23 utilized a much larger data set of 52 735 records and showed that their mixture density network (a type of DL model) achieved the lowest CRPS of all the models, at 18.1 min versus 19.5 min for the next closest model, gradient boosted regression trees. All four studies16,19,23,40 that trained DL models on data sets of over 50 000 records found DL models to be the most accurate prediction method compared with both ML models and the current industry standards, supporting the importance of training DL models on larger data sets.
Jiao et al.19 developed a unique approach to predicting the duration of surgery. Their DL model, termed a modular artificial neural network (MANN), continually incorporated preoperative variables as well as intraoperative variables, including vital signs, medications and intraoperative events/stages, producing updated predictions at every 5 per cent increment of the total duration. The model was compared with the industry-standard mean regressor and a simpler model using Bayes' theorem. The CRPS (a measure of probabilistic prediction accuracy) of the MANN improved towards the end of surgery and was statistically significantly better than both the Bayesian method and the mean regressor at all reported time intervals. Jiao et al.19 further reported that their algorithm could predict impending staff overtime, allowing for better resource management.
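The CRPS itself is worth a brief illustration, since several included studies report it. For a Gaussian forecast N(μ, σ) there is a closed form (Gneiting and Raftery); the sketch below applies it to invented numbers and does not reproduce the MANN or any study's evaluation code.

```python
# Illustrative only: closed-form CRPS of observation y under a Gaussian
# forecast N(mu, sigma). Lower is better; as sigma -> 0 it reduces to the
# absolute error, so it rewards forecasts that are both accurate and
# honestly calibrated about their own uncertainty.
import numpy as np
from scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1)
                    + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

# A forecast of 120 +/- 20 min against an actual duration of 135 min
print(f"CRPS (sigma=20): {crps_gaussian(135.0, 120.0, 20.0):.2f} min")
# An over-confident forecast with the same mean is penalized more for the miss
print(f"CRPS (sigma=10): {crps_gaussian(135.0, 120.0, 10.0):.2f} min")
```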
The number of variables, features or data sources used within each study varied from seven to over 150021,25. The most common variables were: primary surgeon, historical average surgical duration, surgeon experience, procedure name, the position of the procedure on the operating list, type of anaesthesia, duration of the case, patient BMI, patient age, ASA score, patient sex, patient co-morbidities and anaesthesia provider (consultant/junior).
More variables might be expected to allow more accurate predictions; however, this is not always the case, because redundant variables can add 'noise'48,49. Data science practitioners therefore engage in 'feature selection' to identify useful variables and remove those causing noise48,49. A further issue is the quality and timing of variable recording: the ASA score, for example, may only be recorded on the day of surgery, making it unavailable for advance planning.
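One common feature-selection aid, permutation importance, is sketched below; the included studies do not necessarily use this exact technique, and the data and feature names are synthetic. Features whose shuffling barely changes the model's error contribute mostly noise and are candidates for removal.

```python
# Hedged sketch of permutation importance on synthetic data: two of the
# five columns are pure noise and should score near zero.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 2000
informative = rng.normal(size=(n, 3))   # e.g. weight, age, procedure code
noise = rng.normal(size=(n, 2))         # redundant columns
X = np.hstack([informative, noise])
y = 90 + 20 * informative[:, 0] - 10 * informative[:, 2] + rng.normal(0, 5, n)

model = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for name, imp in zip(["weight", "age", "procedure", "noise_1", "noise_2"],
                     result.importances_mean):
    print(f"{name:>10}: {imp:.3f}")
```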
Master et al.11 found that the ASA score had low importance within their models, suggesting that the important information within the ASA score may already be encoded more clearly in other variables such as the patient's weight; in their models, patient weight had a much larger effect on the outcome than the ASA score. Notably, this review found no significant difference in accuracy between models using 8 variables and models using over 1500 variables11,25.
Two studies predicted the duration of surgery and length of stay for a single procedure, eliminating the 'procedure performed' variable40,41. For Abbas et al.40, the model with the lowest testing MSE was a PyTorch MLP. The most important features were the presence of renal failure, followed by blood transfusion given within 72 h preoperatively, in-/outpatient status, congestive heart failure status and the presence of disseminated cancer. This confirms that patient factors need to be taken into consideration when defining appropriate data sets for algorithms.
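For readers unfamiliar with the architecture, a minimal MLP regressor in PyTorch is sketched below. It is in the spirit of, but does not reproduce, the Abbas et al. model; the layer sizes, hyperparameters and data are invented.

```python
# Minimal sketch, assuming synthetic data: a small MLP regressing case
# duration (minutes) on five preoperative patient features.
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(1000, 5)  # 5 preoperative features, standardized
y = (120 + 30 * X[:, 0] - 15 * X[:, 3]).unsqueeze(1) + 10 * torch.randn(1000, 1)

model = nn.Sequential(
    nn.Linear(5, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),              # predicted duration in minutes
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print(f"final training MSE: {loss.item():.1f}")
```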
Three studies included in this review directly discussed efficiency or cost savings. Strömblad et al.24 conducted a prospective interventional RCT in which patients undergoing colorectal and gynaecological surgery were randomized to either ML or industry-standard methods for predicting the duration of surgery, with the resulting predictions fed into a scheduling workflow. Predictions for patients assigned to the ML algorithm had a significantly lower MAE, by 9.8 min (P = 0.03). Among the operational benefits, average patient wait times were significantly reduced from 49.4 min to 16.3 min (a 67.1 per cent improvement). The number of large prediction errors (MAE > 60 min) was reduced by 8.3 per cent, an important result, as errors of this magnitude can disrupt the operating day, leading to case cancellation or extended overtime24.
Tuwatananurak et al.25 reported a cumulative reduction of 19 000 min, or 70 per cent, in scheduling inaccuracy over a 3-month period across the two surgical suites in their medical centre. They note that the average cost per minute of operating theatre time in the USA was estimated at between $22 and $133 at the time of publication50,51, highlighting the possible cost savings achievable.
Jiao et al. performed a post-hoc analysis of overtime prediction and found that, of the 960 cases in the test data that overran, the ML model correctly identified 110 more as likely to run overtime than the standard prediction method. These cases overran by an average of 154 min. They suggested that if only 10 per cent of overrun cases were identified beforehand, 28.2 h of overtime pay could have been avoided in the month sampled by planning appropriate, timely staff handovers52. Tuwatananurak and Jiao reported efficiency savings on post-hoc analysis; only Strömblad et al. confirmed the benefits of ML through a prospective study.
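The arithmetic behind the 28.2 h figure can be reconstructed as follows (a plausible reading, not a calculation stated explicitly by the study): the 110 additionally identified cases account for

$$110 \times 154\,\text{min} = 16\,940\,\text{min} \approx 282\,\text{h}$$

of overtime, and capturing 10 per cent of this gives $0.10 \times 16\,940\,\text{min} = 1694\,\text{min} \approx 28.2\,\text{h}$.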
Many of the included studies were of poor quality. Eleven of the 14 studies (79 per cent)11,16,18,21–25,39,41,42 did not discuss or report any external validation of their models, limiting the algorithms' use outside the host institution. Only one of the included studies performed external validation: Jiao et al.19 generated a database from a different, albeit local, hospital on which to train and test their algorithm. Abbas et al.40, meanwhile, utilized data from an American national database and validated internally by splitting the data by year, providing a degree of generalizability across the country.
Although both techniques improve the generalization of the algorithms, they still require testing on wider data. Three of the 14 studies22,25,39 did not clearly state how missing data were managed; mismanagement of missing data can reduce statistical power and introduce bias into the results of ML analysis53–55. Tuwatananurak et al.25 did not disclose how their model(s) were trained, tested or validated; this information is required to reproduce study findings and for critical appraisal. Four studies16,18,21,25 did not report baseline sample characteristics, preventing the reader from confirming whether randomization or splitting of the data was appropriate48,49.
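One defensible pattern for handling missing data is to fit the imputer on the training split only, so that no information leaks into the test set. The sketch below is illustrative and is not drawn from any included study; the pipeline, columns and missingness mechanism are assumptions.

```python
# Hedged sketch: leakage-free imputation inside a pipeline on synthetic data
# with ~10% of values missing at random.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X[rng.random(X.shape) < 0.1] = np.nan   # inject missing values
y = 100 + 25 * np.nan_to_num(X[:, 0]) + rng.normal(0, 10, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
pipeline.fit(X_train, y_train)  # imputer statistics learned from training data only
print(f"R^2 on held-out data: {pipeline.score(X_test, y_test):.2f}")
```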
Several studies11,22,24 used small data sets (<1000 records) to train their algorithms, leaving the algorithms susceptible to overfitting (fitting statistical noise so closely that predictions on new data are poor) and inadequately powered. Larger data sets, including national databases, are clearly necessary; however, this requires further work on improving the connectivity and accessibility of such data. Standardization could allow different researchers to access large data sets from multiple centres, improving algorithm development and reducing the limitation of drawing data from only one or two sites.
More generally, comparing the performance of algorithms is challenging owing to methodological heterogeneity: evaluation metrics varied between studies. Most studies lacked external validation, meaning that these algorithms may not transfer to other institutions. Studies often lacked detail on the technical aspects of the ML models used. Finally, some articles that were unpublished or not indexed may have been missed.
The authors note that discussion of the issues involved in implementing ML models in clinical practice was sparse in the papers identified. Some issues are highlighted relating to real-time data pipelines19 and to how categorical data are encoded into the models11, and Strömblad et al. note that a prospective model requires data to be available weeks before surgery24. The authors recommend that implementation issues be considered in future studies and that implementation research be utilized to enhance study protocol design56–58.
The models demonstrated in this review produced predictions retrospectively or, at most, one day in advance. Implementing AI-based case duration prediction in surgical centres will require models that can predict surgical duration weeks in advance.
This is one of the first studies to utilize Farrow et al.’s37 proposed standardized reporting of predictive ML research and the proposed TRIPOD-AI59 statement to improve the conduct, reproducibility and comparability of further surgical-AI research.
Despite the rapid development of ML algorithms, it is remarkable that only 14 suitable studies, all of low-level evidence and published since 2017, were identified for inclusion in this review26,60. This is likely due to multiple barriers, including poor-quality data collection and recording61,62, lack of standardization, and the use of multiple systems that lack compatible formats and interoperability51,52. The industry should be encouraged to develop an open application programming interface (API) standard.
Furthermore, the skills required to create and implement a successful healthcare-based ML model require input from computer scientists, surgeons, anaesthetists and medical managers; unless there is a dedicated research unit, it may be difficult to maintain cooperation between these diverse teams63.
Obtaining ethical approval is a further barrier that many technology-based solutions encounter when attempting to acquire the large quantities of data required for ML/DL models64. Among the numerous ethical issues is accountability for errors arising from the use of an ML algorithm65: who is to blame, the physician, the algorithm designers or the institution purchasing the technology65?
Implementing AI solutions requires addressing both human and technological factors. Technological factors include better data collection, extraction from current EHR systems25, categorization and pipelines66. Human factors include staff having the capability, opportunity and motivation to ensure that such systems function adequately. Allocating resources to close these gaps is vital to the successful implementation of AI solutions in healthcare systems64. ML is, moreover, only one facet of improving theatre utilization; other important factors include developing protocols to reduce turnover time between patients67. Further benefits may be realized through the implementation of ML models into electronic theatre scheduling pathways, including the introduction of new technologies and staff training, which may encourage further efficiency improvements64.
To integrate these modern solutions, the NHS needs to demand and drive standardization of its technological resources. Software developers must create APIs that allow ML models to access the data they require, and new standards should be developed for the implementation of open-source APIs that allow secure data extraction and interoperability between software packages. To support this drive and deliver the changes successfully, the NHS must also consider human factors such as staff education64.
These studies suggest that DL and ML models can predict surgical case duration more accurately than the current industry standards. There is early evidence that these improvements in accuracy will lead to efficiency and cost benefits, but more work is needed to identify the best way to implement these models.
Author contributions
Christopher Spence (Conceptualization, Formal analysis, Investigation, Methodology, Writing—original draft, Writing—review & editing), Owais Shah (Conceptualization, Investigation, Validation, Writing—review & editing), Anna Cebula (Writing—review & editing), Keith Tucker (Writing—review & editing), David Sochart (Writing—review & editing), Deiary Kader (Writing—review & editing), and Vipin Asopa (Conceptualization, Methodology, Project administration, Supervision, Writing—review & editing).
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Disclosure
The authors have no conflicts of interest to declare.
Supplementary material
Supplementary material is available at BJS Open online.
Data availability
The authors confirm that the data supporting the findings of this study are available within the article and its Supplementary Materials. Other resources can be made available upon request.