Michael F Byrne, Remo Panaccione, James E East, Marietta Iacucci, Nasim Parsa, Rakesh Kalapala, Duvvur N Reddy, Hardik Ramesh Rughwani, Aniruddha P Singh, Sameer K Berry, Ryan Monsurate, Florian Soudan, Greta Laage, Enrico D Cremonese, Ludovic St-Denis, Paul Lemaître, Shima Nikfal, Jerome Asselin, Milagros L Henkel, Simon P Travis, Application of Deep Learning Models to Improve Ulcerative Colitis Endoscopic Disease Activity Scoring Under Multiple Scoring Systems, Journal of Crohn's and Colitis, Volume 17, Issue 4, April 2023, Pages 463–471, https://doi.org/10.1093/ecco-jcc/jjac152
Abstract
Lack of clinical validation and inter-observer variability are two limitations of endoscopic assessment and scoring of disease severity in patients with ulcerative colitis [UC]. We developed a deep learning [DL] model to improve, accelerate and automate UC detection, and predict the Mayo Endoscopic Subscore [MES] and the Ulcerative Colitis Endoscopic Index of Severity [UCEIS].
A total of 134 prospective videos [1 550 030 frames] were collected and those with poor quality were excluded. The frames were labelled by experts based on MES and UCEIS scores. The scored frames were used to create a preprocessing pipeline and train multiple convolutional neural networks [CNNs] with proprietary algorithms in order to filter, detect and assess all frames. These frames served as the input for the DL model, with the output being continuous scores for MES and UCEIS [and its components]. A graphical user interface was developed to support both labelling video sections and displaying the disease severity assessment predicted by the artificial intelligence from endoscopic recordings.
Mean absolute error [MAE] and mean bias were used to evaluate the distance of the continuous model's predictions from the ground truth and any tendency to over- or under-predict; both were excellent for MES and UCEIS. The quadratic weighted kappa used to compare the inter-rater agreement between experts' labels and the model's predictions showed strong agreement [0.87, 0.88 at frame-level, 0.88, 0.90 at section-level and 0.90, 0.78 at video-level, for MES and UCEIS, respectively].
We present the first fully automated tool that improves the accuracy of the MES and UCEIS, reduces the time between video collection and review, and improves subsequent quality assurance and scoring.
Background and context: Endoscopic assessment and scoring the disease severity in UC is limited by inter-observer variability and lack of clinical validation.
New findings: We present the first fully automated AI model for UC disease activity scoring under both the MES and UCEIS, at frame, section and video levels, and which is ready for use in clinical practice. Our model improves the accuracy of both scoring systems, reduces the time between video collection and review, and improves subsequent quality assurance and scoring.
Limitations: Limitations include a limited dataset with imbalanced classes, limited generalizability, difficulty in describing a fair comparison with the literature due to the lack of an open dataset, and subjective ground truth for MES and UCEIS resulting in potential bias for the labellers reviewing AI-generated sections with a GUI.
Impact: Our results enable the development of a model that can be used to improve the efficiency and accuracy of UC endoscopic assessment and scoring at different stages of the clinical journey, such as video quality assurance by physicians, and increase the efficiency of central reading in clinical trials.
1. Introduction
Ulcerative colitis [UC] is a chronic inflammatory disease of the colon and rectum with increasing incidence and prevalence worldwide.1 Several treatment options are available for UC, based on disease activity, severity and prior response to medical treatments.2 In patients with UC, disease activity and severity can be assessed using inflammatory markers, clinical symptom scores, endoscopic inflammation scores and histological scoring systems.3–7 One of the main goals of therapy in patients with UC is to achieve ‘mucosal healing’, which has been shown to be associated with decreased rates of steroid use, hospitalization and colectomy, and improved quality of life.8 The status of mucosal inflammation during colonoscopy can be reported with scoring systems such as the Mayo Endoscopic Subscore [MES] and the Ulcerative Colitis Endoscopic Index of Severity [UCEIS].9,10 An MES of 0–1 has been reported to be associated with improved rates of clinical remission, while the UCEIS has been shown to be a more accurate reflection of UC severity and clinical remission, and of short- and long-term clinical outcomes: clinical remission [UCEIS 0–1], mild [UCEIS 2–4], moderate [UCEIS 5–6] and severe [UCEIS 7–8].11–14
While disease severity scoring systems are established, inter-observer variability and lack of clinical validation remain two important limitations of endoscopic assessment and scoring of disease severity in patients with UC.10 To overcome these limitations and improve inter-observer agreement, central reading by clinically blinded, off-site, experienced endoscopists [‘Central Readers’, CRs] has been used as a crucial component of UC clinical trials.15 Recently, artificial intelligence [AI] has been utilized to enhance the interpretation of endoscopic images to assess disease severity in patients with UC and to strive to reduce the delay and cost associated with central reading.16,17 Studies have shown encouraging results in the application of deep learning [DL] models in the UC diagnostic paradigm to improve disease activity scoring, especially in central reading for clinical trials using the MES.16–20 Stidham et al. built a DL model which successfully distinguished between active disease [MES 2–3] and endoscopic remission [MES 0–1] from colonoscopy videos and were able to identify the exact Mayo Endoscopic Subscores with accuracy comparable to three experienced human reviewers.18 However, the UCEIS scoring system, which was designed to minimize inter-observer variability, may allow for training models with superior assessment and scoring ability if all the features of the UCEIS score are used, rather than just binary distinctions between active disease and endoscopic remission. Therefore, we developed a DL model to improve, accelerate and automate UC disease detection. Specifically, we trained several convolutional neural network [CNN] models to pre-process endoscopic recordings with the final goal of assessing the MES, UCEIS and its three descriptor indices against video sections. Sections are automatically generated, continuous intervals of recording depicting a stable, observable disease state. A user-friendly and interactive graphical user interface [GUI] was designed to show the results of the various CNN models to help experts efficiently assign UC disease activity.
2. Materials and Methods
2.1. General
The aim of the models developed is to predict the MES, UCEIS and its three descriptor indices from the reviewed sections.
2.2. Data collection
A total of 134 unaltered and de-identified colonoscopy and sigmoidoscopy videos of UC patients were collected with Olympus HDR-60 scopes [190 and 180 Series] between October 2020 and April 2021 at the Asian Institute of Gastroenterology [AIG Hospitals]. Institutional review board approval was obtained for this study [AIGAEC-BH&R 08/10/2020-03]. These videos were encoded with the YUV420p pixel format and were de-interlaced at 25 frames per second, resulting in a total dataset of 1 550 030 frames. In addition, 100 videos were collected for the purpose of GUI evaluation. We carried out various simulations to estimate the required sample size for our model’s accuracy. In brief, we generated 10 000 samples of random discrete uniformly distributed UCEIS scores with independent random normally distributed errors to simulate the model’s error.
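A minimal sketch of this kind of simulation is shown below; the 0.5-point error standard deviation and the reported statistic are hypothetical illustrative choices, as the text does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

n_samples = 10_000
# Discrete uniformly distributed UCEIS scores on the 0-8 scale
true_scores = rng.integers(low=0, high=9, size=n_samples)
# Independent normally distributed errors simulating the model's error;
# the 0.5-point standard deviation is a hypothetical value
predicted = true_scores + rng.normal(loc=0.0, scale=0.5, size=n_samples)

# Watch how the error estimate stabilizes as the sample size grows
for n in (100, 1_000, 10_000):
    mae = np.mean(np.abs(predicted[:n] - true_scores[:n]))
    print(f"n={n:>6}  MAE={mae:.3f}")
```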
2.3. First step: video quality assessment
Poor-quality videos, i.e. those with poor bowel preparation, ex vivo footage, out-of-focus segments or almost no visible mucosa, were filtered out in this step.
2.4. Second step: pre-processing pipeline application
The pre-processing pipeline consisted of four sub-steps, as described in Figure 1. With the purpose of keeping only white-light frames with normal magnification, a pixel-colour heuristic identifying image enhancement with blue light imaging was applied to all frames.
Figure 1. Our fully automated ulcerative colitis detection and scoring decision support methodology.
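The paper does not detail the pixel-colour heuristic, but a minimal sketch of one plausible variant follows; the channel-ratio rule and the 1.4 threshold are assumptions for illustration.

```python
import numpy as np

def is_blue_light(frame: np.ndarray, blue_ratio: float = 1.4) -> bool:
    """Flag a frame as blue-light enhanced when its mean blue channel clearly
    dominates the mean red channel. `frame` is an RGB array of shape
    [height, width, 3]; the 1.4 ratio is a hypothetical threshold."""
    mean_r, _, mean_b = frame.reshape(-1, 3).mean(axis=0)
    return mean_b > blue_ratio * max(mean_r, 1e-6)

# White-light frames are those NOT flagged by the heuristic:
# white_light_frames = [f for f in frames if not is_blue_light(f)]
```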
Scorable frames were defined as those that readily allow the assessment of UC activity. Non-scorable frames were considered challenging for scoring UC due to the presence of faeces and/or a water jet, a visible biopsy tool, or post-biopsy bleeding [not to be confused with blood from the disease itself]. They can also include ex vivo and out-of-focus [shadowed, too close to the mucosa or blurred] frames. For this task, we employed a CNN which outputs the probability of a frame being scorable, and we kept the frames that met a given threshold.
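Filtering on the CNN's output probability might look like the sketch below; the model file name and the 0.5 cut-off are placeholders, as the paper does not report the trained network or its threshold.

```python
import numpy as np
import tensorflow as tf

# Placeholder for the trained scorability CNN described above; the paper's
# architecture and weights are proprietary, so the file name is hypothetical.
scorability_model = tf.keras.models.load_model("scorability_cnn.keras")

def keep_scorable(frames: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Keep frames whose predicted probability of being scorable meets the
    threshold; 0.5 is an assumed cut-off."""
    probs = scorability_model.predict(frames, verbose=0).ravel()
    return frames[probs >= threshold]
```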
The MES, UCEIS and the three UCEIS descriptors [Vascular Pattern, Bleeding, and Erosions and Ulceration] were then assigned, with a dedicated CNN, to the scorable frames detected ‘as is’ by the scorability assignment model. The predicted scorable images were used to create continuous stable disease state sections in the next step of the workflow.
2.5. Third step: section generation
To mimic the performance of experts in the reviewing process for UC disease assessment, we broke endoscopy videos down into short sections of continuous frames representing stable disease states, in order to score coherent parts of the videos. Decomposition into such sections was motivated by two goals: stabilizing the endoscopic video review process by expert readers, and developing an efficient hierarchical in-house labelling system utilizing the expertise of an internal specialist team. This team consisted of one global central reading expert [S.P.T., the clinician who first described the UCEIS scoring system; gold standard], six gastrointestinal [GI] specialists [silver standard] and 20 GI trainees [bronze standard]. We developed an algorithm whose inputs were the frames assigned a score in the last step of the preprocessing pipeline, and whose outputs were sections varying from 3 to 20 s. The algorithm was defined so that it created coherent and sufficiently long sections by using consecutive frames with continuous scores that were smoothed to avoid outliers. Sections were created from all the scored frames, whether they were scorable or non-scorable. The sections generated by the algorithm were short sections of continuous and mainly scorable frames representing stable disease states, as explained above. In addition, the section creation process was developed as an offline process to be executed on recorded videos.
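The exact section-generation algorithm is proprietary; the sketch below only illustrates the general idea of smoothing frame-level scores and grouping consecutive near-constant runs into 3–20 s sections. The median window size and the 0.5-point jump tolerance are assumptions.

```python
import numpy as np
from scipy.ndimage import median_filter

FPS = 25  # videos were de-interlaced at 25 frames per second

def make_sections(frame_scores, min_s=3, max_s=20, jump=0.5):
    """Return (start, end) frame index pairs for sections of 3-20 s built
    from consecutive frames whose smoothed scores stay near-constant.
    Window size and jump tolerance are hypothetical choices."""
    smoothed = median_filter(np.asarray(frame_scores, dtype=float), size=FPS)
    sections, start = [], 0
    for i in range(1, len(smoothed) + 1):
        at_end = i == len(smoothed)
        score_jump = not at_end and abs(smoothed[i] - smoothed[i - 1]) > jump
        too_long = (i - start) >= max_s * FPS
        if at_end or score_jump or too_long:
            if (i - start) >= min_s * FPS:  # keep only sections >= 3 s
                sections.append((start, i))
            start = i
    return sections
```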
2.6. Fourth step: utilizing a GUI
The previous steps of the DL approach enabled the development of a GUI displaying videos and their respective created sections for review. The web-based interface was built with sequential, ordered steps to optimize the endoscopic video review workflow, as described in Figure 2.
Reviewers can first access the tool whenever endoscopic videos from patients with UC need to be reviewed [step 1 in Figure 2]. Frames from videos under review go through the automated preprocessing pipeline and are classified as either non-scorable or scorable. Scorable frames undergo evaluation by the section creation and refinement algorithms, such that newly created and refined sections can be represented along with the non-scorable frames in a timeline under the video in review [step 2 in Figure 2]. For continuous sections representing a stable disease state, gradient colours were used to highlight the severity of the UC. The tool presents the sections in decreasing order, from the highest disease activity section down to the lowest, based on MES and UCEIS scores, so that readers save time by reviewing only the relevant sections of the video to confirm the score assigned to each section and to the whole recording [step 3 in Figure 2]. If needed, users can tag specific features in the reviewed video, such as colonoscope trauma, biopsy blood and poor bowel preparation, all of which can be utilized later to optimize the preprocessing pipeline [step 4 in Figure 2]. Once the user scores a section that is at least 2 points higher than any of the remaining AI-scored sections, the video is assigned the highest UCEIS score and MES score [step 5 in Figure 2]. Although not shown in Figure 2, the highest MES is also displayable by configuration. The result is a live, simple, interactive and user-friendly application that can be used to improve the review workflow, speeding up the reading process while improving the accuracy of UC disease activity assessment.
The GUI was used to obtain high-quality labels at section-level. The tool was thus used by GI specialists [gold and silver standards] to review each generated section and either confirm or refute the estimated MES, UCEIS and the three UCEIS descriptors as required. At least two reviewers scored every section used in the training phase. Utilization of the sections built from raw labels and reviewed by medical experts resulted in a faster review process and a large quantity of high-quality ground truth labels. The latter were used to train the section-based severity assignment model in the final step of the approach. Of note, the GUI can also be used as a review tool by CRs to automatically characterize UC disease activity in endoscopic videos.
2.7. Fifth step: disease severity assessment
2.7.1. Dataset creation
In the final step of the DL workflow, we trained a CNN referred to as ‘Section-based Disease Assessment [SDA]’. The data used to develop the model contained frames from each section that was assigned an MES score, or UCEIS and its three descriptors. The scores of a section were assigned to all the frames which made up that section. The dataset used in this final step of the workflow was split at a video-level into training, validation and test sets in a 60–20–20% distribution, with no overlap of videos used in these three sets.
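A video-level split of this kind might be implemented as below; the 60–20–20 proportions follow the text, while the seed and the function shape are illustrative.

```python
import random

def split_by_video(video_ids, train_frac=0.6, val_frac=0.2, seed=42):
    """Split videos into train/validation/test sets in a 60-20-20
    distribution so that no video contributes frames to more than one set.
    The seed is an arbitrary choice."""
    ids = sorted(set(video_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(train_frac * len(ids))
    n_val = round(val_frac * len(ids))
    return (set(ids[:n_train]),                 # training videos
            set(ids[n_train:n_train + n_val]),  # validation videos
            set(ids[n_train + n_val:]))         # test videos

# Each frame is then routed to the set of its parent video, so the three
# sets share no videos, and hence no near-duplicate frames.
```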
2.7.2. Model generation
We developed a CNN model that takes frames from reviewed sections as input and outputs continuous scores for MES, UCEIS and its three descriptors, also at frame-level. The objective was to provide a precise score to help UC disease severity assessment as well as to provide granular reporting of results. The CNN-based model is an EfficientNetB3 architecture with weights pre-trained on ImageNet. We appended a global average pooling layer and dense layers to this network. The output layer contained five separate dense layers predicting continuous scores at frame-level according to their respective scales: MES, aggregate UCEIS and the individual UCEIS descriptors [Vascular Pattern, Bleeding, and Erosions and Ulceration]. Thus, for each input frame, the model predicted five scores. The model’s high-level architecture is shown in Figure 3.
Figure 3. High-level architecture of the SDA CNN model predicting MES, UCEIS and its descriptors. SDA, Section-based Disease Assessment; CNN, convolutional neural network; MES, Mayo Endoscopic Subscore; UCEIS, Ulcerative Colitis Endoscopic Index of Severity.
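In Keras terms, the described architecture might be sketched as follows; the input resolution, hidden width, dropout rate and activation are assumptions, since the paper reports only the backbone, the pooling layer and the five regression heads.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_sda_model(input_shape=(300, 300, 3), hidden_units=256):
    """EfficientNetB3 backbone with ImageNet weights, global average pooling,
    dense layers and five frame-level regression heads. Input size, hidden
    width, dropout and activation are hypothetical choices."""
    backbone = tf.keras.applications.EfficientNetB3(
        include_top=False, weights="imagenet", input_shape=input_shape)
    x = layers.GlobalAveragePooling2D()(backbone.output)
    x = layers.Dense(hidden_units, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    # One continuous output per score: MES, aggregate UCEIS and the
    # three UCEIS descriptors
    heads = {name: layers.Dense(1, name=name)(x)
             for name in ("mes", "uceis", "vascular_pattern",
                          "bleeding", "erosions_ulceration")}
    return tf.keras.Model(inputs=backbone.input, outputs=heads)
```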
2.7.3. Model training
Various iterations of the architecture were evaluated by varying the random seed and the hyperparameters such as the loss function, dropout rate, learning rate, number of epochs and optimizer to both ensure high precision results and prevent overfitting. In addition to the hyperparameter search, an architecture search was also performed, exploring a variety of CNN architectures and dense layer configurations. Note that while all models were assessed on a validation set, all results shown in this paper are based on a separate hold-out/test set, unseen during training or validation. Model generation and training experiments were performed using TensorFlow.
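A compile-and-fit sketch is given below, reusing `build_sda_model` from the previous sketch; the optimizer, loss, learning rate and epoch count are placeholders, since the paper describes searching over these hyperparameters rather than reporting final values.

```python
# train_ds / val_ds are assumed tf.data datasets yielding
# (image, {"mes": ..., "uceis": ..., "vascular_pattern": ...,
#          "bleeding": ..., "erosions_ulceration": ...}) pairs.
model = build_sda_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # hypothetical
    loss="mae",  # a plausible regression loss; the paper's choice is not stated
)
model.fit(train_ds, validation_data=val_ds, epochs=20)  # epoch count assumed
```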
Throughout the DL workflow, we developed an iterative approach to improve the section definition and thus the model’s predictions. With the SDA CNN’s outputs based on the first labels from the bronze standard, we updated the heuristic that defined section boundaries in videos. These updated section definitions had greater autocorrelation, resulting in more consistent and accurate section-level reads. Once these new sections were reviewed by the silver standard team and evaluated against the gold standard resource, an improved model was trained based on the new sections, allowing for an iterative approach to improving section scores. Through the iterations between section creation, section labelling and model training, we kept only the sections where the reviewers’ scores for MES, UCEIS and its three descriptors were less than half a point apart, so that our ground truth was as accurate as possible. The degree of discrepancy between the MES and UCEIS for a given section was negligible, as the medical expert reviewers were consistent in their assessment.
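The agreement filter might be expressed as below; the score keys and the two-reviewer shape are illustrative, while the half-point tolerance comes from the text.

```python
SCORE_KEYS = ("mes", "uceis", "vascular_pattern", "bleeding", "erosions_ulceration")

def reviewers_agree(scores_a: dict, scores_b: dict, tol: float = 0.5) -> bool:
    """True when two reviewers' scores are less than half a point apart on
    every scale, the retention criterion described above."""
    return all(abs(scores_a[k] - scores_b[k]) < tol for k in SCORE_KEYS)

# kept_sections = [s for s in sections if reviewers_agree(s.review_1, s.review_2)]
```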
2.7.4. Performance metrics
We assessed the model performance on MES, UCEIS and the UCEIS descriptors at section- and video-level. To obtain section-level scores from frame-level predictions, we computed the 83rd percentile of each score over the frames belonging to a section; this percentile was chosen because it optimized the inter-observer agreement, as measured by quadratic weighted kappa [QWK], on the validation set. To infer video-level predictions from section-level predictions, we calculated the maximum of each score over the sections belonging to a video. We assessed the performance with the metrics described below.
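The two aggregation rules are straightforward; a minimal sketch:

```python
import numpy as np

def section_score(frame_scores):
    # 83rd percentile over a section's frames, chosen on the validation set
    return np.percentile(frame_scores, 83)

def video_score(section_scores):
    # maximum over a video's sections
    return max(section_scores)
```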
We considered the mean absolute error [MAE], a well-known measure of accuracy for regression problems, defined as the mean absolute difference between the model’s inferences and the ground truth for continuous results. The bias was then used to evaluate the direction of the performance error; it is defined as the mean difference between the score predicted by the model and the true label. These metrics were assessed mainly to evaluate and compare the performance of our different regression model iterations.
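In symbols, with ŷ_i the model's prediction and y_i the ground-truth label for the i-th of n items:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert \hat{y}_i - y_i \rvert,
\qquad
\mathrm{Bias} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)
```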
Although we formulated the problem as a regression task, we proceeded to analyse the results as a classification problem, because scoring conventions use discrete MES and UCEIS values. To do so, continuous scores were rounded so they could be compared appropriately to the ground truth. The QWK metric was identified as the primary evaluation index, as it is particularly suited to classification tasks. This variant of the weighted kappa provides the degree of agreement between two raters, thus measuring inter-observer variability; it strongly penalizes large errors by giving them a bigger weight than small errors [i.e. predictions closer to the ground truth]. We also used multiple typical classification metrics such as area under the ROC curve [AUROC], accuracy, sensitivity, specificity, positive predictive value [PPV] and negative predictive value [NPV]. Finally, we considered two binary classification tasks to compare with existing results: the first compared MES 0–1 against MES 2–3, and the second UCEIS ≤ 3 against UCEIS > 3. To evaluate the quality of the GUI, qualitative feedback was collected from four clinical specialists, and statistical analysis of the amount of time spent reviewing videos and their respective sections was performed.
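The round-then-compare QWK evaluation can be reproduced with scikit-learn; the toy labels below are illustrative only, not study data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Illustrative MES labels and continuous model outputs (not study data)
y_true = np.array([0, 1, 2, 3, 2, 1, 0, 3])
y_pred_continuous = np.array([0.2, 1.4, 2.1, 2.6, 1.8, 0.9, 0.3, 3.2])

# Round continuous scores onto the discrete 0-3 scale before comparison
y_pred = np.clip(np.rint(y_pred_continuous), 0, 3).astype(int)

qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"QWK = {qwk:.3f}")
```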
3. Results
3.1. Data summary
The dataset used in this work contained 134 videos, accounting for 1 550 030 frames. We describe in Table 1 the breakdown of each step and the resulting associated data.
| Step | Sub-step | Data |
|---|---|---|
| Video Quality Assessment | | 134 high-quality videos; 1 550 030 frames |
| Preprocessing Pipeline Application | Blue light identifier | 1 176 441 white-light frames |
| | Scorability assignment model | 582 448 scorable frames; 593 993 non-scorable frames |
| | Biopsy procedure and ex vivo detector | 22 543 biopsy procedure frames; 66 910 ex vivo frames |
| | Frame-based disease severity assessment model | 582 448 scorable frames predicted |
| Section Generation | | 2630 scorable sections; 386 432 scorable frames* |
| Graphical User Interface Leverage | | |
| Disease Severity Assessment [SDA] | | 2630 reviewed sections; 386 432 scorable frames |

*Not all scorable frames were used to create sections.
The final dataset, obtained after the fifth step and used to train the SDA model, was partitioned as follows: 126 320 [33%], 33 742 [9%], 84 937 [22%] and 141 433 [36%] frames for an MES score of 0, 1, 2 and 3, respectively; 126 878 [33%], 71 098 [18%] and 188 456 [49%] for a Vascular Pattern of 0, 1 and 2, respectively; 151 993 [40%], 190 624 [49%], 38 349 [10%] and 5466 [1%] for Bleeding of 0, 1, 2 and 3, respectively; and 159 780 [41%], 84 411 [22%], 81 986 [21%] and 60 255 [16%] for Erosions and Ulceration of 0, 1, 2 and 3, respectively [Figure 4]. All reviewed sections were kept in the final dataset, as 100% of the labels for MES, UCEIS and its three descriptors were identically identified by the reviewers.
Figure 4. Distribution of MES and UCEIS descriptors for each frame within reviewed sections. MES, Mayo Endoscopic Subscore; UCEIS, Ulcerative Colitis Endoscopic Index of Severity.
3.2. Model performance compared to expert labels
We provide in Table 2 the performance of the model according to the MAE and Bias metrics. Overall, the model performed well at both section- and video-level. The MAE and Bias were relatively low considering the magnitude of the scoring scales, especially for the UCEIS. In fact, at both section- and video-level, for the MES and the individual UCEIS subscores, SDA predictions were equal to or less than half a point away from the true value, and the model’s predictions for UCEIS were less than a point away from the ground truth. The Bias for the MES, UCEIS and its individual subscores at both section- and video-level was slightly positive, except for the Vascular Pattern descriptor at section-level. The results presented in this paper are extracted from the test dataset, and several techniques were used during training to prevent overfitting of the model and to remove as much noise as possible from the datasets.
| Score | Section-level MAE | Section-level Bias | Video-level MAE | Video-level Bias |
|---|---|---|---|---|
| MES | 0.32 | 0.05 | 0.19 | 0.19 |
| UCEIS | 0.65 | 0.07 | 0.94 | 0.44 |
| Vascular Pattern | 0.20 | −0.01 | 0.06 | 0.06 |
| Bleeding | 0.24 | 0.01 | 0.44 | 0.06 |
| Erosions and Ulcers | 0.36 | 0.10 | 0.50 | 0.12 |

MES scale [0–3]; UCEIS scale [0–8]; Vascular Pattern scale [0–2]; Bleeding scale [0–3]; Erosions and Ulcers scale [0–3]. We do not provide results at frame-level because the ground truth is the score at section-level, projected down to frame-level to train the model; there is hence inherent error in the frame-level truth that would be included in the performance results. MAE, mean absolute error; SDA, Section-based Disease Assessment; MES, Mayo Endoscopic Subscore; UCEIS, Ulcerative Colitis Endoscopic Index of Severity.
While the MAE is an appropriate metric to evaluate and compare regression models, the QWK metric is more suitable for classification tasks and for comparing our best model to the GI experts’ labels. According to the scientific literature, a QWK between 0.61 and 0.80 is considered substantial, while a QWK above 0.80 indicates almost perfect agreement. Table 3 shows the inter-observer agreement between expert endoscopists and the SDA model at section- and video-level using the QWK metric. The model’s predictions at section-level were excellent, with a QWK over 0.8 except for the Bleeding descriptor. At video-level, the model’s performance was good, with a QWK over 0.6 except for the Bleeding descriptor. Note that the QWK at video-level for UCEIS is relatively low compared to MES. This can be explained by the quadratic progression of the QWK penalty, which over-penalizes a score with more levels; the two results are therefore not directly comparable due to the different number of levels.
| Score | Section-level | Video-level |
|---|---|---|
| Mayo Endoscopic Subscore | 0.886 | 0.821 |
| UCEIS | 0.904 | 0.646 |
| Vascular Pattern | 0.905 | 0.879 |
| Bleeding | 0.754 | 0.391 |
| Erosions and Ulcers | 0.800 | 0.600 |

QWK, quadratic weighted kappa; UCEIS, Ulcerative Colitis Endoscopic Index of Severity.
Model results are also presented at severity-level for both MES [Supplementary Table 1] and UCEIS [Supplementary Table 2] using classification metrics that included specificity, sensitivity, NPV, PPV and area under the curve [AUC]. The MES model performed best for severity-levels 0 and 3, with specificity of 94.60 and 87.90%, sensitivity of 85.71 and 69.14%, NPV of 92.00 and 87.70%, and PPV of 90.14 and 69.54%, respectively. This was aligned with the inter-reviewer variability of section scores, which was low for all of MES, UCEIS and its three descriptors, but especially for the extremity levels such as MES 0 and 3. Variability was lower with the silver and gold standard teams, which were used to score the sections composing our final dataset. Because of our iterative approach of section creation, section labelling and model training, we were able to keep variability between our labellers low.
Note that, as shown in these two tables, section-level results for the MES and UCEIS levels with less data [imbalanced classes] were similar to those for levels with more instances. The AUC, which quantifies how well the model distinguishes between classes, for MES 1 [0.715], which represents only 9% of all MES instances, was similar to that for the other MES levels [AUC of 0.902, 0.680 and 0.785 for MES 0, 2 and 3, respectively]. Similarly, the AUC for UCEIS scores of 2, 3 and 7 [0.481, 0.669, 0.520] is close to that for UCEIS scores of 0, 1, 4, 5 and 6 [0.910, 0.653, 0.601, 0.688, 0.555].
Confusion matrices for MES and UCEIS at section-level are shown in Supplementary Figures 1 and 2. Accuracies were 69.00 and 54.80% for MES and UCEIS, respectively. Additionally, we computed the accuracy at ±1 severity-level for UCEIS to determine the degree of disagreement between the model’s prediction and the ground truth. The ±1 accuracy was 87.4%, indicating a low error amplitude. As described in the Performance Metrics subsection, we also considered two binary classification tasks: MES 0–1 vs MES 2–3 and UCEIS ≤ 3 vs UCEIS > 3. The results are presented in Supplementary Tables 3 and 4 for the two tasks, respectively.
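The ±1 accuracy is simply the fraction of predictions within one severity level of the ground truth, e.g.:

```python
import numpy as np

def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions within ±1 severity level of the ground truth."""
    diff = np.abs(np.asarray(y_pred) - np.asarray(y_true))
    return float(np.mean(diff <= 1))
```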
3.3. GUI evaluation
In a preliminary review of the system by our four clinical experts [R.P., J.E.E., S.P.T., M.I.], we received the following feedback regarding the utilization of our tool to assess UC severity from endoscopic videos. Our tool allowed the user to focus on relevant sections instead of the whole video, reducing to roughly a quarter the number of frames that a user needs to review: out of the 1 550 030 total frames, 386 432 were kept in the created sections reviewed by the GI specialists. It also improved the quality and consistency of scores, since all users review the same segments, allowing the multiple labels to be used iteratively to re-train and improve the underlying DL models.
The detection and scoring accuracy for UC disease using our GUI was evaluated with an additional 100 videos. These were processed through each step of our DL methodology, up to the section generation algorithm. On average, 30 sections per recording were created. Each generated section was reviewed and assessed for MES, UCEIS and its three descriptors by two of our clinical experts [S.P.T., M.I.]. Similar to the approach described above, we kept the maximum of each score over the sections of a video to infer the score from section- to video-level. In total, 77% of the recordings reviewed by our experts were attributed the same score for MES, UCEIS and its three descriptors as the ground truth [S.P.T.], and 92% of the videos were labelled with MES, UCEIS and its three descriptor scores within half a point of the ground truth. The accuracy of the CR compared to our GUI supported by the various CNN models is similar to the performance obtained in the Results section when comparing our models to the true label: the MAE and bias were 0.35 and 0.06 for MES, and 0.69 and 0.08 for UCEIS, respectively. In addition, our GUI resulted in faster video quality assessment and section scoring by GI specialists. Per our expert labellers, reviewing each video without GUI assistance takes 10–15 min, whereas with the GUI the average time to review a section and a video was 26 s and 8 min, respectively. The 8 min figure reflects review of the entire video, done only for the purpose of this study; in research and clinical applications, as the GUI takes readers directly to the pertinent sections that need to be evaluated, we expect the actual time spent with the GUI to be less than 8 min. A short video demonstration of our GUI and AI model in action is provided [Supplementary Video 1].
4. Discussion
The endoscopic scoring of UC disease activity with MES and UCEIS has traditionally been challenging due to the lack of clinical validation and to disagreement on repeated observations. CRs, who are clinically blinded expert endoscopists, have been utilized in UC clinical trials in an attempt to standardize the endoscopic assessment of UC.21 CR validation in the endoscopic scoring of UC is inherently limited by the lack of a true gold standard [i.e. biopsy], and therefore any measure of accuracy may be impacted by the quality of CRs, algorithm performance or inherent problems with the MES and UCEIS scoring systems. Previous studies have reported the application of DL to the analysis of large endoscopy image datasets in order to improve and standardize UC disease severity grading. Ozawa et al. developed a DL model based on a GoogLeNet architecture to identify MES 0 and mucosal healing [score 0–1] in an independent test set of 3981 images from 114 UC patients, with reported AUCs of 0.86 and 0.98, respectively.16 Stidham et al. focused on the binary classification task of MES 0–1 vs MES 2–3, and reported a sensitivity, specificity, PPV, NPV, accuracy and QWK of 93%, 87%, 84%, 94%, 90% and 0.79, respectively.18 We also compared MES 0–1 against MES 2–3 and obtained better results on all of these metrics, with sensitivity, specificity, PPV, NPV, accuracy and QWK of 96%, 91%, 91%, 96%, 94% and 0.87, respectively. Takenaka et al. developed a CNN model to differentiate remission [UCEIS ≤ 3] from moderate-to-severe disease [UCEIS > 3], and reported excellent reproducibility for their model, with sensitivity, specificity, PPV, NPV and AUROC of 83%, 96%, 87%, 94% and 0.966, respectively.17 Our results on a similar task are 93%, 93%, 92%, 94% and 0.936, respectively. Yao et al. evaluated their CNN-based video analysis model on 264 videos and reported 83.7% accuracy for differentiating remission from active disease, with an AUC of 0.93, an average F1 score of 0.77 and a positive level of agreement with gastroenterologist scoring [kappa = 0.84].19 Gottlieb et al. performed a randomized controlled trial to evaluate their CNN model in the assessment of mucosal inflammation according to MES and UCEIS. Their model’s overall performance on the primary objective metric showed almost perfect agreement, with QWK of 0.84 for MES and 0.85 for UCEIS.20 We have slightly better results on those same metrics, albeit at section-level, as shown in Table 3 [0.886 for MES and 0.904 for UCEIS]. Their model’s score-level accuracy on video was 70.2% for MES compared to 69.0% in our work, and 45.5% for UCEIS compared to 54.8% in our work.
While previous AI work in the field can score UC activity at the frame- or video-level for either UCEIS or MES, we present the first fully automated DL model for scoring disease activity under both the MES and UCEIS, at frame-, section- and video-levels, with an architecture that can accommodate other scoring systems such as the Paddington International virtual ChromoendoScopy ScOre [PICaSSO].22 We are also the first to describe an AI model that is ready for clinical evaluation, both in terms of robustness and in relation to usability and fit with current workflows. This has been a considerable challenge for AI tools in endoscopy: that they do not hinder the physician, but rather add true assistance and benefit. We have dedicated some of our efforts to building this solution with the practical usability of the tool very much at the forefront of our thinking. Our system improves the accuracy of the MES and UCEIS scores, reduces the time between video collection and review, and improves subsequent quality assurance and scoring. Overall, our model performed well, as MAE and mean bias at both section- and video-level were relatively close to the ground truth considering the magnitude of the scoring scales, especially for the UCEIS. In our investigation, the QWK was used to compare the inter-observer agreement between CR labels and the AI model’s predictions; the results were excellent at section-level for both MES and UCEIS, with QWKs of 0.886 and 0.904, respectively.
One of the limitations of our study is a limited dataset with imbalanced classes. The generalizability of our model is also limited. However, to improve this, we have started validating our results by testing and training on different datasets with various endoscopy sites, equipment and recording techniques. Another limitation is the great difficulty in describing a fair comparison with the literature due to the lack of an open dataset. Moreover, the ground truth for MES and UCEIS might differ between experts and there is a potential subjectivity for the labellers when reviewing AI-generated sections with our GUI. In addition, our model is an offline tool that was designed to work on pre-recorded videos of UC patients. Finally, our model has not yet undergone formal prospective independent testing, but our plan is to perform this in the near future.
Overall, our results enable the development of a model that can be used to improve the efficiency and accuracy of endoscopic assessment and scoring of UC at different stages of the clinical journey, whether offline or live. It can be used by physicians at site-level for video quality assurance and also by central reading organizations and the pharmaceutical industry to score videos and increase the efficiency of central reading in clinical trials. It will also be usable as a tool for evaluation during live endoscopy where it could serve as an accurate reproducible measurement of endoscopic disease activity. Finally, there is an opportunity for education at the level of the GI trainees to set up training modules.
5. Conclusions
We report a fully automated DL model that improves the accuracy of the MES and UCEIS scores, reduces the time between video collection and review, and improves subsequent quality assurance and scoring. Our model demonstrated relevant feature identification for scoring of disease activity in UC, well aligned with scoring guidelines and the performance of the experts. We present work that builds a frame-level regression scoring system paired with a clustering algorithm and video-level heuristics that score simultaneously under both scoring modalities. Going forward, we aim to continue developing our detection and scoring systems in order to produce a system that can score at a superhuman level and with greater precision than current scoring modalities. More data, in terms of volume and diversity, are being collected and analysed to drive towards a final product ready for clinical use. Moreover, we are also performing a more formal evaluation of the usability of the GUI described in this study, so that we can offer a tool that truly saves time and improves user satisfaction.
Funding
J.E. and S.P.T. are funded by the National Institute for Health Research [NIHR] Oxford Biomedical Research Centre. The views expressed are those of the authors and not necessarily those of the National Health Service, the NIHR or the Department of Health.
Conflict of Interest
M.F.B.: CEO, Founder and shareholder in Satisfai Health Inc. J.E.E.: Clinical advisory board and shareholder in Satisfai Health Inc.; speaker fees from Falk. R.P.: Clinical advisory board and shareholder in Satisfai Health Inc. M.L.H.: Shareholder in Satisfai Health Inc. G.L.: Employee, IVADO labs. L.S-D.: Employee, IVADO labs. P.L.: Employee, IVADO labs. S.N.: Employee, IVADO labs. J.A.: Employee, IVADO labs. R.M.: Shareholder in Satisfai Health Inc. E.D.C.: Employee, IVADO labs. F.S.: Employee, IVADO labs. S.P.T.: Consultant for Satisfai Health Inc. N.P.: Clinical advisory board and shareholder in Satisfai Health Inc. A.P.S., R.K., S.K.B., D.N.R. and H.R.R. have no financial relationships to disclose. M.I. has research grants from Pentax, Olympus and Fuji.
Author Contributions
M.F.B. [Conceptualization: Lead; Investigation: Lead; Supervision: Lead; Writing—original draft: Lead; Writing—review & editing: Equal]. R.P. [Conceptualization: Lead; Investigation: Lead; Writing—original draft: Supporting; Writing—review & editing: Equal]. J.E.E. [Conceptualization: Lead; Investigation: Lead; Writing—original draft: Supporting; Writing—review & editing: Equal]. M.I. [Conceptualization: Supporting; Data curation: Equal; Writing—review & editing: Equal]. N.P. [Writing—original draft: Lead; Writing—review & editing: Equal]. R.K. [Data curation: Equal; Writing—review & editing: Equal]. N.R.D. [Data curation: Equal; Writing—review & editing: Equal]. H.R. [Data curation: Equal; Writing—review & editing: Equal]. A.P.S. [Data curation: Equal; Writing—review & editing: Equal]. S.B. [Writing—original draft: Lead; Writing—review & editing: Equal]. R.M. [Conceptualization: Supporting; Writing—original draft: Lead; Writing—review & editing: Equal]. F.S. [Conceptualization: Lead; Investigation: Lead; Methodology: Lead; Supervision: Lead; Writing—original draft: Lead; Writing—review & editing: Equal]. G.L. [Methodology: Supporting; Writing—original draft: Lead; Writing—review & editing: Equal]. E.D.C. [Conceptualization: Supporting; Writing—review & editing: Equal]. L.S-D. [Conceptualization: Supporting; Methodology: Supporting; Writing—original draft: Lead; Writing—review & editing: Equal]. P.L. [Methodology: Supporting; Writing—original draft: Lead; Writing—review & editing: Equal]. S.N. [Methodology: Supporting; Writing—original draft: Lead; Writing—review & editing: Equal]. J.A. [Methodology: Supporting; Writing—original draft: Lead; Writing—review & editing: Equal]. M.L.H. [Data curation: Equal; Writing—original draft: Lead; Writing—review & editing: Equal]. S.P.T. [Investigation: Lead; Data curation: Equal; Supervision: Lead; Writing—original draft: Supporting; Writing—review & editing: Equal].
Grant support
None.
Data availability statement
Data are available on request: the data underlying this article will be shared upon request to the corresponding author.