Abstract

Background and Aims

Lack of clinical validation and inter-observer variability are two limitations of endoscopic assessment and scoring of disease severity in patients with ulcerative colitis [UC]. We developed a deep learning [DL] model to improve, accelerate and automate UC detection, and predict the Mayo Endoscopic Subscore [MES] and the Ulcerative Colitis Endoscopic Index of Severity [UCEIS].

Methods

A total of 134 prospective videos [1 550 030 frames] were collected, and those with poor quality were excluded. The frames were labelled by experts based on the MES and UCEIS scores. The scored frames were used to create a preprocessing pipeline and train multiple convolutional neural networks [CNNs] with proprietary algorithms to filter, detect and assess all frames. These frames served as the input for the DL model, whose outputs were continuous scores for MES and UCEIS [and its components]. A graphical user interface was developed to support both labelling video sections and displaying the artificial intelligence's predicted disease severity assessment from endoscopic recordings.

Results

Mean absolute error [MAE] and mean bias, used to evaluate the distance of the continuous model's predictions from the ground truth and any tendency to over- or under-predict, were excellent for both MES and UCEIS. The quadratic weighted kappa, used to compare the inter-rater agreement between the experts' labels and the model's predictions, showed strong agreement [0.87 and 0.88 at frame-level, 0.88 and 0.90 at section-level, and 0.90 and 0.78 at video-level, for MES and UCEIS, respectively].

Conclusions

We present the first fully automated tool that improves the accuracy of the MES and UCEIS, reduces the time between video collection and review, and improves subsequent quality assurance and scoring.

What You Need to Know

Background and context: Endoscopic assessment and scoring of disease severity in UC are limited by inter-observer variability and a lack of clinical validation.

New findings: We present the first fully automated AI model for UC disease activity scoring under both the MES and UCEIS, at frame, section and video levels, that is ready for use in clinical practice. Our model improves the accuracy of both scoring systems, reduces the time between video collection and review, and improves subsequent quality assurance and scoring.

Limitations: Limitations include a limited dataset with imbalanced classes, limited generalizability, the difficulty of making a fair comparison with the literature due to the lack of an open dataset, and subjective ground truth for MES and UCEIS, resulting in potential bias for the labellers reviewing AI-generated sections with a GUI.

Impact: Our results enable the development of a model that can be used to improve the efficiency and accuracy of UC endoscopic assessment and scoring at different stages of the clinical journey, such as video quality assurance by physicians, and increase the efficiency of central reading in clinical trials.

1. Introduction

Ulcerative colitis [UC] is a chronic inflammatory disease of the colon and rectum with increasing incidence and prevalence worldwide.1 Several treatment options are available for UC, based on disease activity, severity and prior response to medical treatments.2 In patients with UC, disease activity and severity can be assessed using inflammatory markers, clinical symptom scores, endoscopic inflammation scores and histological scoring systems.3–7 One of the main goals of therapy in patients with UC is to achieve ‘mucosal healing’, which has been shown to be associated with decreased rates of steroid use, hospitalization, colectomy and improved quality of life.8 The status of mucosal inflammation during colonoscopy can be reported with scoring systems such as the Mayo Endoscopic Subscore [MES] and the Ulcerative Colitis Endoscopic Index of Severity [UCEIS].9,10 The MES 0-1 has been reported to be associated with improved rates of clinical remission, while the UCEIS score has been shown to be a more accurate reflection of UC severity and clinical remission, and of the short- and long-term clinical outcomes: clinical remission [UCEIS 0–1], mild [UCEIS 2–4], moderate [UCEIS 5–6] and severe [UCEIS 7–8].11–14

While disease severity scoring systems are established, the presence of inter-observer variability and the lack of clinical validation remain two important limitations of endoscopic assessment and scoring of disease severity in patients with UC.10 To overcome these limitations and improve inter-observer agreement, central reading by clinically blinded off-site experienced endoscopists, ‘Central Readers [CRs]’, has been used as a crucial component in UC clinical trials.15 Recently, artificial intelligence [AI] has been utilized to enhance the interpretation of endoscopic images to assess disease severity in patients with UC and to reduce the delay and cost associated with central reading.16,17 Studies have shown encouraging results in the application of deep learning [DL] models in the UC diagnostic paradigm to improve disease activity scoring, especially in central reading for clinical trials when using the MES.16–20 Stidham et al. built a DL model that successfully distinguished active disease [Mayo 2–3] from endoscopic remission [Mayo 0–1] from colonoscopy videos and identified the exact Mayo Endoscopic Subscores with accuracy comparable to three experienced human reviewers.18 However, the UCEIS scoring system, whose design features aim to minimize inter-observer variability, may allow for training models with superior assessment and scoring ability if all the features of the UCEIS score are used, rather than just binary distinctions between active disease and endoscopic remission. Therefore, we developed a DL model to improve, accelerate and automate UC disease detection. Specifically, we trained several convolutional neural network [CNN] models to pre-process endoscopic recordings with the final goal of assessing the MES, UCEIS and its three descriptor indices against video sections. Sections are automatically generated as continuous intervals of recording depicting a stable, observable disease state. A user-friendly and interactive graphical user interface [GUI] was designed to show the results of the various CNN models and help experts efficiently assign UC disease activity.

2. Materials and Methods

2.1. General

The aim of the models developed is to predict the MES, UCEIS and its three descriptor indices from the reviewed sections.

2.2. Data collection

A total of 134 unaltered and de-identified colonoscopy and sigmoidoscopy videos of UC patients were collected with Olympus HDR-60 scopes [190 and 180 Series] between October 2020 and April 2021 at the Asian Institute of Gastroenterology [AIG Hospitals]. Institutional review board approval was obtained for this study [AIGAEC-BH&R 08/10/2020-03]. These videos were encoded with the YUV420p pixel format and de-interlaced at 25 frames per second, resulting in a total dataset of 1 550 030 frames. In addition, 100 videos were collected for the purpose of GUI evaluation. We carried out various simulations to estimate the sample size required for our model’s accuracy. In brief, we generated 10 000 samples of random discrete uniformly distributed UCEIS scores with independent random normally distributed errors to simulate the model’s error.
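For illustration, a minimal sketch of such a simulation is shown below; the error standard deviation and the summary statistics are illustrative assumptions, not the values used in the study.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_samples = 10_000
sigma = 0.7  # assumed model-error SD; the study's actual value is not reported here

# Random discrete-uniform UCEIS scores on the 0-8 scale
true_scores = rng.integers(0, 9, size=n_samples)
# Independent normally distributed errors simulate the model's predictions
predicted = true_scores + rng.normal(0.0, sigma, size=n_samples)

# Example summary statistics over the simulated cohort
mae = np.mean(np.abs(predicted - true_scores))
within_half_point = np.mean(np.abs(predicted - true_scores) <= 0.5)
print(f"simulated MAE: {mae:.3f}, share within 0.5 points: {within_half_point:.1%}")
```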

2.3. First step: video quality assessment

In this step, poor-quality videos were filtered out: those with poor bowel preparation, ex vivo footage, or out-of-focus recordings with almost no visible mucosa.

2.4. Second step: pre-processing pipeline application

The pre-processing pipeline consisted of four sub-steps, as described in Figure 1. To keep only white-light frames at normal magnification, a pixel-colour heuristic that identifies blue-light image enhancement was applied to all frames.

Figure 1. Our fully automated ulcerative colitis detection and scoring decision support methodology.

Scorable frames were defined as those that readily allow the assessment of UC activity. Non-scorable frames were those challenging to score due to faeces and/or water-jet presence, a visible biopsy tool, or post-biopsy bleeding [not to be confused with blood from the disease itself]; they can also include ex vivo and out-of-focus [shadowed, too close to the mucosa, or blurred] frames. For this task, we employed a CNN that outputs the probability of a frame being scorable, and we kept the frames that met a given threshold.
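A minimal sketch of this thresholding step, assuming a trained Keras classifier [the 0.5 threshold is a placeholder; our operating threshold was tuned and is not stated here]:

```python
import numpy as np
import tensorflow as tf

def filter_scorable(frames: np.ndarray,
                    scorability_model: tf.keras.Model,
                    threshold: float = 0.5) -> np.ndarray:
    """Keep only the frames whose predicted probability of being
    scorable meets the given threshold.

    frames: array of shape [n_frames, height, width, 3].
    """
    probs = scorability_model.predict(frames, verbose=0).ravel()
    return frames[probs >= threshold]
```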

The MES, UCEIS and the three descriptors for UCEIS [Vascular Pattern, Bleeding, and Erosions and Ulceration] were then assigned to the scorable frames detected ‘as is’ by the scorability assignment model, using a dedicated CNN. The predicted scorable images were used to create continuous stable disease state sections in the next step of the workflow.

2.5. Third step: section generation

To mimic the performance of experts in the reviewing process for UC disease assessment, we broke down endoscopy videos into short sections of continuous frames representing stable disease states, to score coherent parts of the videos. Decomposition into such sections was motivated by two goals: the need to stabilize the endoscopic video review process by expert readers, and the development of an efficient hierarchical in-house labelling system utilizing the expertise of an internal specialist team. This team consisted of one global central reading expert [S.P.T., the clinician who first described the UCEIS scoring system, gold standard], six gastrointestinal [GI] specialists [silver standard] and 20 GI trainees [bronze standard]. We developed an algorithm whose inputs were the frames that were assigned a score in the last step of the preprocessing pipeline, and whose outputs were sections varying from 3 to 20 s in length. The algorithm was defined so that it created coherent and sufficiently long sections by using consecutive frames with continuous scores that were smoothed to avoid outliers. Sections were created from all the scored frames, whether they were scorable or non-scorable. The sections generated by the algorithm were short sections of continuous and mainly scorable frames representing stable disease states, as explained in the previous paragraph. In addition, the section creation process was developed as an offline process to be executed on recorded videos.
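The boundary heuristic itself is proprietary; the sketch below only conveys the general idea of smoothing per-frame scores and cutting a section wherever the smoothed disease state drifts. The smoothing window and drift tolerance are assumed placeholder values.

```python
import numpy as np

FPS = 25  # videos were de-interlaced at 25 frames per second

def make_sections(scores, fps=FPS, window=25, tol=0.5,
                  min_len_s=3, max_len_s=20):
    """Group consecutive frames with stable smoothed scores into sections."""
    # Moving-average smoothing to suppress single-frame outliers
    kernel = np.ones(window) / window
    smooth = np.convolve(scores, kernel, mode="same")

    sections, start = [], 0
    for i in range(1, len(smooth) + 1):
        # Close the section when the disease state drifts or max length is hit
        drifted = i < len(smooth) and abs(smooth[i] - smooth[start]) > tol
        too_long = i - start >= max_len_s * fps
        if i == len(smooth) or drifted or too_long:
            if i - start >= min_len_s * fps:  # keep only sections >= 3 s
                sections.append((start, i))
            start = i
    return sections
```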

2.6. Fourth step: utilizing a GUI

The previous steps of the DL approach enabled the development of a GUI displaying videos and their respective created sections for review. The web-based interface was built with sequential, ordered steps to optimize the endoscopic video review workflow, as described in Figure 2.

Figure 2. Overview of the graphical user interface.

Reviewers can first access the tool whenever endoscopic videos from patients with UC need to be reviewed [step 1 in Figure 2]. Frames from videos under review have already gone through the automated preprocessing pipeline and been classified as either non-scorable or scorable. Scorable frames underwent evaluation by the section creation and refinement algorithms, such that newly created and refined sections could be represented, along with the non-scorable frames, in a timeline under the video in review [step 2 in Figure 2]. For continuous sections representing a stable disease state, gradient colours were used to highlight the severity of the UC. The tool presents sections in decreasing order of disease activity based on MES and UCEIS scores, so that readers save time by reviewing only the relevant sections of the video to confirm the score assigned to each section and to the whole recording [step 3 in Figure 2]. If needed, users can tag specific features in the reviewed video, such as colonoscope trauma, biopsy blood and poor bowel preparation, all of which can be utilized later to optimize the preprocessing pipeline [step 4 in Figure 2]. Once the user scores a section at least 2 points higher than any of the remaining AI-scored sections, the video is assigned the highest UCEIS and MES scores [step 5 in Figure 2]. Although not shown in Figure 2, the highest MES is also displayable by configuration. The result is a live, simple, interactive and user-friendly application that can be used to improve the review workflow, speeding up the reading process while improving the accuracy of UC disease activity assessment.

The GUI was used to obtain high-quality labels at section-level. The tool was used by GI specialists [gold and silver standards] to review each generated section and either confirm or refute the estimated MES, UCEIS and the three UCEIS descriptors as required. At least two reviewers scored every section used in the training phase. Utilizing sections built from raw labels and reviewed by medical experts resulted in a faster review process and a large quantity of high-quality ground-truth labels, which were used to train the section-based severity assignment model in the final step of the approach. Of note, the GUI can also be used as a review tool by CRs to automatically characterize UC disease activity in endoscopic videos.

2.7. Fifth step: disease severity assessment

2.7.1. Dataset creation

In the final step of the DL workflow, we trained a CNN referred to as the ‘Section-based Disease Assessment [SDA]’ model. The data used to develop the model contained frames from each section that had been assigned an MES score and a UCEIS score with its three descriptors. The scores of a section were assigned to all the frames that made up that section. The dataset used in this final step of the workflow was split at video-level into training, validation and test sets in a 60–20–20% distribution, with no overlap of videos between the three sets.
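A video-level split of this kind can be expressed with scikit-learn’s GroupShuffleSplit; the sketch below is an assumed implementation, not necessarily the one we used.

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_video(frame_paths, labels, video_ids, seed=0):
    """60-20-20 train/validation/test split with no video in two sets."""
    # First carve out the 20% test videos...
    outer = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=seed)
    train_val_idx, test_idx = next(outer.split(frame_paths, labels, groups=video_ids))
    # ...then split the remaining 80% into 60/20 overall (0.25 of the remainder)
    inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=seed)
    rel_train, rel_val = next(inner.split(
        [frame_paths[i] for i in train_val_idx],
        [labels[i] for i in train_val_idx],
        groups=[video_ids[i] for i in train_val_idx]))
    train_idx = [train_val_idx[i] for i in rel_train]
    val_idx = [train_val_idx[i] for i in rel_val]
    return train_idx, val_idx, test_idx
```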

2.7.2. Model generation

We developed a CNN model taking frames from reviewed sections as input, which then outputs continuous scores for MES, UCEIS and its three descriptors, also at frame-level. The objective was to provide a precise score to help UC disease severity assessment as well as to provide granular reporting of results. The CNN-based model is an EfficientNetB3 architecture with weights pre-trained on ImageNet. We appended a global average pooling layer and dense layers to this network. The output layer contained five separate dense layers predicting continuous scores at frame-level according to their respective scale: MES, aggregate UCEIS and the individual UCEIS descriptors: Vascular Pattern, Bleeding, and Erosions and Ulceration. Thus, for each input frame, the model predicted five scores. The model’s high-level architecture is shown in Figure 3.

Figure 3. High-level architecture of the SDA CNN model predicting MES, UCEIS and its descriptors. SDA, Section-based Disease Assessment; CNN, convolutional neural network; MES, Mayo Endoscopic Subscore; UCEIS, Ulcerative Colitis Endoscopic Index of Severity.
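A minimal Keras sketch of this architecture follows; the hidden-layer width, loss and optimizer are assumptions, as the text specifies only the backbone, the pooling layer, the dense layers and the five continuous outputs.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_sda_model(input_shape=(300, 300, 3)):
    """EfficientNetB3 backbone with five regression heads, one per score."""
    backbone = tf.keras.applications.EfficientNetB3(
        include_top=False, weights="imagenet", input_shape=input_shape)

    inputs = tf.keras.Input(shape=input_shape)
    x = backbone(inputs)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(256, activation="relu")(x)  # assumed hidden width

    # One linear (continuous) output per score, on its native scale
    names = ["mes", "uceis", "vascular_pattern", "bleeding",
             "erosions_ulceration"]
    outputs = [layers.Dense(1, name=n)(x) for n in names]

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mae")  # assumed loss/optimizer
    return model
```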

2.7.3. Model training

Various iterations of the architecture were evaluated by varying the random seed and the hyperparameters such as the loss function, dropout rate, learning rate, number of epochs and optimizer to both ensure high precision results and prevent overfitting. In addition to the hyperparameter search, an architecture search was also performed, exploring a variety of CNN architectures and dense layer configurations. Note that while all models were assessed on a validation set, all results shown in this paper are based on a separate hold-out/test set, unseen during training or validation. Model generation and training experiments were performed using TensorFlow.

Throughout the DL workflow, we developed an iterative approach to improve the section definition and thus the model’s predictions. With the SDA CNN’s outputs based on the first labels from the bronze standard, we updated the heuristic that defined section boundaries in videos. These updated section definitions had greater autocorrelation, resulting in more consistent and accurate section-level reads. Once these new sections were reviewed by the silver standard team and evaluated against the gold standard resource, an improved model was trained on the new sections, allowing for an iterative approach to improving section scores. Through the iterations between section creation, section labelling and model training, we kept only the sections where the scores for MES, UCEIS and its three descriptors differed between reviewers by less than half a point, so that our ground truth was as accurate as possible. The degree of discrepancy between the MES and UCEIS for a given section was negligible, as the medical expert reviewers were consistent in their assessment.

2.7.4. Performance metrics

We assessed the model performance on MES, UCEIS and the UCEIS descriptors at section- and video-level. To obtain section-level scores from frame-level predictions, we computed the 83rd percentile of each score over the frames belonging to a section; this percentile was chosen because it optimized the inter-observer agreement, as measured by the quadratic weighted kappa [QWK], on the validation set. To infer video-level predictions from section-level ones, we took the maximum of each score over the sections belonging to a video. We assessed the performance with the metrics described below.
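A sketch of this two-stage aggregation:

```python
import numpy as np

def section_score(frame_scores, q=83):
    """Aggregate frame-level predictions to one section-level score."""
    return np.percentile(frame_scores, q)

def video_score(section_scores):
    """Video-level score is the maximum over its sections."""
    return max(section_scores)

# Example: three sections of frame-level UCEIS predictions
sections = [[1.2, 1.4, 1.1], [3.9, 4.2, 4.0], [2.1, 2.0, 2.3]]
per_section = [section_score(s) for s in sections]
print(per_section, video_score(per_section))
```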

We considered the mean absolute error [MAE], a well-known measure of accuracy for regression problems, defined as the mean absolute difference between the model inferences and the ground truth for continuous results. The bias was then used to evaluate the direction of the performance error; it is defined as the mean difference between the score predicted by the model and the true label. These metrics were assessed mainly to evaluate and compare the performance of our different regression model iterations.
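In symbols, writing $\hat{y}_i$ for the model’s continuous prediction and $y_i$ for the ground-truth label of observation $i$ among $n$:

\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|,
\qquad
\mathrm{Bias} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)
\]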

Although we formulated the problem as a regression task, we analysed the results as a classification problem because scoring conventions use discrete MES and UCEIS values. To do so, continuous scores were rounded so that they could be appropriately compared to the ground truth. The QWK was chosen as the primary evaluation index as it is particularly suited to classification tasks. This variant of the weighted kappa provides the degree of agreement between two raters, thus measuring inter-observer variability; it penalizes large errors more strongly than small ones [i.e. predictions closer to the ground truth are penalized less]. We also used multiple typical classification metrics, such as area under the ROC curve [AUROC], accuracy, sensitivity, specificity, positive predictive value [PPV] and negative predictive value [NPV]. Finally, we considered two binary classification tasks to compare with existing results: the first compared MES 0–1 against MES 2–3, and the second UCEIS ≤ 3 against UCEIS > 3. To evaluate the quality of the GUI, qualitative feedback was collected from four clinical specialists, and statistical analysis of the time spent reviewing videos and their respective sections was performed.
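The rounding step and the QWK computation map directly onto standard scikit-learn functionality; a sketch:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def qwk(continuous_preds, true_labels):
    """Quadratic weighted kappa between rounded predictions and ground truth."""
    rounded = np.rint(continuous_preds).astype(int)
    return cohen_kappa_score(rounded, true_labels, weights="quadratic")
```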

3. Results

3.1. Data summary

The dataset used in this work contained 134 videos, accounting for 1 550 030 frames. We describe in Table 1 the breakdown of each step and the resulting associated data.

Table 1. Quantity of data after each step of the workflow

Step                                 Sub-step                                          Data
Video Quality Assessment                                                               134 high-quality videos; 1 550 030 frames
Preprocessing Pipeline Application   Blue light identifier                             1 176 441 white-light frames
                                     Scorability assignment model                      582 448 scorable frames; 593 993 non-scorable frames
                                     Biopsy procedure and ex vivo detector             22 543 biopsy procedure frames; 66 910 ex vivo frames
                                     Frame-based disease severity assessment model     582 448 scorable frames predicted
Section Generation                                                                     2630 scorable sections; 386 432 scorable frames*
Graphical User Interface Leverage
Disease Severity Assessment [SDA]                                                      2630 reviewed sections; 386 432 scorable frames

*Not all scorable frames were used to create sections.

The final dataset, obtained after the fifth step and used to train the SDA model, is partitioned as follows: 126 320 [33%], 33 742 [9%], 84 937 [22%] and 141 433 [36%] for an MES score of 0, 1, 2 and 3, respectively; 126 878 [33%], 71 098 [18%] and 188 456 [49%] for a Vascular Pattern of 0, 1 and 2, respectively; 151 993 [40%], 190 624 [49%], 38 349 [10%] and 5466 [1%] for Bleeding of 0, 1, 2 and 3, respectively; and 159 780 [41%], 84 411 [22%], 81 986 [21%] and 60 255 [16%] for Erosions and Ulceration of 0, 1, 2 and 3, respectively [Figure 4]. All reviewed sections from the final dataset were kept, as the reviewers assigned identical labels for MES and for UCEIS and its three descriptors in 100% of cases.

Figure 4. Distribution of MES and UCEIS descriptors for each frame within reviewed sections. MES, Mayo Endoscopic Subscore; UCEIS, Ulcerative Colitis Endoscopic Index of Severity.

3.2. Model performance compared to expert labels

We provide in Table 2 the performance of the model according to the MAE and Bias metrics. Overall, the model performed well at both section- and video-level. The MAE and Bias were relatively low considering the magnitude of the scoring scales, especially for the UCEIS. In fact, at both section- and video-level, for the MES and the individual UCEIS subscores, SDA predictions were equal to or less than half a point away from the true value, and the model’s predictions for the aggregate UCEIS were less than a point away from the ground truth. The Bias for the MES, UCEIS and its individual subscores at both section- and video-level was slightly positive, except for the Vascular Pattern descriptor at section-level. The results presented in this paper are extracted from the test dataset, and several techniques were used during training to prevent overfitting of the model and to remove as much noise as possible from the data sets.

Table 2. MAE and Bias measures of the SDA model

                       Section-level         Video-level
                       MAE      Bias         MAE      Bias
MES                    0.32     0.05         0.19     0.19
UCEIS                  0.65     0.07         0.94     0.44
Vascular Pattern       0.20     −0.01        0.06     0.06
Bleeding               0.24     0.01         0.44     0.06
Erosions and Ulcers    0.36     0.10         0.50     0.12

MES Scale [0–3]; UCEIS Scale [0–8]; Vascular Pattern Scale [0–2]; Bleeding Scale [0–3]; Erosions and Ulcers Scale [0–3]. We do not provide results at frame-level because the ground truth is the section-level score, projected down to frame-level to train the model; there is hence inherent error in the frame-level truth that would be included in the performance results. MAE, mean absolute error; SDA, Section-based Disease Assessment; MES, Mayo Endoscopic Subscore; UCEIS, Ulcerative Colitis Endoscopic Index of Severity.

While the MAE is an appropriate metric to evaluate and compare regression models, the QWK metric is more suitable for classification tasks and for comparing our best model to the GI experts’ labels. According to the scientific literature, a QWK between 0.61 and 0.80 is considered substantial, while a QWK above 0.80 indicates almost perfect agreement. Table 3 shows the inter-observer agreement between expert endoscopists and the SDA model at section- and video-level using the QWK metric. The model’s predictions at section-level were excellent, with a QWK over 0.8, except for the Bleeding descriptor. At video-level, the model’s performance was good, with a QWK over 0.6 except for the Bleeding descriptor. Note that the QWK at video-level for UCEIS is relatively low compared to MES. This can be explained by the quadratic progression of the QWK penalty, which over-penalizes a score with more levels; the two results are therefore not directly comparable.

Table 3. The results of inter-observer agreement QWK at section- and video-level

                            Section-level    Video-level
Mayo Endoscopic Subscore    0.886            0.821
UCEIS                       0.904            0.646
Vascular Pattern            0.905            0.879
Bleeding                    0.754            0.391
Erosions and Ulcers         0.800            0.600

QWK, quadratic weighted kappa; UCEIS, Ulcerative Colitis Endoscopic Index of Severity.

Model results are also presented at severity-level for both MES [Supplementary Table 1] and UCEIS [Supplementary Table 2] using classification metrics that included specificity, sensitivity, NPV, PPV and area under the curve [AUC]. The MES model’s best performance was for severity-levels 0 and 3, with specificity of 94.60 and 87.90%, respectively; sensitivity of 85.71 and 69.14%, respectively; NPV of 92.00 and 87.70%, respectively; and PPV of 90.14 and 69.54%, respectively. This was aligned with the inter-reviewer variability of section scores, which was low for MES, UCEIS and its three descriptors, especially at extreme levels such as MES 0 and 3. Variability was lower with the silver and gold standard teams, which were used to score the sections composing our final dataset. Because of our iterative approach [section labelling, section creation, model training], we were able to keep variability between our labellers low.

Note that, as shown in these two tables, section-level results for MES and UCEIS levels with less data [imbalanced classes] were similar to those for levels with more instances. The AUC, which quantifies how well the model distinguishes between classes, for MES 1 [0.715], which represents only 9% of all MES instances, was similar to that of the other MES levels [AUC of 0.902, 0.68 and 0.785 for MES 0, 2 and 3, respectively]. Similarly, the AUC for UCEIS scores of 2, 3 and 7 [0.481, 0.669, 0.520] is close to that of the other UCEIS scores of 0, 1, 4, 5 and 6 [0.910, 0.653, 0.601, 0.688, 0.555].

Confusion matrices for MES and UCEIS at section-level are shown in Supplementary Figures 1 and 2. Accuracies were 69.00 and 54.80% for MES and UCEIS, respectively. Additionally, we computed the accuracy at ±1 severity-level for UCEIS to determine the degree of disagreement between the model’s predictions and the ground truth. The ±1 accuracy was 87.4%, indicating a low error amplitude. As described in the Performance Metrics subsection, we also considered two binary classification tasks: MES 0–1 vs MES 2–3 and UCEIS ≤ 3 vs UCEIS > 3. The results are presented in Supplementary Tables 3 and 4 for the two tasks, respectively.
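For clarity, the ±1 accuracy amounts to the following computation [a sketch]:

```python
import numpy as np

def accuracy_within(preds, truth, k=1):
    """Share of predictions within k severity levels of the ground truth."""
    return np.mean(np.abs(np.rint(preds) - truth) <= k)
```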

3.3. GUI evaluation

In a preliminary review of the system by our four clinical experts [R.P., J.E.E., S.P.T., M.I.], we received the following feedback regarding the use of our tool to assess UC severity from endoscopic videos. The tool allowed the user to focus on relevant sections instead of the whole video, substantially reducing the number of frames that need to be reviewed: of the 1 550 030 total frames, only 386 432 were kept in the created sections reviewed by the GI specialists. It also improved the quality and consistency of scores, since all users review the same segments, allowing the multiple labels to be used iteratively to re-train and improve the underlying DL models.

The detection and scoring accuracy of UC disease using our GUI was evaluated with an additional 100 videos. These were processed through each step of our DL methodology, up to the section generation algorithm. On average, 30 sections per recording were created. Each generated section was reviewed and assessed for MES, UCEIS and its three descriptors by two of our clinical experts [S.P.T., M.I.]. Similar to the approach described above, we took the maximum of each score over the sections of a video to infer the score from section- to video-level. In total, 77% of the recordings reviewed by our experts were attributed the same score for MES, UCEIS and its three descriptors as the ground truth [S.P.T.], and 92% of the videos were labelled with MES, UCEIS and its three descriptor scores within half a point of the ground truth. The accuracy of the CR compared to our GUI supported by the various CNN models is similar to the performance obtained in the Results section when comparing our models to the true label: the MAE and bias were 0.35 and 0.06 for MES, and 0.69 and 0.08 for UCEIS, respectively. In addition, our GUI resulted in faster video quality assessment and section scoring by GI specialists. Per our expert labellers, reviewing each video without GUI assistance takes 10–15 min; while using the GUI, the average time to review a section and a video was 26 s and 8 min, respectively. The 8 min figure corresponds to reviewing the entire video, done only for the purpose of this study; in research and clinical applications, as the GUI takes readers directly to the pertinent sections that need to be evaluated, we expect the actual time spent with the GUI to be less than 8 min. A short video demonstration of our GUI and AI model in action is provided [Supplementary Video 1].

4. Discussion

The endoscopic scoring of UC disease activity with MES and UCEIS has traditionally been challenging due to the lack of clinical validation and to disagreement on repeated observations. CRs, who are clinically blinded expert endoscopists, have been utilized in UC clinical trials in an attempt to standardize the endoscopic assessment of UC.21 CR validation in the endoscopic scoring of UC is inherently limited by the lack of a true gold standard [i.e. biopsy], and therefore any measures of accuracy may be impacted by the quality of CRs, algorithm performance, or inherent problems with the MES and UCEIS scoring systems. Previous studies have reported the application of DL to the analysis of large endoscopy image datasets in order to improve and standardize UC disease severity grading. Ozawa et al. developed a DL model based on a GoogLeNet architecture to identify MES 0 and mucosal healing [score 0–1] in an independent test set of 3981 images from 114 UC patients, with reported AUCs of 0.86 and 0.98, respectively.16 Stidham et al. focused on the binary classification task of MES 0–1 vs MES 2–3, and reported a sensitivity, specificity, PPV, NPV, accuracy and QWK of 93%, 87%, 84%, 94%, 90% and 79%, respectively.18 We also compared MES 0–1 against MES 2–3 and report better results for all these values, with sensitivity, specificity, PPV, NPV, accuracy and QWK of 96%, 91%, 91%, 96%, 94% and 87%, respectively. Takenaka et al. developed a CNN model to differentiate remission [UCEIS ≤ 3] from moderate–severe disease [UCEIS > 3], and reported excellent reproducibility for their model, with sensitivity, specificity, PPV, NPV and AUROC of 83%, 96%, 87%, 94% and 0.966, respectively.17 Our results on a similar task are 93%, 93%, 92%, 94% and 0.936, respectively. Yao et al. evaluated their CNN-based video analysis model on 264 videos and reported 83.7% accuracy for differentiating remission from active disease, with an AUC of 0.93, an average F1 score of 0.77, and a positive level of agreement with gastroenterologist scoring [kappa = 0.84].19 Gottlieb et al. performed a randomized controlled trial to evaluate their CNN model in the assessment of mucosal inflammation according to MES and UCEIS; their model’s overall performance on the primary objective metric showed almost perfect agreement, with QWK of 0.84 for MES and 0.85 for UCEIS.20 We obtained slightly better results on those same metrics, albeit at section-level, as shown in Table 3 [0.886 for MES and 0.904 for UCEIS]. Their model’s score-level accuracy on video is 70.2% for MES, compared to 69.0% for our work, and 45.5% for UCEIS, compared to 54.8% for our work.

While previous AI work in the field can score UC activity at frame- or video-level for either UCEIS or MES, we present the first fully automated DL model for scoring disease activity under both the MES and UCEIS, at frame-, section- and video-level, with an architecture that can accommodate other scoring systems such as the Paddington International virtual Chromoendoscopy Score [PICaSSO].22 We are also the first to describe an AI model that is ready for clinical evaluation both in terms of robustness and in terms of usability and fit with current workflows. A considerable challenge for AI tools in endoscopy is that they must not hinder the physician, but rather add true assistance and benefit; we therefore built this solution with the practical usability of the tool at the forefront of our thinking. Our system improves the accuracy of the MES and UCEIS scores, reduces the time between video collection and review, and improves subsequent quality assurance and scoring. Overall, our model performed well, as the MAE and mean bias at both section- and video-level were relatively close to the ground truth considering the magnitude of the scoring scales, especially for the UCEIS. In our investigation, the QWK was used to compare the inter-observer agreement between CR labels and the AI model’s predictions; the results were excellent at section-level for both MES and UCEIS, with QWK of 0.886 and 0.904, respectively.

One limitation of our study is a limited dataset with imbalanced classes. The generalizability of our model is also limited; to improve this, we have started validating our results by testing and training on different datasets spanning various endoscopy sites, equipment and recording techniques. Another limitation is the difficulty of making a fair comparison with the literature due to the lack of an open dataset. Moreover, the ground truth for MES and UCEIS may differ between experts, and there is potential for subjectivity among the labellers when reviewing AI-generated sections with our GUI. In addition, our model is an offline tool designed to work on pre-recorded videos of UC patients. Finally, our model has not yet undergone formal prospective independent testing, but we plan to perform this in the near future.

Overall, our results enable the development of a model that can be used to improve the efficiency and accuracy of endoscopic assessment and scoring of UC at different stages of the clinical journey, whether offline or live. It can be used by physicians at site-level for video quality assurance, and by central reading organizations and the pharmaceutical industry to score videos and increase the efficiency of central reading in clinical trials. It could also serve during live endoscopy as an accurate, reproducible measurement of endoscopic disease activity. Finally, there is an opportunity to set up training modules for the education of GI trainees.

5. Conclusions

We report a fully automated DL model that improves the accuracy of the MES and UCEIS scores, reduces the time between video collection and review, and improves subsequent quality assurance and scoring. Our model demonstrated relevant feature identification for scoring of disease activity in UC, well aligned with scoring guidelines and the performance of the experts. We present work that builds a frame-level regression scoring system paired with a clustering algorithm and video-level heuristics that score simultaneously under both scoring modalities. Going forward, we aim to continue developing our detection and scoring systems in order to produce a system that can score at a superhuman level and with greater precision than current scoring modalities. More data, in terms of both volume and diversity, are being collected and analysed to drive towards a final product ready for clinical use. Moreover, we are performing a more formal evaluation of the usability of the GUI described in this study, so that the tool truly offers time savings and better user satisfaction.

Funding

J.E. and S.P.T. are funded by the National Institute for Health Research [NIHR] Oxford Biomedical Research Centre. The views expressed are those of the authors and not necessarily those of the National Health Service, the NIHR or the Department of Health.

Conflict of Interest

M.F.B.: CEO, Founder and shareholder in Satisfai Health Inc. J.E.E.: Clinical advisory board and shareholder in Satisfai Health Inc.; speaker fees from Falk. R.P.: Clinical advisory board and shareholder in Satisfai Health Inc. M.L.H.: Shareholder in Satisfai Health Inc. G.L.: Employee, IVADO labs. L.S-D.: Employee, IVADO labs. P.L.: Employee, IVADO labs. S.N.: Employee, IVADO labs. J.A.: Employee, IVADO labs. R.M.: Shareholder in Satisfai Health Inc. E.D.C.: Employee, IVADO labs. F.S.: Employee, IVADO labs. S.P.T.: Consultant for Satisfai Health Inc. N.P.: Clinical advisory board and shareholder in Satisfai Health Inc. A.P.S., R.K., S.K.B., D.N.R. and H.R.R. have no financial relationships to disclose. M.I. has research grants from Pentax, Olympus and Fuji.

Author Contributions

M.F.B. [Conceptualization: Lead; Investigation: Lead; Supervision: Lead; Writing—original draft: Lead; Writing—review & editing: Equal]. R.P. [Conceptualization: Lead; Investigation: Lead; Writing—original draft: Supporting; Writing—review & editing: Equal]. J.E.E. [Conceptualization: Lead; Investigation: Lead; Writing—original draft: Supporting; Writing—review & editing: Equal]. M.I. [Conceptualization: Supporting; Data curation: Equal; Writing—review & editing: Equal]. N.P. [Writing—original draft: Lead; Writing—review & editing: Equal]. R.K. [Data curation: Equal; Writing—review & editing: Equal]. N.R.D. [Data curation: Equal; Writing—review & editing: Equal]. H.R. [Data curation: Equal; Writing—review & editing: Equal]. A.P.S. [Data curation: Equal; Writing—review & editing: Equal]. S.B. [Writing—original draft: Lead; Writing—review & editing: Equal]. R.M. [Conceptualization: Supporting; Writing—original draft: Lead; Writing—review & editing: Equal]. F.S. [Conceptualization: Lead; Investigation: Lead; Methodology: Lead; Supervision: Lead; Writing—original draft: Lead; Writing—review & editing: Equal]. G.L. [Methodology: Supporting; Writing—original draft: Lead; Writing—review & editing: Equal]. E.D.C. [Conceptualization: Supporting; Writing—review & editing: Equal]. L.S-D. [Conceptualization: Supporting; Methodology: Supporting; Writing—original draft: Lead; Writing—review & editing: Equal]. P.L. [Methodology: Supporting; Writing—original draft: Lead; Writing—review & editing: Equal]. S.N. [Methodology: Supporting; Writing—original draft: Lead; Writing—review & editing: Equal]. J.A. [Methodology: Supporting; Writing—original draft: Lead; Writing—review & editing: Equal]. M.L.H. [Data curation: Equal; Writing—original draft: Lead; Writing—review & editing: Equal]. S.P.T. [Investigation: Lead; Data curation: Equal; Supervision: Lead; Writing—original draft: Supporting; Writing—review & editing: Equal].

Grant support

None.

Data availability statement

Data are available on request: the data underlying this article will be shared upon request to the corresponding author.

References

1. Dignass A, Eliakim R, Magro F, et al. Second European evidence-based consensus on the diagnosis and management of ulcerative colitis Part 1: definitions and diagnosis. J Crohns Colitis 2012;6:965–90.

2. Singh S, Fumery M, Sandborn WJ, Murad MH. Systematic review with network meta-analysis: first- and second-line pharmacotherapy for moderate-severe ulcerative colitis. Aliment Pharmacol Ther 2018;47:162–75.

3. Lewis JD, Chuai S, Nessel L, Lichtenstein GR, Aberra FN, Ellenberg JH. Use of the noninvasive components of the Mayo score to assess clinical response in ulcerative colitis. Inflamm Bowel Dis 2008;14:1660–6.

4. Jones J, Loftus EV Jr, Panaccione R, Chen L-S, et al. Relationships between disease activity and serum and fecal biomarkers in patients with Crohn’s disease. Clin Gastroenterol Hepatol 2008;6:1218–24.

5. Schoepfer AM, Beglinger C, Straumann A, et al. Fecal calprotectin more accurately reflects endoscopic activity of ulcerative colitis than the Lichtiger Index, C-reactive protein, platelets, hemoglobin, and blood leukocytes. Inflamm Bowel Dis 2013;19:332–41.

6. Xie T, Zhang T, Ding C, et al. Ulcerative Colitis Endoscopic Index of Severity [UCEIS] vs Mayo Endoscopic Score [MES] in guiding the need for colectomy in patients with acute severe colitis. Gastroenterol Rep [Oxf] 2018;6:38–44.

7. Novak G, Parker CE, Pai RK, et al. Histologic scoring indices for evaluation of disease activity in Crohn’s disease. Cochrane Database Syst Rev 2017;7:CD012351.

8. Neurath MF, Travis SP. Mucosal healing in inflammatory bowel diseases: a systematic review. Gut 2012;61:1619–35.

9. Schroeder KW, Tremaine WJ, Ilstrup DM. Coated oral 5-aminosalicylic acid therapy for mildly to moderately active ulcerative colitis. N Engl J Med 1987;317:1625–9.

10. Travis SPL, Schnell D, Krzeski P, et al. Developing an instrument to assess the endoscopic severity of ulcerative colitis: the Ulcerative Colitis Endoscopic Index of Severity [UCEIS]. Gut 2012;61:535–42.

11. Bouguen G, Levesque BG, Pola S, et al. Feasibility of endoscopic assessment and treating to target to achieve mucosal healing in ulcerative colitis. Inflamm Bowel Dis 2014;20:231–9.

12. Mazzuoli S, Guglielmi FW, Antonelli E, et al. Definition and evaluation of mucosal healing in clinical practice. Dig Liver Dis 2013;45:969–77.

13. Reinink AR, Lee TC, Higgins PD. Endoscopic mucosal healing predicts favorable clinical outcomes in inflammatory bowel disease: a meta-analysis. Inflamm Bowel Dis 2016;22:1859–69.

14. Ikeya K, Hanai H, Sugimoto K, et al. The Ulcerative Colitis Endoscopic Index of Severity more accurately reflects clinical outcomes and long-term prognosis than the Mayo Endoscopic Score. J Crohns Colitis 2016;10:286–95.

15. Gottlieb K, Daperno M, Usiskin K, et al. Endoscopy and central reading in inflammatory bowel disease clinical trials: achievements, challenges, and future development. Gut 2021;70:418–26. doi:10.1136/gutjnl-2020-320690.

16. Ozawa T, Ishihara S, Fujishiro M, et al. Novel computer-assisted diagnosis system for endoscopic disease activity in patients with ulcerative colitis. Gastrointest Endosc 2019;89:416–21.

17. Takenaka K, Ohtsuka K, Fujii T, et al. Development and validation of a deep neural network for accurate evaluation of endoscopic images from patients with ulcerative colitis. Gastroenterology 2020;158:2150–7.

18. Stidham RW, Liu W, Bishu S, et al. Performance of a deep learning model vs human reviewers in grading endoscopic disease severity of patients with ulcerative colitis. JAMA Netw Open 2019;2:e193963.

19. Yao H, Najarian K, Gryak J, et al. Fully automated endoscopic disease activity assessment in ulcerative colitis. Gastrointest Endosc 2021;93:728–736.e1.

20. Gottlieb K, Requa J, Karnes W, et al. Central reading of ulcerative colitis clinical trial videos using neural networks. Gastroenterology 2021;160:710–719.e2.

21. Panés J, Feagan BG, Hussain F, et al. Central endoscopy reading in inflammatory bowel diseases. J Crohns Colitis 2016;10:S542–7.

22. Iacucci M, Smith SCL, Bazarova A, et al. An international multicenter real-life prospective study of electronic chromoendoscopy score PICaSSO in ulcerative colitis. Gastroenterology 2021;160:1558–1569.e8. doi:10.1053/j.gastro.2020.12.024.
