Jana G Hashash, Faye Yu Ci Ng, Francis A Farraye, Yeli Wang, Daniel R Colucci, Shrujal Baxi, Sadaf Muneer, Mitchell Reddan, Pratik Shingru, Gil Y Melmed, Inter- and Intraobserver Variability on Endoscopic Scoring Systems in Crohn’s Disease and Ulcerative Colitis: A Systematic Review and Meta-Analysis, Inflammatory Bowel Diseases, Volume 30, Issue 11, November 2024, Pages 2217–2226, https://doi.org/10.1093/ibd/izae051
Abstract
Endoscopy scoring is a key component in the diagnosis of ulcerative colitis (UC) and Crohn’s disease (CD). Variability in endoscopic scoring can impact patient trial eligibility and treatment effect measurement. In this study, we examine inter- and intraobserver variability of inflammatory bowel disease endoscopic scoring systems in a systematic review and meta-analysis.
We included observational studies, published in English, that evaluated inter- and intraobserver variability of UC (endoscopic Mayo Score [eMS], Ulcerative Colitis Endoscopic Index of Severity [UCEIS]) or CD (Crohn’s Disease Endoscopic Index of Severity [CDEIS], Simple Endoscopic Score for Crohn’s Disease [SES-CD]) scoring systems among adults (≥18 years of age). The strength of agreement was categorized as fair, moderate, good, or very good.
A total of 6003 records were identified. After screening, 13 studies were included in our analysis. The overall interobserver agreement rates were 0.58 for eMS, 0.66 for UCEIS, 0.80 for CDEIS, and 0.78 for SES-CD. The overall heterogeneity (I²) for these systems ranged from 93.2% to 99.2%. Fewer studies assessed intraobserver agreement; the overall effect sizes were 0.75 for eMS, 0.87 for UCEIS, 0.89 for CDEIS, and 0.91 for SES-CD.
The interobserver agreement rates for eMS, UCEIS, CDEIS, and SES-CD ranged from moderate to good. The intraobserver agreement rates for eMS, UCEIS, CDEIS, and SES-CD ranged from good to very good. Solutions to improve interobserver agreement could allow for more accurate patient assessment, leading to richer, more accurate clinical management and clinical trial data.
Lay Summary
This study examined the inter- and intraobserver variability of inflammatory bowel disease endoscopic scoring systems (endoscopic Mayo Score, Ulcerative Colitis Endoscopic Index of Severity, Crohn’s Disease Endoscopic Index of Severity, Simple Endoscopic Score for Crohn’s Disease) in a systematic review and meta-analysis.
Introduction
Inflammatory bowel disease (IBD), comprising Crohn’s disease (CD) and ulcerative colitis (UC), affects more than 6.8 million people worldwide,1 and its incidence has been steadily increasing globally in the 21st century.2 Endoscopy is a key component in the diagnosis, prognosis, and monitoring of these diseases, helping physicians accurately assess disease activity as well as determine and quantify response to therapy.3 Biopsies of intestinal mucosa can be obtained via endoscopy for histological examination, and endoscopic evaluation can be used to follow patients for dysplasia and colorectal cancer risk.4
In the last decade, endoscopic endpoints have been determined to be an important component of clinical trials and clinical practice according to evidence- and consensus-based recommendations by the International Organization for the Study of Inflammatory Bowel Diseases and the Food and Drug Administration.5 Concurrently, central reading of endoscopic videos has been deemed the gold standard for determining enrollment eligibility, as well as for evaluation of therapeutic endpoints, in clinical trials for IBD patients.6 Efforts to measure IBD severity have resulted in numerous scoring systems being developed to allow for greater ease and consistency in reporting of colonoscopy video findings.7-9 In turn, this has led to a more standardized monitoring protocol among physicians.10
Commonly used endoscopic scores for IBD include the endoscopic Mayo Score (eMS) and Ulcerative Colitis Endoscopic Index of Severity (UCEIS) for UC and the Simple Endoscopic Score for Crohn’s Disease (SES-CD) and Crohn’s Disease Endoscopic Index of Severity (CDEIS) for CD.11,12 Several studies have assessed the performance of these systems and have reached inconsistent results, mainly due to inter- and intraobserver variability in the reporting of scoring outcomes.5,13-24 Previously identified sources of disagreement include distinguishing superficial from deep ulcerations, estimating disease extent, scoring aphthous ulcers, and scoring beyond stenotic segments.15
We conducted a systematic review and meta-analysis of observational studies and clinical trials to determine the inter- and intraobserver variability of commonly used endoscopic scores (eMS and UCEIS for UC; SES-CD and CDEIS for CD). Our research aims to quantify the reliability and validity of endoscopic scoring systems and identify room to improve the sensitivity and reproducibility of these tools for both clinical practice and clinical trials.
Methods
Registration of Review Protocol
This review was registered with PROSPERO (CRD42022359749).
Data Sources and Searches
Two databases, PubMed/MEDLINE and EMBASE, were systematically searched on September 14, 2022. References of relevant reviews and original studies were checked to identify potential studies not found in the 2 databases. Search terms included “ulcerative colitis,” “Crohn’s disease,” “endoscopic activity index,” “eMS,” “SES-CD,” “UCEIS,” “CDEIS,” “inter-observer,” “intra-observer,” and “agreement.” The detailed search strategy and numbers retrieved are presented in Supplemental Table 1. We followed the MOOSE (Meta-analysis Of Observational Studies in Epidemiology) and PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines.25,26
Study Selection
The present study included cross-sectional studies and clinical trials that assessed inter- and/or intraobserver variability using endoscopic scoring systems for assessing disease activity in UC and CD among adults (≥18 years of age). Included studies were full text, published in English, and on human subjects with a clear definition of exposures (eMS, SES-CD, UCEIS, CDEIS) and outcomes (UC and CD). Because previous studies reported kappa, intraclass correlation coefficient (ICC), and Krippendorff’s alpha as metrics to evaluate inter- and intraobserver variability, we included studies reporting these metrics. The detailed inclusion and exclusion criteria for the literature search are presented in Supplemental Table 2.
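Each of these metrics corrects raw percentage agreement for the agreement expected by chance alone. As a rough, self-contained illustration (the reader scores below are hypothetical, not drawn from any included study, and many studies use a weighted variant for ordinal scores), unweighted Cohen’s kappa for two readers assigning eMS grades (0-3) can be computed as:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters' categorical scores."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of exact agreement
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical eMS grades (0-3) assigned by two readers to 8 videos
reader1 = [0, 1, 2, 3, 2, 1, 0, 3]
reader2 = [0, 1, 2, 2, 2, 1, 1, 3]
print(round(cohens_kappa(reader1, reader2), 2))  # → 0.67
```

A kappa of 0.67 would fall in the “moderate to good” band used in this review, even though the two readers agree on 6 of 8 videos, because chance agreement is discounted.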
Data Extraction and Assessment of Quality
Data extraction was completed independently by 2 authors (Y.W., P.S.) and validated by 3 authors (Y.W., P.S., D.C.). We extracted information on study aim, study design, data characteristics (data type, procedure type, scores used), review process (number of reviewers, number of cases, scoring procedure, time points for scoring, training program administered), and analysis endpoints (metrics reported and results). For the subgroup analysis, we also extracted information on definitions of experts vs nonexperts, cutoff values of agreement metrics (kappa, ICC, and Krippendorff’s alpha), and the description of training programs. Data quality was assessed with the study quality assessment tool for observational cohort and cross-sectional studies from the National Heart, Lung, and Blood Institute.27 Criteria included whether the exposure(s) of interest were measured prior to the outcome(s), whether the time frame was sufficient to reasonably observe an association between exposure and outcome, and whether loss to follow-up after baseline was 20% or less. A score of 1 was given for each criterion that was met and a score of 0 for each criterion that was not met or not applicable. Overall, the score for a single study could range from 0 to 11.
Statistical Analysis
Analysis was performed between October 2022 and January 2023. Agreement metrics were pooled using a random-effects meta-analysis of kappa, ICC, or Krippendorff’s alpha from each study. The heterogeneity among studies was assessed with the I² statistic, with 30% to 60% representing moderate heterogeneity and 75% to 100% representing considerable heterogeneity.28 Subgroup analysis was performed to explore potential sources of heterogeneity, including region (North America, Europe, Asia, multiple countries), study design (cross-sectional study, clinical trial), data characteristics (video, image, both video and image), procedure type (colonoscopy, ileocolonoscopy, endoscopy), number of reviewers and cases, time points for scoring (single, 2 or more), reviewer characteristics (experts, nonexperts), training program administered (yes, no), and statistical measurements (kappa, ICC, Krippendorff’s alpha). For the sensitivity analysis, we repeated the aforementioned analyses using an inverse-variance fixed-effects meta-analysis to examine the robustness of results. Publication bias was assessed with funnel plots and Egger regression tests.29 STATA 17.0 (StataCorp) was used for all analyses; 2-sided P values less than .05 were considered statistically significant unless otherwise stated.
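The pooling step described above can be sketched as follows. This is a minimal illustration using the classic DerSimonian-Laird random-effects estimator (the forest plots in this article were fitted with REML, which typically yields similar pooled values); the study estimates and standard errors in the usage example are invented, not taken from the included studies.

```python
import math

def dl_random_effects(estimates, std_errors):
    """DerSimonian-Laird random-effects pooling of study-level
    agreement rates (kappa/ICC). Requires at least 2 studies.
    Returns (pooled estimate, 95% CI tuple, I^2 percentage)."""
    w = [1 / se**2 for se in std_errors]                      # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    # Cochran's Q and the between-study variance tau^2
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = len(estimates) - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    # Re-weight with tau^2 added to each study's sampling variance
    w_star = [1 / (se**2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se_pooled = math.sqrt(1 / sum(w_star))
    ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0       # I^2 heterogeneity
    return pooled, ci, i2

# Hypothetical study-level kappas with their standard errors
kappas = [0.62, 0.55, 0.48, 0.71, 0.60]
ses = [0.04, 0.06, 0.05, 0.03, 0.05]
pooled, (lo, hi), i2 = dl_random_effects(kappas, ses)
```

When between-study heterogeneity is large, tau² inflates and the random-effects weights flatten toward equality, which is why the random-effects pooled rates and confidence intervals in Tables 1 and 2 differ from their inverse-variance fixed-effects counterparts.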
Results
Characteristics of Included Studies
The flowchart of the search and screening process is shown in Figure 1. The initial search identified 6003 citations. After removing duplicates; screening for title, abstract, and full text; and reviewing reference lists of relevant articles, a total of 13 studies were included in the current analysis (Figure 1).5,13-24 Of them, 9 studies had an outcome of UC and 7 studies had an outcome of CD (Supplemental Tables 3 and 4).

Figure 1. Flowchart of studies assessing inter- and intraobserver variability of endoscopic scoring systems in scoring endoscopic disease activity in Crohn’s disease and ulcerative colitis.
Among UC studies, 4 were conducted in Europe, 1 in North America, 1 in Asia, and 3 in multiple countries. Eight studies evaluated variability among experts, 5 evaluated variability among nonexperts, and 1 study assessed variability between both experts and nonexperts. Five studies provided a central reader training program prior to the study and 5 did not. Seven studies applied kappa to evaluate variability and 1 study used ICC (Supplemental Table 3).
Among CD studies, 3 were conducted in Europe, 3 in North America, and 1 in multiple countries. Three studies evaluated variability among experts, 5 evaluated variability among nonexperts, and 2 studies assessed variability between both experts and nonexperts. Six studies provided a central reader training program prior to the study and 2 did not. One study used kappa to assess variability and 6 studies used ICC (Supplemental Table 4).
For interobserver variability in endoscopic scoring, 7 studies used eMS and 2 used UCEIS for UC, while 6 used CDEIS and 7 used SES-CD for CD. For intraobserver variability, 1 study used eMS and 2 used UCEIS for UC, while 1 used CDEIS and 1 used SES-CD for CD (Supplemental Tables 3 and 4). Study quality scores ranged from 6 to 8 (Supplemental Tables 5 and 6). Baseline characteristics are presented in Supplemental Tables 7 and 8.
Interobserver Variability in UC and CD
Results from included studies are shown in Supplemental Tables 9 to 12. The forest plots for the overall interobserver variability are shown in Figures 2 and 3. The random-effects pooled agreement rate was 0.58 (95% confidence interval [CI], 0.52-0.65; I² = 97.6%) for eMS and 0.66 (95% CI, 0.34-0.99; I² = 99.2%) for UCEIS (Figure 2). The agreement rate was 0.78 (95% CI, 0.64-0.92; I² = 99.1%) for SES-CD and 0.80 (95% CI, 0.73-0.87; I² = 93.2%) for CDEIS (Figure 3). Considerable heterogeneity was observed across all metrics (I² > 75%).

Figure 2. Forest plot of the main analysis examining interobserver variability of endoscopic scoring systems in scoring endoscopic disease activity in ulcerative colitis. A, eMS, endoscopic Mayo Score; B, UCEIS, Ulcerative Colitis Endoscopic Index of Severity; CI, confidence interval; ICC, intraclass correlation coefficient; REML, restricted maximum likelihood.

Figure 3. Forest plot of the main analysis examining interobserver variability of endoscopic scoring systems in scoring endoscopic disease activity in Crohn’s disease. A, SES-CD, Simple Endoscopic Score for Crohn’s Disease; B, CDEIS, Crohn’s Disease Endoscopic Index of Severity; CI, confidence interval; ICC, intraclass correlation coefficient; REML, restricted maximum likelihood.
Subgroup analyses were conducted for the 3 endoscopic activity scoring systems with sufficient sample sizes (eMS, SES-CD, UCEIS); results are shown in Tables 1 and 2, Supplemental Figures 1 to 3, and Supplemental Table 13. The variability remained similar across most characteristics; however, significant differences in interobserver variability were consistently observed for 3 subgroups: reviewer characteristics, training program administration, and the statistical measurement used to report variability.
Table 1. Main and subgroup analysis of interobserver variability of endoscopic scoring systems in ulcerative colitis

| Characteristic | Studies | Inverse-variance fixed-effects agreement rate (95% CI) | Random-effects agreement rate (95% CI) | I² (%) | P value for heterogeneity among subgroups |
| --- | --- | --- | --- | --- | --- |
| **eMS** | | | | | |
| Main estimate | 9 | 0.65 (0.64-0.66) | 0.58 (0.52-0.65) | 97.6 | – |
| Data characteristics | | | | | .01 |
| Video | 4 | 0.64 (0.63-0.65) | 0.62 (0.55-0.68) | 94.6 | |
| Image | 2 | 0.66 (0.65-0.67) | 0.58 (0.40-0.75) | 93.4 | |
| Both image and video | 1 | 0.47 (0.41-0.53) | 0.47 (0.41-0.53) | – | |
| Procedure type | | | | | <.001 |
| Colonoscopy | 2 | 0.48 (0.42-0.54) | 0.48 (0.42-0.54) | 0 | |
| Ileocolonoscopy | 2 | 0.66 (0.65-0.67) | 0.54 (0.44-0.65) | 81.2 | |
| Endoscopy (not specified) | 3 | 0.58 (0.56-0.61) | 0.65 (0.60-0.71) | 96.7 | |
| Reviewer cohort assessed | | | | | |
| Assess variability between distinct groups (expert vs nonexpert) | 1 | 0.60 (0.58-0.62) | 0.60 (0.58-0.62) | – | .19ᵃ |
| Assess among groups | 6 | 0.59 (0.59-0.60) | 0.55 (0.48-0.62) | 98.3 | .31ᵇ |
| Expert groups | 4 | 0.73 (0.72-0.74) | 0.59 (0.48-0.70) | 95.3 | |
| Experts | 4 | 0.73 (0.72-0.74) | 0.59 (0.48-0.70) | 95.3 | |
| Nonexpert groups | 5 | 0.47 (0.46-0.48) | 0.52 (0.44-0.60) | 95.6 | |
| Nonexperts | 3 | 0.47 (0.46-0.48) | 0.53 (0.41-0.65) | 98.0 | |
| Staff | 1 | 0.49 (0.37-0.61) | 0.49 (0.37-0.61) | – | |
| Trainees | 1 | 0.48 (0.37-0.59) | 0.48 (0.37-0.59) | – | |
| Training program administered | | | | | .01ᶜ |
| Yes | 3 | 0.67 (0.65-0.68) | 0.67 (0.60-0.74) | 94.6 | |
| Among experts | 1 | 0.53 (0.49-0.57) | 0.53 (0.49-0.57) | – | |
| Among nonexperts | 1 | 0.76 (0.73-0.79) | 0.76 (0.73-0.79) | – | |
| Other (assessed variability between experts vs nonexperts) | 1 | 0.60 (0.58-0.62) | 0.60 (0.58-0.62) | – | |
| No | 5 | 0.65 (0.64-0.66) | 0.54 (0.46-0.62) | 91.0 | |
| Among experts | 3 | 0.73 (0.73-0.74) | 0.61 (0.47-0.75) | 93.1 | |
| Among nonexperts | 4 | 0.47 (0.46-0.48) | 0.52 (0.44-0.60) | 95.6 | |
| Other (assessed variability between experts vs nonexperts) | 0 | – | – | – | |
| Statistical measurements | | | | | <.001 |
| Kappa | 6 | 0.65 (0.64-0.66) | 0.61 (0.54-0.67) | 97.1 | |
| Krippendorff’s alpha | 1 | 0.47 (0.41-0.53) | 0.47 (0.41-0.53) | – | |
Abbreviations: CI, confidence interval; eMS, endoscopic Mayo Score.
ᵃCompared variability among groups with variability between distinct groups.
ᵇCompared variability between experts and nonexperts.
ᶜCompared studies with a training program against studies without one.
Table 2. Main and subgroup analysis of interobserver variability of endoscopic scoring systems in Crohn’s disease

| Characteristic | Studies | Inverse-variance fixed-effects agreement rate (95% CI) | Random-effects agreement rate (95% CI) | I² (%) | P value for heterogeneity between subgroups |
| --- | --- | --- | --- | --- | --- |
| **SES-CD** | | | | | |
| Main estimate | 7 | 0.88 (0.86-0.89) | 0.78 (0.64-0.92) | 99.1 | – |
| Data characteristics | | | | | <.001 |
| Video | 6 | 0.88 (0.87-0.89) | 0.86 (0.82-0.90) | 85.9 | |
| Image | 1 | 0.35 (0.24-0.46) | 0.35 (0.24-0.46) | – | |
| Procedure type | | | | | .54 |
| Colonoscopy | 1 | 0.83 (0.77-0.89) | 0.83 (0.77-0.89) | – | |
| Ileocolonoscopy | 4 | 0.87 (0.85-0.88) | 0.73 (0.49-0.98) | 99.5 | |
| Endoscopy (not specified) | 2 | 0.90 (0.88-0.92) | 0.87 (0.77-0.98) | 37.8 | |
| Study aim | | | | | |
| Assess variability between distinct groups (expert vs nonexpert) | 2 | 0.88 (0.86-0.90) | 0.88 (0.79-0.96) | 95.1 | .01ᵃ |
| Assess variability among groups | 6 | 0.76 (0.73-0.79) | 0.70 (0.54-0.86) | 96.8 | |
| Expert groups | 3 | 0.88 (0.84-0.92) | 0.88 (0.83-0.94) | 56.3 | <.001ᵇ |
| Experts | 1 | 0.93 (0.87-0.99) | 0.93 (0.87-0.99) | – | |
| Central readers | 2 | 0.86 (0.81-0.91) | 0.88 (0.80-0.92) | 34.0 | |
| Nonexpert groups | 4 | 0.63 (0.59-0.67) | 0.58 (0.39-0.78) | 95.2 | |
| Nonexperts | 3 | 0.73 (0.68-0.77) | 0.73 (0.65-0.81) | 55.7 | |
| Staff | 1 | 0.37 (0.26-0.48) | 0.37 (0.26-0.48) | – | |
| Trainees | 1 | 0.32 (0.21-0.43) | 0.32 (0.21-0.43) | – | |
| Training program administered | | | | | .020ᶜ |
| Yes | 6 | 0.87 (0.86-0.89) | 0.86 (0.82-0.90) | 77.8 | |
| Among experts | 3 | 0.87 (0.82-0.91) | 0.87 (0.82-0.92) | 27.8 | |
| Among nonexperts | 2 | 0.83 (0.80-0.86) | 0.83 (0.79-0.86) | 0 | |
| Other (assessed variability between experts vs nonexperts) | 2 | 0.88 (0.86-0.90) | 0.88 (0.79-0.86) | 0.07 | |
| No | 2 | 0.39 (0.28-0.49) | 0.48 (0.17-0.80) | 74.1 | |
| Among experts | 0 | – | – | – | |
| Among nonexperts | 2 | 0.39 (0.28-0.49) | 0.41 (0.25-0.56) | 45.8 | |
| Other (assessed variability between experts vs nonexperts) | 0 | – | – | – | |
| Statistical measurements | | | | | <.001 |
| ICC | 6 | 0.88 (0.87-0.89) | 0.86 (0.82-0.90) | 85.9 | |
| Kappa | 1 | 0.35 (0.24-0.46) | 0.35 (0.24-0.46) | – | |
| **CDEIS** | | | | | |
| Main estimate | 6 | 0.86 (0.85-0.87) | 0.80 (0.73-0.87) | 93.2 | – |
| Data characteristics | | | | | – |
| Video | 6 | 0.86 (0.85-0.87) | 0.80 (0.73-0.87) | 93.2 | |
| Image | 0 | – | – | – | |
| Procedure type | | | | | <.001 |
| Colonoscopy | 1 | 0.71 (0.65-0.77) | 0.71 (0.65-0.77) | – | |
| Ileocolonoscopy | 3 | 0.87 (0.84-0.89) | 0.81 (0.71-0.92) | 91.1 | |
| Endoscopy (not specified) | 2 | 0.86 (0.83-0.89) | 0.87 (0.83-0.90) | 4.27 | |
| Study aim | | | | | |
| Assess variability between distinct groups (expert vs nonexpert readers) | 2 | 0.89 (0.87-0.91) | 0.89 (0.86-0.92) | 23.4 | <.001ᵃ |
| Assess among groups | 5 | 0.76 (0.73-0.80) | 0.77 (0.71-0.84) | 63.4 | .29ᵇ |
| Expert groups | 3 | 0.79 (0.74-0.83) | 0.80 (0.70-0.90) | 77.9 | |
| Experts | 1 | 0.83 (0.71-0.95) | 0.83 (0.71-0.95) | – | |
| Central readers | 2 | 0.78 (0.73-0.83) | 0.79 (0.63-0.95) | 90.0 | |
| Nonexpert groups | 3 | 0.74 (0.69-0.79) | 0.74 (0.69-0.79) | 0 | |
| Nonexperts | 3 | 0.74 (0.69-0.79) | 0.74 (0.69-0.79) | 0 | |
| Training program administered | | | | | .08ᶜ |
| Yes | 6 | 0.86 (0.85-0.87) | 0.81 (0.74-0.88) | 93.2 | |
| Among experts | 3 | 0.78 (0.74-0.83) | 0.80 (0.69-0.90) | 77.2 | |
| Among nonexperts | 2 | 0.74 (0.69-0.79) | 0.74 (0.69-0.79) | 0.01 | |
| Other (assessed variability between experts vs nonexperts) | 2 | 0.89 (0.87-0.91) | 0.89 (0.86-0.92) | 23.4 | |
| No | 1 | 0.67 (0.54-0.80) | 0.67 (0.54-0.80) | – | |
| Among experts | 0 | – | – | – | |
| Among nonexperts | 1 | 0.67 (0.54-0.80) | 0.67 (0.54-0.80) | – | |
| Other (assessed variability between experts vs nonexperts) | 0 | – | – | – | |
| Statistical measurements | | | | | – |
| ICC | 6 | 0.86 (0.85-0.87) | 0.80 (0.73-0.87) | 93.2 | |
| Kappa | 0 | – | – | – | |
Abbreviations: CDEIS, Crohn’s Disease Endoscopic Index of Severity; CI, confidence interval; ICC, intraclass correlation coefficient; SES-CD, Simple Endoscopic Score for Crohn’s Disease.
ᵃCompared variability among groups with variability between distinct groups.
ᵇCompared variability between experts and nonexperts.
ᶜCompared studies with a training program against studies without one.
Main and subgroup analysis of interobserver variability of endoscopic scoring systems in Crohn’s disease

| Characteristic | Studies | Agreement rate (95% CI), inverse-variance fixed-effects | Agreement rate (95% CI), random-effects | I² (%) | P value for heterogeneity between subgroups |
|---|---|---|---|---|---|
| SES-CD | | | | | |
| Main estimate | 7 | 0.88 (0.86-0.89) | 0.78 (0.64-0.92) | 99.1 | – |
| Data characteristics | | | | | <.001 |
| Video | 6 | 0.88 (0.87-0.89) | 0.86 (0.82-0.90) | 85.9 | |
| Image | 1 | 0.35 (0.24-0.46) | 0.35 (0.24-0.46) | – | |
| Procedure type | | | | | .54 |
| Colonoscopy | 1 | 0.83 (0.77-0.89) | 0.83 (0.77-0.89) | – | |
| Ileocolonoscopy | 4 | 0.87 (0.85-0.88) | 0.73 (0.49-0.98) | 99.5 | |
| Endoscopy (not specified) | 2 | 0.90 (0.88-0.92) | 0.87 (0.77-0.98) | 37.8 | |
| Study aim | | | | | |
| Assess variability between distinct groups (expert vs nonexpert) | 2 | 0.88 (0.86-0.90) | 0.88 (0.79-0.96) | 95.1 | .01a |
| Assess variability among groups | 6 | 0.76 (0.73-0.79) | 0.70 (0.54-0.86) | 96.8 | |
| Expert groups | 3 | 0.88 (0.84-0.92) | 0.88 (0.83-0.94) | 56.3 | <.001b |
| Experts | 1 | 0.93 (0.87-0.99) | 0.93 (0.87-0.99) | – | |
| Central readers | 2 | 0.86 (0.81-0.91) | 0.88 (0.80-0.92) | 34.0 | |
| Nonexpert groups | 4 | 0.63 (0.59-0.67) | 0.58 (0.39-0.78) | 95.2 | |
| Nonexperts | 3 | 0.73 (0.68-0.77) | 0.73 (0.65-0.81) | 55.7 | |
| Staff | 1 | 0.37 (0.26-0.48) | 0.37 (0.26-0.48) | – | |
| Trainees | 1 | 0.32 (0.21-0.43) | 0.32 (0.21-0.43) | – | |
| Training program administered | | | | | .020c |
| Yes | 6 | 0.87 (0.86-0.89) | 0.86 (0.82-0.90) | 77.8 | |
| Among experts | 3 | 0.87 (0.82-0.91) | 0.87 (0.82-0.92) | 27.8 | |
| Among nonexperts | 2 | 0.83 (0.80-0.86) | 0.83 (0.79-0.86) | 0 | |
| Other (assessed variability between experts vs nonexperts) | 2 | 0.88 (0.86-0.90) | 0.88 (0.79-0.86) | 0.07 | |
| No | 2 | 0.39 (0.28-0.49) | 0.48 (0.17-0.80) | 74.1 | |
| Among experts | 0 | – | – | – | |
| Among nonexperts | 2 | 0.39 (0.28-0.49) | 0.41 (0.25-0.56) | 45.8 | |
| Other (assessed variability between experts vs nonexperts) | 0 | – | – | – | |
| Statistical measurements | | | | | <.001 |
| ICC | 6 | 0.88 (0.87-0.89) | 0.86 (0.82-0.90) | 85.9 | |
| Kappa | 1 | 0.35 (0.24-0.46) | 0.35 (0.24-0.46) | – | |
| CDEIS | | | | | |
| Main estimate | 6 | 0.86 (0.85-0.87) | 0.80 (0.73-0.87) | 93.2 | – |
| Data characteristics | | | | | – |
| Video | 6 | 0.86 (0.85-0.87) | 0.80 (0.73-0.87) | 93.2 | |
| Image | 0 | – | – | – | |
| Procedure type | | | | | <.001 |
| Colonoscopy | 1 | 0.71 (0.65-0.77) | 0.71 (0.65-0.77) | – | |
| Ileocolonoscopy | 3 | 0.87 (0.84-0.89) | 0.81 (0.71-0.92) | 91.1 | |
| Endoscopy (not specified) | 2 | 0.86 (0.83-0.89) | 0.87 (0.83-0.90) | 4.27 | |
| Study aim | | | | | |
| Assess variability between distinct groups (expert vs nonexpert readers) | 2 | 0.89 (0.87-0.91) | 0.89 (0.86-0.92) | 23.4 | <.001a |
| Assess variability among groups | 5 | 0.76 (0.73-0.80) | 0.77 (0.71-0.84) | 63.4 | .29b |
| Expert groups | 3 | 0.79 (0.74-0.83) | 0.80 (0.70-0.90) | 77.9 | |
| Experts | 1 | 0.83 (0.71-0.95) | 0.83 (0.71-0.95) | – | |
| Central readers | 2 | 0.78 (0.73-0.83) | 0.79 (0.63-0.95) | 90.0 | |
| Nonexpert groups | 3 | 0.74 (0.69-0.79) | 0.74 (0.69-0.79) | 0 | |
| Nonexperts | 3 | 0.74 (0.69-0.79) | 0.74 (0.69-0.79) | 0 | |
| Training program administered | | | | | .08c |
| Yes | 6 | 0.86 (0.85-0.87) | 0.81 (0.74-0.88) | 93.2 | |
| Among experts | 3 | 0.78 (0.74-0.83) | 0.80 (0.69-0.90) | 77.2 | |
| Among nonexperts | 2 | 0.74 (0.69-0.79) | 0.74 (0.69-0.79) | 0.01 | |
| Other (assessed variability between experts vs nonexperts) | 2 | 0.89 (0.87-0.91) | 0.89 (0.86-0.92) | 23.4 | |
| No | 1 | 0.67 (0.54-0.80) | 0.67 (0.54-0.80) | – | |
| Among experts | 0 | – | – | – | |
| Among nonexperts | 1 | 0.67 (0.54-0.80) | 0.67 (0.54-0.80) | – | |
| Other (assessed variability between experts vs nonexperts) | 0 | – | – | – | |
| Statistical measurements | | | | | – |
| ICC | 6 | 0.86 (0.85-0.87) | 0.80 (0.73-0.87) | 93.2 | |
| Kappa | 0 | – | – | – | |
Abbreviations: CDEIS, Crohn’s Disease Endoscopic Index of Severity; CI, confidence interval; ICC, intraclass correlation coefficient; SES-CD, Simple Endoscopic Score for Crohn’s Disease.
a Compared variability among groups with variability between distinct groups.
b Compared variability between experts and nonexperts.
c Compared between studies with and without training programs.
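For readers less familiar with the pooling used in the tables above, the following Python sketch shows how an inverse-variance fixed-effect estimate, a DerSimonian-Laird random-effects estimate, and the I² heterogeneity statistic are computed. The per-study agreement rates and variances are hypothetical illustrations, not the data from the included studies.

```python
import math

def pool(estimates, variances):
    """Inverse-variance fixed-effect and DerSimonian-Laird random-effects pooling."""
    w = [1.0 / v for v in variances]
    fixed = sum(wi * e for wi, e in zip(w, estimates)) / sum(w)
    # Cochran's Q and I^2 quantify between-study heterogeneity
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, estimates))
    k = len(estimates)
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    # DerSimonian-Laird between-study variance tau^2 inflates each study's variance
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    w_re = [1.0 / (v + tau2) for v in variances]
    rand_est = sum(wi * e for wi, e in zip(w_re, estimates)) / sum(w_re)
    se_re = math.sqrt(1.0 / sum(w_re))
    ci = (rand_est - 1.96 * se_re, rand_est + 1.96 * se_re)
    return fixed, rand_est, ci, i2

# Hypothetical per-study agreement rates and their variances (illustrative only)
est = [0.88, 0.73, 0.35, 0.83]
var = [0.001, 0.004, 0.003, 0.002]
fixed, rand_est, ci, i2 = pool(est, var)
```

Note how the random-effects weights are more nearly equal than the fixed-effect weights, so an outlying study pulls the random-effects estimate further from the fixed-effect one and widens its confidence interval, the pattern visible in several rows of the tables above.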
Experts vs nonexperts
Interobserver variability was significantly lower among experts than nonexperts. For SES-CD, the pooled agreement rate was 0.88 (95% CI, 0.83-0.94) for experts and 0.58 (95% CI, 0.39-0.78) for nonexperts (P < .001). Results from meta-regression analyses also showed that experts had significantly better agreement than nonexperts for SES-CD (P = .012) (Supplemental Table 14). Although the difference was not statistically significant for eMS and CDEIS, a similar trend was observed: for eMS, the agreement rate improved from 0.53 (95% CI, 0.41-0.65) for nonexperts to 0.59 (95% CI, 0.48-0.70) for experts (P = .31); for CDEIS, it was 0.74 (95% CI, 0.69-0.79) in nonexperts and 0.80 (95% CI, 0.70-0.90) in experts (P = .29). The definitions of experts and nonexperts used in the included studies and in the current analysis are shown in Supplemental Tables 15 and 16, and the distributions of numbers of reviewers vs cases are shown in Supplemental Figures 5 and 6. Notably, the number of reviewers, the number of cases reviewed, and the average number of cases reviewed per reviewer did not affect the overall interobserver variability (Supplemental Table 13).
Training program
For all 3 scoring systems, interobserver variability was lower in studies that provided a training program prior to scoring disease activity than in studies without one, reaching statistical significance for eMS and SES-CD. For eMS, the agreement rate was 0.67 (95% CI, 0.60-0.74) with a training program and 0.54 (95% CI, 0.46-0.62) without (P = .01). For SES-CD, it was 0.86 (95% CI, 0.82-0.90) with a training program and 0.48 (95% CI, 0.17-0.80) without (P = .02). For CDEIS, it was 0.81 (95% CI, 0.74-0.88) with a training program and 0.67 (95% CI, 0.54-0.80) without (P = .08). Detailed descriptions of the training programs in the included studies are shown in Supplemental Tables 17 and 18. Results from meta-regression analyses also showed that studies with training programs had significantly better agreement than studies without them for eMS (P = .02) and SES-CD (P < .001) (Supplemental Table 14).
Statistical measurements
Agreement differed significantly between studies using different statistical measurements. For eMS, studies using kappa had a higher agreement rate than studies using Krippendorff’s alpha (0.61 [95% CI, 0.54-0.67] vs 0.47 [95% CI, 0.41-0.53]; P < .001). For SES-CD, studies using ICC reported a higher agreement rate than studies using kappa (0.86 [95% CI, 0.82-0.90] vs 0.35 [95% CI, 0.24-0.46]; P < .001). All studies on CDEIS used ICC (0.80 [95% CI, 0.73-0.87]). Results from meta-regression analyses likewise showed that studies using ICC had significantly better agreement than studies using kappa for SES-CD (P < .001) (Supplemental Table 14). Cutoffs for kappa, ICC, and Krippendorff’s alpha, and the strength of agreement defined in the included studies, are shown in Supplemental Tables 19 and 20. Despite some inconsistencies, most studies considered agreement to be poor when the kappa (or ICC) was <0.20, fair between 0.21 and 0.40, moderate between 0.41 and 0.60, good or substantial between 0.61 and 0.80, and very good when >0.80.
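To make the chance-corrected agreement metrics concrete, the following Python sketch computes Cohen's kappa for two raters and maps the result onto the agreement bands described above. The two sets of eMS (0-3) readings are hypothetical, invented for illustration.

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected chance agreement from each rater's marginal score distribution
    p1, p2 = Counter(r1), Counter(r2)
    expected = sum(p1[c] * p2[c] for c in set(r1) | set(r2)) / n ** 2
    return (observed - expected) / (1 - expected)

def strength(score):
    """Agreement bands as most included studies define them."""
    if score <= 0.20: return "poor"
    if score <= 0.40: return "fair"
    if score <= 0.60: return "moderate"
    if score <= 0.80: return "good"
    return "very good"

# Hypothetical eMS (0-3) scores from two reviewers over 10 cases
rater1 = [0, 1, 2, 3, 2, 1, 0, 3, 2, 1]
rater2 = [0, 1, 2, 2, 2, 1, 1, 3, 2, 1]
k = cohens_kappa(rater1, rater2)
```

Here the raw agreement is 8 of 10 cases, but kappa is lower (about 0.72, "good") because some matches would occur by chance alone. Unweighted kappa also treats a 1-point and a 3-point disagreement identically, one reason ICC-based studies can report different agreement than kappa-based studies on the same scale.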
Two studies15,20 further conducted exploratory analyses to evaluate the interobserver variability by components of 2 of the CD scoring systems (SES-CD and CDEIS), or by segments of the colon (Supplemental Table 21). The interobserver variability was comparable across locations of the colon at the rectum, sigmoid, transverse colon, right colon, and ileum (agreement ranged from 0.77 to 0.84 for SES-CD and from 0.68 to 0.76 for CDEIS). Among the components of both scores, stenosis had the lowest agreement (0.46 for SES-CD; 0.29 for CDEIS), while the remaining components had similar agreement (Supplemental Table 21).
Intraobserver Variability in UC and CD
Intraobserver variability was assessed in 1 study on eMS, 2 on UCEIS, 1 on CDEIS, and 1 on SES-CD (Supplemental Tables 3 and 4). The intraobserver variability was 0.75 (95% CI, 0.73-0.77) for eMS (measured by kappa), 0.87 (95% CI, 0.84-0.91) for UCEIS (measured by ICC and kappa), 0.91 (95% CI, 0.89-0.95) for SES-CD (measured by ICC), and 0.89 (95% CI, 0.86-0.93) for CDEIS (measured by ICC) (Supplemental Tables 9-12).
Publication Bias and Risk of Bias in Included Studies
Visual inspection of funnel plots suggested some degree of publication bias: the majority of studies (5 of 7 for eMS, 2 of 2 for UCEIS, 4 of 7 for SES-CD, and 3 of 6 for CDEIS) lay outside the pseudo 95% CI of the funnel plots (Supplemental Figures 6 and 7). Leave-one-out sensitivity analyses showed that no individual study had a significant impact on the overall interobserver variability for eMS or CDEIS. For SES-CD, excluding the study by Hart et al29 significantly improved the agreement. For the 2 studies on UCEIS, excluding either study impacted the overall interobserver variability (Supplemental Figures 8 and 9).
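A leave-one-out sensitivity analysis of the kind described above simply re-pools the data with each study removed in turn; a large shift when a particular study is dropped flags it as influential. A minimal fixed-effect sketch in Python, using hypothetical study values rather than the paper's data:

```python
def leave_one_out(estimates, variances):
    """Inverse-variance fixed-effect pooled estimate with each study
    removed in turn; a large shift flags an influential study."""
    pooled = []
    for i in range(len(estimates)):
        w = [1.0 / v for j, v in enumerate(variances) if j != i]
        e = [x for j, x in enumerate(estimates) if j != i]
        pooled.append(sum(wi * ei for wi, ei in zip(w, e)) / sum(w))
    return pooled

# Hypothetical agreement rates and variances; the third study is a
# deliberate outlier, so dropping it shifts the pooled estimate most
est = [0.88, 0.73, 0.35, 0.83]
var = [0.001, 0.004, 0.003, 0.002]
loo = leave_one_out(est, var)
```

Comparing each element of `loo` against the all-studies estimate reproduces, in miniature, the pattern reported for SES-CD, where excluding one outlying study materially improved the pooled agreement.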
Discussion
In this systematic review and meta-analysis, we provided a pooled summary of inter- and intraobserver agreement rates on commonly used endoscopic scoring systems in IBD. For interobserver agreement rates, we found moderate-to-good agreement for the scoring systems evaluating UC and good agreement for both scoring systems evaluating CD. For intraobserver agreement rates, we found good-to-very good agreement across all scoring systems evaluating UC and CD. Our findings are in alignment with previous studies that have shown moderate-to-good inter- and intraobserver agreement rates in various endoscopic scoring systems.30
We observed that the inter- and intraobserver agreement rates for scoring systems evaluating CD outperformed those evaluating UC, remaining high at around 0.80. This could be attributed to the greater complexity and detail present in the endoscopic scoring systems for CD (CDEIS and SES-CD) than in those for UC (eMS and UCEIS). For example, the CDEIS and SES-CD involve an estimation of ulcer size and surface area, as well as assignment of a score to each individual segment of the bowel. Therefore, it is possible that the higher levels of precision and accuracy seen for CD scoring systems result from the additional time and attention required to calculate these scores. Although exhibiting better agreement, this more detailed scoring has been criticized as tedious, time-consuming, and not easily applicable to routine clinical practice or clinical trials.12
In the subgroup analysis of interobserver agreement, our study revealed 2 factors that may be crucial to improving endoscopic assessment: training and expert qualifications. A significantly higher interobserver agreement was observed for studies that provided training programs prior to scoring the disease activity compared with studies without such a training program. This suggests that endoscopic assessment is a skill that can be improved upon with proper guidance and practice. Furthermore, expertise is critical to the accurate and consistent assessment of endoscopic data, as experts demonstrated significantly higher interobserver agreement compared with nonexperts. Nevertheless, training is a labor- and resource-intensive process, and the accumulation of experience and skills required to transition a gastroenterology trainee from novice to expert requires significant investments of time, money, and effort. At this time, there is a lack of experienced human readers adequately trained to apply these scoring systems in a standardized manner. The complexity, interobserver variability, and incomplete validation of these scores further hinder efforts to establish them in mainstream clinical practice.30 As such, although various endoscopic scoring systems have been developed for the assessment of disease severity in IBD, these scores are seldom utilized for decision making or patient management. Similarly, central reading by experienced and trained reviewers is upheld as the gold standard in IBD clinical trials but is impractical for use in the clinical setting due to high cost and insufficient available human expertise.
One solution to address inter- and intraobserver variability, which can scale better than training and employing expert reviewers for central reading, may be the adoption of artificial intelligence (AI). AI has been trained to recognize key endoscopic features of IBD such as hyperemia, granularity, aphthoid lesions, ulcers, and stenosis,31 based on labels annotated by trained experts. These algorithms can also be taught to determine the location and extent of inflammation, and the severity of disease, aiding treatment planning and prognosis.32 Future endoscopic disease scoring paradigms could improve the accuracy and reliability of assessments through the use of AI to supplement or replace expert reviewers.
Preliminary automated endoscopic disease scoring approaches have been attempted. Takenaka et al33 developed a deep neural network for the evaluation of endoscopic images in UC patients and identified endoscopic remission with 90.1% accuracy and histologic remission with 92.9% accuracy. Meanwhile, Yao et al32 piloted a fully automated video analysis system for grading endoscopic disease in UC with a 69.5% central reviewer scoring agreement when accounting for inter-reviewer variability. Critical to the success of these systems is understanding the overall quality and agreement rates of expert endoscopic reviewers, as this helps both inform the quality of data that can be used to build these automated solutions as well as quantify the accuracy of the expert reviewer–scored reference standard that these systems aim to replicate. A successful technical solution such as AI-enabled standardized scoring could bridge the current known divide between central reading utilized in clinical trials and current endoscopic scoring practices. This initial application of AI in endoscopy in IBD may eventually provide a foundation for more granular, novel assessments of inflammation that are clinically relevant and hold predictive power for a patient’s course of disease.
The strengths of our study lie in the large number of systematically included studies from diverse settings and contexts, increasing the generalizability of our findings. To the best of our knowledge, this is the first systematic review and meta-analysis, and the most comprehensive evidence-based synthesis to date, evaluating inter- and intraobserver variability on endoscopic scoring systems in IBD. We included studies assessing inter- and intraobserver variability of endoscopic scoring in both UC and CD for both experts and nonexperts, as well as before and after training, exploring many scenarios in which variability in the assessment of endoscopic disease activity may be present. We applied a rigorous, predetermined protocol of systematic searching, bias assessment, and quality assessment in accordance with published guidelines. We executed a comprehensive data extraction of study characteristics, enabling us to perform subgroup analyses on a wide variety of variables to evaluate for cause of heterogeneity between studies. In addition, we designed quality assessment in accordance with recommended guidelines and noted similar quality among included studies.
Nonetheless, some limitations should be acknowledged. First, heterogeneity of the included studies was high, limiting our ability to meaningfully pool data. This might be attributed to the varied settings and contexts from which the studies were drawn, as well as inherent differences in experience and expertise between experts and nonexperts. Second, the included studies used various metrics to evaluate inter- and intraobserver variability, including kappa, ICC, and Krippendorff’s alpha, which have different statistical properties and interpretations, preventing a common basis of comparison.34 We were unable to ascertain whether the type of metric used significantly affected our results, as only 1 study used Krippendorff’s alpha for eMS and 1 study used kappa for SES-CD. Third, there was inconsistency in the type of data used for endoscopic assessments among studies, with some using videos, others using images, and a few using both modalities. Evaluation of endoscopic images alone may not give a representative picture of the overall disease activity; we were again unable to ascertain whether this difference is truly significant in subgroup analyses due to the limited number of studies that used only images. Last, there was evidence of publication bias detected via asymmetry in the funnel plots.
Conclusions
Endoscopic assessment is a critical component of diagnosis and monitoring in patients with IBD. Although inter- and intraobserver agreement on endoscopic scoring systems in UC and CD is generally good, there remains room for improvement, especially for the eMS in UC, which demonstrates only moderate interobserver agreement. In current IBD clinical practice, the use of such scoring systems is limited by the lack of experienced human readers adequately trained to apply them in a standardized manner. In addition, central reading for clinical trials is a time- and resource-intensive commitment. Emerging technologies like AI may hold the key to automating image analysis and endoscopic assessment, standardizing implementation in practice and across clinical trials.
Supplementary data
Supplementary data is available at Inflammatory Bowel Diseases online.
Funding
Support for this study was provided by Iterative Health, Inc.
Conflicts of Interest
J.G.H. has served on the advisory board for BMS. F.A.F has the following to disclose: has served on the advisory committee or as a board member for AbbVie, Avalo Therapeutics, BMS, Braintree Labs, Fresenius Kabi, GSK, Iterative Health, Janssen, Pfizer, Pharmacosmos, Sandoz Immunology, Sebela, and Viatris; and as an independent contractor for GI Reviewers and IBD Educational Group. Y.W., D.R.C., S.B. are employees of and hold stock or stock options in Iterative Health, Inc. G.Y.M. has served as a consultant for AbbVie, Amgen, Arena, BI, BMS, Dieta, Ferring, Fresenius Kabi, Genentech, Gilead, Janssen, Oshi, Pfizer, Prometheus Labs, Samsung Bioepis, Takeda, Techlab, Verantos, and Viatris; is an owner and has ownership interest in Dieta and Oshi; is an employee of Eli Lilly; has served as an independent contractor for GI Reviewers; and has received grant/research support from Pfizer. The remaining authors disclose no conflicts.
References
Author notes
Jana G Hashash and Faye Yu Ci Ng contributed equally to this work.