Identifying the Presence, Activity, and Status of Extraintestinal Manifestations of Inflammatory Bowel Disease Using Natural Language Processing of Clinical Notes

Patient Characteristics

Age, y	41.8 ± 14.2
Male	588 (47.4)
IBD type, Crohn’s disease	646 (52.1)
Smoking history	336 (27.1)
Medication exposure
5-ASA	522 (42.1)
Thiopurine	459 (37)
Methotrexate	95 (7.7)
Biologic exposure
Adalimumab	197 (15.9)
Infliximab	169 (13.6)
Certolizumab pegol	16 (1.3)
Golimumab	5 (0.4)
Vedolizumab	71 (5.7)
Ustekinumab	36 (2.9)

Values are mean ± SD or n (%).

Abbreviations: 5-ASA, 5-aminosalicylic acid; IBD, inflammatory bowel disease.

Human Reviewer Detection of EIMs and Agreement on EIM Status

The 1702 identified EIM mentions comprised arthritis (54.4%), ocular disease (17.2%), psoriasis (17.0%), erythema nodosum (4.0%), pyoderma gangrenosum (3.8%), and hidradenitis suppurativa (3.8%). Overall, reviewer detection of EIM mentions was excellent, with both reviewers agreeing on detection in 94.6% of all EIMs (range by EIM type, 87.3%-96.6%) (Table 2). Reviewers agreed on exact EIM activity status in 76.2% of EIMs identified (κ = 0.74; 95% CI, 0.70-0.78). Exact agreement on specific activity class varied by EIM type, ranging from very good for arthritis (κ = 0.74) and psoriasis (κ = 0.75) to only fair for hidradenitis suppurativa (κ = 0.45).

Table 2.

Comparison of Overall Status Agreement Between Human Reviewers

Comparison of Paired Human Reviewers on EIM Status Determination
EIM Status	Reviewer 1
	Not Active			Active
	Negated	Historical	Resolved	Improved	Worsened	Uncertain
Negated	166	2	8	3	1	11
Historical	8	516	11	14	1	108
Resolved	16	3	63	8	3	12
Improved	2	5	11	94	11	42
Worsened	2	2	1	12	114	43
Uncertain	3	10	10	21	21	344

Comparison of Human Reviewer Detection and Agreement on EIM Status
EIM Type	Detection Accuracy	Specific EIM Agreement		General EIM Agreement
		Accuracy	Kappa (95% CI)	Accuracy	Kappa (95% CI)
Arthritis	96.6%	0.80	0.74 (0.69-0.80)	0.89	0.78 (0.74-0.82)
Psoriasis	90.8%	0.72	0.75 (0.66-0.85)	0.81	0.62 (0.52-0.73)
Ocular EIM	96.4%	0.70	0.69 (0.58-0.80)	0.85	0.71 (0.62-0.78)
Erythema nodosum	87.3%	0.85	0.79 (0.62-0.96)	0.88	0.74 (0.57-0.91)
Pyoderma gangrenosum	92.3%	0.75	0.72 (0.53-0.91)	0.89	0.77 (0.61-0.93)
Hidradenitis suppurativa	96.4%	0.63	0.45 (0.13-0.77)	0.86	0.39 (0.03-0.76)
Overall	94.6%	0.76	0.74 (0.70-0.78)	0.88	0.76 (0.73-0.79)

Abbreviations: CI, confidence interval; EIM, extraintestinal manifestation.

When EIM status was compressed to a more general active vs inactive classification, reviewer agreement was similar to that for specific status classifications. Reviewers agreed on 85.8% of general EIM activity status judgments across all EIMs, constituting very good agreement (κ = 0.76; 95% CI, 0.73-0.79). Again, agreement on general activity status varied by EIM type, ranging from very good for arthritis activity (89.4%; κ = 0.78) to only fair for hidradenitis suppurativa (86.2%; κ = 0.39). Across all EIM mentions, 10.6% were classified as activity-uncertain by human reviewers due to either insufficient or conflicting EIM status information even after adjudication. Activity-uncertain status ranged from 4.9% (arthritis) to 31.1% (hidradenitis suppurativa). In addition, the dataset was skewed toward inactive EIMs (historic, negated, or resolved EIM mentions), which comprised 72.9% of the reviewed EIM instances.
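As a worked illustration of the agreement statistics above, the following Python sketch computes raw agreement and an unweighted Cohen's kappa from the paired-reviewer counts in Table 2, first for the six specific statuses and then after collapsing them to a general classification. Two points are assumptions rather than details taken from the study: the kappa is unweighted, and "Uncertain" is grouped under "Active" because that is how the table's column header appears to span. The function and variable names are illustrative only.

```python
# Status order used for both rows and columns of Table 2.
STATUSES = ["Negated", "Historical", "Resolved", "Improved", "Worsened", "Uncertain"]

# Paired-reviewer counts copied from Table 2 (one reviewer per axis).
TABLE_2 = [
    [166, 2, 8, 3, 1, 11],
    [8, 516, 11, 14, 1, 108],
    [16, 3, 63, 8, 3, 12],
    [2, 5, 11, 94, 11, 42],
    [2, 2, 1, 12, 114, 43],
    [3, 10, 10, 21, 21, 344],
]

# Assumption: "Uncertain" sits under the "Active" header, as the table layout suggests.
GENERAL = {
    "Negated": "Not Active", "Historical": "Not Active", "Resolved": "Not Active",
    "Improved": "Active", "Worsened": "Active", "Uncertain": "Active",
}


def observed_agreement(matrix):
    """Fraction of paired judgments on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def cohen_kappa(matrix):
    """Unweighted Cohen's kappa for a square agreement matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    p_o = observed_agreement(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    # Agreement expected by chance from the two reviewers' marginal totals.
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (p_o - p_e) / (1 - p_e)


def collapse(matrix, labels, mapping):
    """Merge rows and columns of the matrix according to a label mapping."""
    groups = sorted(set(mapping.values()))
    index = {g: k for k, g in enumerate(groups)}
    out = [[0] * len(groups) for _ in groups]
    for i, row_label in enumerate(labels):
        for j, col_label in enumerate(labels):
            out[index[mapping[row_label]]][index[mapping[col_label]]] += matrix[i][j]
    return out


general_matrix = collapse(TABLE_2, STATUSES, GENERAL)
print(f"specific: agreement={observed_agreement(TABLE_2):.3f}, kappa={cohen_kappa(TABLE_2):.3f}")
print(f"general:  agreement={observed_agreement(general_matrix):.3f}, kappa={cohen_kappa(general_matrix):.3f}")
```

The diagonal of the six-class matrix sums to 1297 of 1702 paired judgments, which corresponds to the 76.2% exact agreement reported in the text. A kappa computed this way need not reproduce the reported values exactly, since the weighting scheme and any exclusions used in the study are not described in this section.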

Performance of NLP EIM Status Prediction Compared With Human Reviewers

Automated NLP detection of EIM mentions was nearly perfect, with an accuracy of 97.8% and a sensitivity and specificity of 0.961 and 0.991, respectively, compared with adjudicated human review. The NLP pipeline was unable to determine EIM status in 24.0% of EIMs, a proportion that ranged from 6.7% for erythema nodosum to 51.6% for hidradenitis suppurativa. NLP and human reviewers agreed on EIM uncertainty status in 61.3% of cases. The 1240 documents were automatically processed by the NLP pipeline in approximately 6 hours, compared with approximately 200 hours (10 minutes per document) required by each human reviewer.

NLP prediction of specific EIM status had fair performance compared with human reviewers across all EIMs, with an overall accuracy, sensitivity, specificity, and agreement of 92.4%, 0.77, 0.95, and κ = 0.77 (95% CI, 0.75-0.79), respectively (Table 3). The accuracy and agreement of the automated NLP pipeline for specific EIM status were similar to the agreement between paired human reviewers, with overall EIM classification kappa of 0.77 vs 0.76, respectively. Similar to human reviewers, the automated NLP pipeline performance was lowest for ocular and hidradenitis suppurativa EIMs (κ = 0.54 and κ = 0.56, respectively).

Table 3.

NLP Performance for Inferring Specific EIM Status

EIM Type	Number of Instances	All EIM Predictions
		Adjudicated Certain	NLP Certain	Accuracy	Sensitivity	Specificity	Kappa (95% CI)
Arthritis	925	95.1%	81.6%	0.95	0.85	0.97	0.84 (0.81-0.87)
Psoriasis	287	87.9%	69.3%	0.91	0.73	0.95	0.76 (0.70-0.81)
Ocular EIM	292	78.8%	67.5%	0.85	0.54	0.91	0.54 (0.48-0.61)
Erythema nodosum	73	89.6%	86.2%	0.95	0.86	0.97	0.85 (0.77-0.93)
Pyoderma gangrenosum	66	82.8%	69.7%	0.96	0.86	0.97	0.92 (0.86-0.98)
Hidradenitis suppurativa	64	68.8%	48.4%	0.87	0.61	0.92	0.56 (0.37-0.74)
Overall	1707	89.4%	75.9%	0.92	0.77	0.95	0.77 (0.75-0.79)

Abbreviations: CI, confidence interval; EIM, extraintestinal manifestation; NLP, natural language processing.

Regarding the prediction of general EIM status (active, inactive, uncertain), the NLP pipeline had an accuracy, sensitivity, and specificity of 94.1%, 0.92, and 0.95, respectively, with very good overall agreement with adjudicated human review (κ = 0.76; 95% CI, 0.74-0.79). NLP pipeline agreement with human reviewers ranged from excellent for arthritis (97.2%; κ = 0.84) and erythema nodosum (98.6%; κ = 0.82) to only fair for hidradenitis suppurativa (94.1%; κ = 0.47) (Table 4). Notably, human reviewers had similarly poor agreement for hidradenitis suppurativa.

Table 4.

NLP Performance for Inferring General EIM Status

EIM Type	Number of Instances	All EIM Predictions
		Adjudicated Certain	NLP Certain	Accuracy	Sensitivity	Specificity	Kappa (95% CI)
Arthritis	925	95.1%	81.6%	0.97	0.91	0.98	0.84 (0.81-0.88)
Psoriasis	287	87.9%	69.3%	0.93	0.93	0.92	0.72 (0.65-0.79)
Ocular EIM	292	78.8%	67.5%	0.86	0.87	0.86	0.56 (0.48-0.64)
Erythema nodosum	73	89.6%	86.2%	0.99	0.99	0.99	0.82 (0.67-0.97)
Pyoderma gangrenosum	66	82.8%	69.7%	0.99	0.99	0.97	0.74 (0.63-0.86)
Hidradenitis suppurativa	64	68.8%	48.4%	0.94	0.96	0.85	0.47 (0.28-0.66)
Overall	1707	89.4%	75.9%	0.94	0.92	0.95	0.76 (0.74-0.79)

Abbreviations: CI, confidence interval; EIM, extraintestinal manifestation; NLP, natural language processing.

Performance of Using Administrative Codes for Defining the Presence or Absence of EIMs

To better understand the value of EIM detection using NLP compared with administrative codes, ICD-10 codes for EIMs entered at the time of the encounter were compared with human annotations for determination of active EIM status. Assuming the presence of a diagnostic code indicates the presence of an active EIM at a given time, administrative data had overall poor performance, with an accuracy, sensitivity, and specificity of 73.3%, 0.32, and 0.83, respectively (Table 5). Agreement between EIM general activity status and the presence or absence of EIM administrative codes was also poor (κ = 0.26; 95% CI, 0.20-0.32). This analysis indicates that using diagnostic codes to measure the presence of EIMs will fail to capture the majority of active EIMs in IBD patients.
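The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's analysis code: it assumes each EIM mention has been reduced to two flags (whether an ICD-10 code for the EIM was entered at the encounter, and whether adjudicated human review judged the EIM active) and treats code presence as the prediction. The `Mention` class, the `code_performance` function, and the six toy records are hypothetical and are not study data.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    has_icd10_code: bool  # an EIM diagnostic code was entered at the encounter
    human_active: bool    # adjudicated human review judged the EIM active


def code_performance(mentions):
    """Score code presence against human-judged activity."""
    tp = sum(m.has_icd10_code and m.human_active for m in mentions)
    fn = sum((not m.has_icd10_code) and m.human_active for m in mentions)
    tn = sum((not m.has_icd10_code) and (not m.human_active) for m in mentions)
    fp = sum(m.has_icd10_code and (not m.human_active) for m in mentions)
    return {
        "accuracy": (tp + tn) / len(mentions),
        # Guard against empty classes so the sketch never divides by zero.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical toy records, for illustration only.
toy = [
    Mention(True, True),
    Mention(False, True),
    Mention(False, True),
    Mention(False, False),
    Mention(False, False),
    Mention(True, False),
]
result = code_performance(toy)
print(result)
```

Low sensitivity in this framing means that many mentions judged active by reviewers had no accompanying code, which is the pattern the overall results in Table 5 describe.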

Table 5.

Performance of Using Diagnostic Codes to Infer General EIM Activity Compared With Human Reviewers

EIM Type	Accuracy	Sensitivity	Specificity	Kappa (95% CI)
Arthritis	0.79	0.22	0.91	0.14 (0.03-0.25)
Psoriasis	0.53	0.62	0.48	0.09 (0.00-0.20)
Ocular EIM	0.74	0.16	0.98	0.17 (0.01-0.33)
Erythema nodosum	0.85	0.67	0.87	0.37 (0.01-0.73)
Pyoderma gangrenosum	0.69	0.48	0.84	0.33 (0.10-0.56)
Hidradenitis suppurativa	0.69	0.71	0.62	0.25 (0.01-0.49)
Overall	0.73	0.32	0.83	0.26 (0.20-0.32)

Abbreviations: CI, confidence interval; EIM, extraintestinal manifestation.

Discussion

In this proof-of-concept study, we show the potential for using NLP analysis of clinical narratives to offer more detail describing EIMs in IBD. NLP outperforms diagnostic codes for both detecting EIMs and inferring their status. EIM detection and agreement were generally very good and demonstrated similar performance to the agreement between human reviewers. In addition, automated methods could estimate specific characterizations of EIMs with good performance. A significant number of EIM references did not have a discernible status, owing to either insufficient information or conflicting statuses that could not be resolved even following human adjudication. In situations in which EIM documentation was not ambiguous, automated methods were capable of specific EIM information extraction tasks.

Automated NLP methods did not perfectly replicate human performance and were more frequently unable to make a status determination. A common factor limiting the NLP pipeline’s ability to detect EIM status was separation of the EIM mention from its status descriptors, which were sometimes located in distant parts of the document. In addition, EIM status would often be conflicting within the same document section, particularly when the author referenced the past, present, and possible future behavior of an EIM within the same section. The NLP pipeline was not designed to specifically detect temporal or anatomic references. As a result, an EIM could be active in one anatomic location but resolved in another, presenting both a prediction and annotation challenge. NLP methods are often challenged by authors’ hedging and speculation, as language cues fail to clearly convey whether a phrase is describing the current moment or a possible future.
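The effect of distance between an EIM mention and its status descriptor can be illustrated with a deliberately simple sketch. It is not the pipeline used in this study: it assumes a fixed token window around a single-word EIM term and a tiny cue lexicon, both of which are hypothetical (`STATUS_CUES`, `status_near_mention`, and the `window` parameter are illustrative names).

```python
import re

# Hypothetical cue lexicon, for illustration only.
STATUS_CUES = {
    "worsening": "worsened",
    "worse": "worsened",
    "improving": "improved",
    "improved": "improved",
    "resolved": "resolved",
}


def status_near_mention(text, eim_term, window=6):
    """Return the status of the nearest cue within `window` tokens of the EIM term.

    Returns None when the term is absent and "uncertain" when no cue is close enough.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if eim_term not in tokens:
        return None
    pos = tokens.index(eim_term)  # first occurrence of the EIM term
    nearest = None
    for i, tok in enumerate(tokens):
        if tok in STATUS_CUES and abs(i - pos) <= window:
            if nearest is None or abs(i - pos) < abs(nearest - pos):
                nearest = i
    return STATUS_CUES[tokens[nearest]] if nearest is not None else "uncertain"


close = "Her psoriasis is worsening on the elbows."
distant = ("Psoriasis noted. Patient also reports fatigue, poor sleep, and recent "
           "travel abroad; the rash is worsening.")
print(status_near_mention(close, "psoriasis"))
print(status_near_mention(distant, "psoriasis"))
```

In the second sentence the only status cue lies outside the window, so a window-based rule of this kind leaves the status undetermined even though a human reader would link the two statements. Widening the window reduces such misses but raises the risk of attaching a cue to the wrong concept.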

In addition, though the section-priority method frequently resolves EIM status conflicts, we found that the optimal prioritization of document sections was not static. The priority or importance of document sections may itself be dynamic, with different EIMs or clinical scenarios being better served by context-driven section priorities. Determining the optimal priority level for the history of present illness vs physical exam vs assessment and plan document sections based on clinical scenario will be investigated in future work.
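One plausible reading of a section-priority rule is sketched below in Python. The source does not spell out the implementation, so everything here is an assumption: the section names, the default priority order, and the choice to return "uncertain" when the highest-priority section that mentions the EIM contains conflicting statuses. The `priority` argument shows how a context-driven ordering could be supplied per EIM or clinical scenario.

```python
# Hypothetical default order; the text notes that no single static order was optimal.
SECTION_PRIORITY = ["assessment_and_plan", "physical_exam", "history_of_present_illness"]


def resolve_status(section_statuses, priority=SECTION_PRIORITY):
    """Pick one status for an EIM from the statuses found in each document section.

    `section_statuses` maps a section name to the set of statuses found there.
    """
    for section in priority:
        statuses = section_statuses.get(section, set())
        if len(statuses) == 1:
            return next(iter(statuses))
        if len(statuses) > 1:
            # Conflict inside the highest-priority section that mentions the EIM.
            return "uncertain"
    return "uncertain"


found = {
    "history_of_present_illness": {"historical"},
    "assessment_and_plan": {"worsened"},
}
print(resolve_status(found))
# A scenario-specific ordering can be passed in place of the default.
print(resolve_status(found, priority=["history_of_present_illness", "assessment_and_plan"]))
```

Passing the ordering as a parameter keeps the resolution logic unchanged while allowing the priority itself to vary, which is the kind of flexibility the paragraph above suggests may be needed.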

Some EIMs, including ocular and dermatologic EIMs, used unique descriptors of activity that were inconsistent and contributed to NLP status prediction failures. Finally, authors would often alternate between specific EIMs and general references that were undetected by the NLP pipeline. For example, an author may reference psoriasis without commenting on its behavior but, in the next paragraph of text, reference a rash with status information indicating worsening. While a human reviewer can easily infer that the rash refers to the psoriasis, NLP pipelines are challenged to connect those concepts and ensure that the co-reference is correct.

Enhanced detail on individual patients using existing text found in the EMR could substantially impact population-based research on EIMs and other concepts in IBD. As an example, studies examining the associations of EIMs among anti-integrin vs anti-tumor necrosis factor (TNF) users have had conflicting results using clinical trial data compared with real-world administrative datasets.^25-28 Ananthakrishnan et al⁹ used NLP to identify arthritis-related EIMs, reporting that vedolizumab users experienced more arthritis-related EIMs compared with anti-TNF users (46.1% vs 28.5%; P = .002). As in our study, they found that the sensitivity and specificity of NLP for detecting arthritis EIMs (0.83-0.92) was superior to ICD-9 diagnostic codes (0.52-0.89). In other work, NLP analysis of clinical notes doubled the detection of anti-TNF biologic use, increased detection of patients with fistula (12% vs 36%) or strictures (25% vs 40%), and improved detection of surgeries and hospitalizations compared with using medication lists, problem lists, and administrative data within a health system EMR.^29,30 These works highlight the added value of NLP over existing data sources for understanding granular features of disease, how NLP information can modify outcome assessment, and the potential for future detailed datasets at population scale.

Several limitations impact immediate implementation of NLP applications and should be considered when interpreting these results. First, for broad applicability, NLP pipelines must be generalizable, accounting for the many variations in documentation styles, templates, and author experience. Our work incorporated over 80 different authors, helping support potential generalizability of NLP performance, though this is not a substitute for formal validation using documents from multiple medical centers. Second, while our document set likely reflected real-world EIM prevalence, the EIM types collected were unbalanced, with a strong bias toward inactive status classifications. Generating a balanced dataset would require the reading of tens of thousands of documents, which was not feasible at this proof-of-concept phase.

Though clinical documentation is often considered ground truth, the information in notes may not be reliable. The aim of the study was to evaluate NLP accuracy and performance for extracting EIM detail as documented by clinicians, though this does not ensure the “correctness” of the information found in the note. However, NLP could conceivably help validate diagnoses and reported assessments in individual notes by examining longitudinal sets of documents for clinical information such as EIMs. Future work will aim to utilize a more diverse set of documents, improve dataset balance of EIM type and status, and explore methods for EIM diagnosis verification.

Though NLP is in its infancy, we expect a near future in which it is used to automatically collect and organize complex clinical descriptions of patients with IBD. NLP pipelines similar to those presented here are methodologically agnostic, with expectations of near identical performance in common EMR environments such as Cerner, Epic Systems, CPRS, and others. What will be required to make NLP systems a practical reality is rigorous validation of performance across a wide variety of authors by experience level, location, and local healthcare information technology environments. Present experimentation with NLP tools requires that centers have expertise and resources for complex bioinformatics and specific resources for computational linguistics. However, in time NLP toolkits will be easily deployed within healthcare systems and are likely to be directly integrated into the EMR with minimal tuning and optimization required.

Conclusions

We demonstrate the performance of a pilot system aimed to both detect the presence and describe the status of EIMs in IBD. Analyses of population data currently using diagnostic, procedural, and admission codes are hindered by the absence of granular disease descriptors. NLP analysis of ubiquitous clinical documents stored in EMRs could capture the nuanced data separating the experiences of individual IBD patients. Beyond descriptions of EIMs explored in this article, NLP systems could be designed to extract other useful clinical detail regarding phenotype, medication use and tolerance, and prior history. We expect NLP tools to have a major impact on IBD care and research over the coming years.

Author Contributions

R.W.S. was involved in study design, data collection, data abstraction, statistical analysis, manuscript drafting. D.Y. was involved in data science and engineering, statistical analysis, manuscript drafting. X.Z. was involved in data science and engineering, statistical analysis, manuscript drafting. S.B. was involved in data collection, data abstraction, manuscript drafting. M.R. was involved in data collection, data abstraction, manuscript drafting. C.B. was involved in data collection, data abstraction, manuscript drafting. V.G.V.V. was involved in study design, data collection, data science and engineering, statistical analysis, manuscript drafting.

Funding

R.W.S. and V.G.V.V. received investigator-initiated study funding support from AbbVie.

Conflicts of Interest

R.W.S. has served as a consultant for, advisory board member for, or received research grants from AbbVie, Janssen, Takeda, Gilead, Eli Lilly, Merck, Exact Sciences, and CorEvitas. All remaining authors have no conflicts of interest relevant to this publication.

References

1. Vavricka SR, Brun L, Ballabeni P, et al. Frequency and risk factors for extraintestinal manifestations in the Swiss inflammatory bowel disease cohort. Am J Gastroenterol. 2011;106:110–119.

2. Ananthakrishnan AN. Epidemiology and risk factors for IBD. Nat Rev Gastroenterol Hepatol. 2015;12:205–217.

3. Harbord M, Annese V, Vavricka SR, et al. The first European evidence-based consensus on extra-intestinal manifestations in inflammatory bowel disease. J Crohns Colitis. 2016;10:239–254.

4. Bottigliengo D, Berchialla P, Lanera C, et al. The role of genetic factors in characterizing extra-intestinal manifestations in Crohn’s Disease patients: are bayesian machine learning methods improving outcome predictions? J Clin Med. 2019;8:865.

5. Menti E, Lanera C, Lorenzoni G, et al. Bayesian machine learning techniques for revealing complex interactions among genetic and clinical factors in association with extra-intestinal Manifestations in IBD patients. AMIA Annu Symp Proc. 2016;2016:884–893.

6. van der Have M, Brakenhoff LKPM, van Erp SJH, et al. Back/joint pain, illness perceptions and coping are important predictors of quality of life and work productivity in patients with inflammatory bowel disease: a 12-month longitudinal study. J Crohns Colitis. 2015;9:276–283.

7. Jansson S, Malham M, Paerregaard A, et al. Extraintestinal manifestations are associated with disease severity in pediatric onset inflammatory bowel disease. J Pediatr Gastroenterol Nutr. 2020;71:40–45.

8. Patil SA, Cross RK. Update in the management of extraintestinal manifestations of inflammatory bowel disease. Curr Gastroenterol Rep. 2013;15:314.

9. Ananthakrishnan AN, Cai T, Savova G, et al. Improving case definition of Crohn’s disease and ulcerative colitis in electronic medical records using natural language processing: a novel informatics approach. Inflamm Bowel Dis. 2013;19:1411–1420.

10. Bernstein CN, Blanchard JF, Rawsthorne P, et al. The prevalence of extraintestinal diseases in inflammatory bowel disease: a population-based study. Am J Gastroenterol. 2001;96:1116–1122.

11. Yang BR, Choi N-K, Kim M-S, et al. Prevalence of extraintestinal manifestations in Korean inflammatory bowel disease patients. PLoS One. 2018;13:e0200363.

12. Masanz J, Pakhomov SV, Xu H, et al. Open source clinical NLP - more than any single system. AMIA Jt Summits Transl Sci Proc. 2014;2014:76–82.

13. Soysal E, Wang J, Jiang M, et al. CLAMP - a toolkit for efficiently building customized clinical natural language processing pipelines. J Am Med Inform Assoc. 2018;25:331–336.

14. Imler TD, Morea J, Kahi C, et al. Multi-center colonoscopy quality measurement utilizing natural language processing. Am J Gastroenterol. 2015;110:543–552.

15. Imler TD, Sherman S, Imperiale TF, et al. Provider-specific quality measurement for ERCP using natural language processing. Gastrointest Endosc. 2018;87:164–173.e2.

16. Van Vleck TT, Chan L, Coca SG, et al. Augmented intelligence with natural language processing applied to electronic health records for identifying patients with non-alcoholic fatty liver disease at risk for disease progression. Int J Med Inform. 2019;129:334–341.

17. Nevin L; PLOS Medicine Editors. Advancing the beneficial use of machine learning in health care and medicine: Toward a community understanding. PLoS Med. 2018;15:e1002708.

18. Seyed Tabib NS, Madgwick M, Sudhakar P, et al. Big data in IBD: big progress for clinical practice. Gut. 2020;69:1520–1532.

19. Hou JK, Tan M, Stidham RW, et al. Accuracy of diagnostic codes for identifying patients with ulcerative colitis and Crohn’s disease in the Veterans Affairs Health Care System. Dig Dis Sci. 2014;59:2406–2410.

20. Bird S, Klein E, Loper E. Natural Language Processing with Python. O’Reilly Media, Inc.; 2009.

21. Kang T, Perotte A, Tang Y, et al. UMLS-based data augmentation for natural language processing of clinical research literature. J Am Med Inform Assoc. 2021;28:812–823.

22. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32:D267–D270.

23. South BR, Shen S, Jones M, et al. Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease. BMC Bioinformatics. 2009;10(Suppl 9):S12.