Abstract

Objectives

Adverse event detection from Electronic Medical Records (EMRs) is challenging due to the low incidence of the event, variability in clinical documentation, and the complexity of data formats. Pulmonary embolism as an adverse event (PEAE) is particularly difficult to identify using existing approaches. This study aims to develop and evaluate a Large Language Model (LLM)-based framework for detecting PEAE from unstructured narrative data in EMRs.

Materials and Methods

We conducted a chart review of adult patients (aged 18-100 years) admitted to tertiary-care hospitals in Calgary, Alberta, Canada, between 2017 and 2022. We developed an LLM-based detection framework consisting of three modules: evidence extraction (implementing both keyword-based and semantic similarity-based filtering methods), discharge information extraction (focusing on six key clinical sections), and PEAE detection. Four open-source LLMs (Llama3, Mistral-7B, Gemma, and Phi-3) were evaluated using positive predictive value, sensitivity, specificity, and F1-score. Model performance for population-level surveillance was assessed at yearly, quarterly, and monthly granularities.

Results

The chart review included 10 066 patients, with 40 cases of PEAE identified (0.4% prevalence). All four LLMs demonstrated high sensitivity (87.5-100%) and specificity (94.9-98.9%) across different experimental conditions. Gemma achieved the highest F1-score (28.11%) using keyword-based retrieval with discharge summary inclusion, along with 98.4% specificity, 87.5% sensitivity, and 99.95% negative predictive value. Keyword-based filtering reduced the median chunks per patient from 789 to 310, while semantic filtering further reduced this to 9 chunks. Including discharge summaries improved performance metrics across most models. For population-level surveillance, all models showed strong correlation with actual PEAE trends at yearly granularity (r=0.92-0.99), with Llama3 achieving the highest correlation (0.988).

Discussion

The results of our method for PEAE detection using EMR notes demonstrate high sensitivity and specificity across all four tested LLMs, indicating strong performance in distinguishing PEAE from non-PEAE cases. However, the low incidence rate of PEAE contributed to a lower PPV. The keyword-based chunking approach consistently outperformed semantic similarity-based methods, achieving higher F1 scores and PPV, underscoring the importance of domain knowledge in text segmentation. Including discharge summaries further enhanced performance metrics. Our population-based analysis revealed better performance for yearly trends compared to monthly granularity, suggesting the framework's utility for long-term surveillance despite dataset imbalance. Error analysis identified contextual misinterpretation, terminology confusion, and preprocessing limitations as key challenges for future improvement.

Conclusions

Our proposed method demonstrates that LLMs can effectively detect PEAE from narrative EMRs with high sensitivity and specificity. While these models serve as effective screening tools to exclude non-PEAE cases, their lower PPV indicates they cannot be relied upon solely for definitive PEAE identification. Further chart review remains necessary for confirmation. Future work should focus on improving contextual understanding, medical terminology interpretation, and exploring advanced prompting techniques to enhance precision in adverse event detection from EMRs.

Introduction

Case identification is crucial in clinical care and research for identifying specific patients from electronic medical records (EMRs) based on predefined clinical or diagnostic criteria.1 A key focus is identifying adverse events: undesirable experiences associated with medical products or services, ranging from acute kidney injury2 to patient falls3 and diagnostic errors.4 Accurate and timely detection of these adverse events is essential for patient safety monitoring, quality improvement, and regulatory compliance.

Traditionally, case identification has relied on rule-based methods using structured data, such as the International Classification of Diseases (ICD) codes from administrative health data and keywords from clinical notes.5–8 While systematic, these approaches are often rigid and fail to capture the complexity of clinical documentation.6,9,10 More recently, researchers have explored machine learning-based approaches using supervised learning algorithms trained on annotated datasets.3,11–13 However, these models require large volumes of labeled data, which are resource-intensive and particularly difficult to obtain for hospital adverse events.

Pulmonary embolism as an adverse event (PEAE) presents a notable challenge in detection due to its complex causality and timing. Pulmonary embolism, in which a blood clot obstructs a lung artery, is a leading preventable cause of hospital death, with an annual incidence of 1-2 per 1000 patients in Canada.14,15 Identifying PEAE requires distinguishing hospital-acquired cases from community-acquired ones, establishing causality with medical procedures, and analyzing unstructured clinical documentation.16 In this study, we focus exclusively on differentiating PEAE from non-PEAE cases, where non-PEAE encompasses both patients without evidence of PE and those with PE not linked to an adverse event.

Given the complexities of the PEAE detection task, we propose a large language model (LLM)-based framework that leverages natural language processing to analyze unstructured clinical narratives. Pulmonary embolism as an adverse event serves as a representative case due to its reliance on temporal reasoning, causality assessment, and detailed documentation analysis. Beyond PEAE, our framework can be adapted for detecting other adverse events with similar challenges, such as hospital-acquired infections, medication-related complications, and post-procedural adverse events, by modifying the prompts and settings. This adaptable approach enables broader application in adverse event detection across diverse clinical scenarios.

We evaluate several pre-trained LLMs, including Llama3,17 Mistral 7B,18 Gemma,19 and Phi-3,20 to extract essential information from clinical notes, treatment records, diagnostic tests, and event timelines. By systematically evaluating these models across diverse configurations, we aim to identify the most effective approach for PEAE detection in our EMR text database. This study seeks to develop a robust, scalable solution that enhances detection capabilities while reducing reliance on large, manually labeled datasets, improving efficiency and adaptability in real-world clinical applications.

Methods

Study design and cohort

The proposed LLM framework was developed and tested within a secure computing environment, adhering to local health legislation and guidelines set by health authorities. It focused on adult patients aged 18 to 100 years who were admitted to hospitals in Calgary, Alberta, Canada. Patients without documented discharge summaries were excluded. A chart review cohort of patients randomly selected from Calgary’s tertiary-care hospitals between 2017 and 2022 was created. For patients with multiple hospitalizations, only the most recent admission was retained. Details have been reported elsewhere.21

The chart review was conducted by six registered nurses, following a predefined protocol for reviewing adverse events, as defined by our team.22 This included 18 categories of adverse events such as falls, sepsis, surgical site infections, and PE. The guidelines were established before beginning the chart review process. The chart reviewers were tasked with extracting basic patient demographic information, social determinants of health, comorbidities, and adverse events.

Understanding the association between adverse events and healthcare-associated factors is essential for determining whether conditions like PE should be classified as adverse events.16,23,24 Temporal inference focuses on establishing whether PEAE occurred before or during the hospital stay.24–26 The chart reviewers designated patients with confirmed PEAE as positive cases in their comprehensive review. All other scenarios, including PE present on admission, PE resulting from non-hospital-related events, and cases where no PE was found, were classified as negative cases.

Framework rationale

Detecting adverse events, particularly PEAE, requires a two-step analytical approach. The first step determines whether a PE was diagnosed during the hospitalization and establishes its temporal relationship to admission—specifically, whether it was present on admission or developed during the hospital stay. For cases where PE developed after admission, the second step evaluates whether it meets the criteria for a hospital-acquired adverse event by considering both timing and its relationship to medical care or procedures. This systematic evaluation process mirrors the approach used by chart reviewers in their manual assessment, distinguishing community-acquired PE from those that developed as potential adverse events during hospitalization.

Framework architecture

Our proposed LLM-based PEAE detection framework consists of three modules, as illustrated in Figure 1.

Figure 1. Framework for detecting hospital-acquired conditions using narrative EMR data. The framework consists of three modules: evidence extraction (EMR text chunking, filtering, and referencing), discharge information extraction (preprocessing, prompt-based LLM extraction, and post-processing), and adverse event detection (LLM-based analysis of discharge and extracted information). Abbreviations: AE, adverse event; EMR, electronic medical record; LLM, large language model.

Evidence extraction module

The evidence extraction module addresses a fundamental challenge in processing EMRs: converting extensive clinical narratives into concise, relevant text segments suitable for LLM analysis. This conversion is particularly crucial given that most LLMs have input token limitations,27 which necessitates careful selection and processing of clinical information. Our module processes every textual note during a patient's hospital stay, encompassing up to 120 distinct hospital note types, including progress notes, nursing documentation, diagnostic reports, and consultation notes.

Our framework implements a two-step process for extracting relevant information. The first step is chunking, which divides the original raw clinical documents into smaller segments such as paragraphs, sentences, or word sequences. This segmentation enables more efficient processing for LLM analysis and helps maintain performance, which tends to degrade with increased input length.
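A minimal sketch of the chunking step is shown below, assuming paragraph-level segmentation with long paragraphs split into fixed word windows; the exact chunk unit and window size used in the framework are configuration choices not specified here.

```python
import re

def chunk_note(note_text: str, max_words: int = 120) -> list[str]:
    """Split one clinical note into paragraph chunks; long paragraphs
    are further divided into fixed-size word windows."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", note_text):
        words = paragraph.split()
        if not words:
            continue
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks
```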

The second step employs chunk selection strategies to reduce the amount of text used for prompt generation. We implemented and evaluated two distinct approaches. The first strategy uses keyword-based filtering, where we developed a filter based on our chart review guidelines that selects chunks containing PE-relevant keywords such as “PE,” “blood clot,” “thrombosis,” “pe,” “ctpe,” and their variants. This method ensures the capture of all explicitly mentioned relevant information.
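The keyword filter itself reduces to a membership test over normalized text. In the sketch below, "PE," "pe," "ctpe," "blood clot," and "thrombosis" come from the text; "pulmonary embolism" is an assumed variant, and whole-word matching is one way to avoid spurious substring hits for short terms such as "pe."

```python
import re

SINGLE_TERMS = {"pe", "ctpe", "thrombosis"}
PHRASES = ("blood clot", "pulmonary embolism")

def keyword_filter(chunks: list[str]) -> list[str]:
    """Keep only chunks that mention at least one PE-related keyword."""
    kept = []
    for chunk in chunks:
        text = chunk.lower()
        tokens = set(re.findall(r"[a-z]+", text))  # whole-word match for short terms
        if tokens & SINGLE_TERMS or any(phrase in text for phrase in PHRASES):
            kept.append(chunk)
    return kept
```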

Our second strategy employs semantic similarity-based filtering, which proves particularly valuable when specific keywords are absent but synonymous content is present. The fundamental assumption underlying this approach was that chunks containing information confirming PEAE presence should be semantically similar, whereas those carrying negating or unrelated information should be semantically distinct. This approach requires two components: example sentences that mimic content from various clinical note types (including nursing notes, assessment notes, discharge summaries, and diagnostic imaging reports) and a pre-trained language model that embeds both the chunked EMR text and the example sentences into numerical vectors. The method calculates cosine similarity scores between chunks and example sentences, ranks the chunks, and selects the top n percent, where n is an adjustable threshold.
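A minimal sketch of this strategy using the sentence-transformers library is given below; the example sentences, checkpoint name, and top-percent threshold are illustrative stand-ins rather than the study's actual configuration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

EXAMPLES = [  # illustrative stand-ins for the curated example sentences
    "CT pulmonary angiogram demonstrates an acute pulmonary embolism.",
    "Patient developed a new pulmonary embolus during this admission.",
]

def semantic_filter(chunks, top_percent=5.0, model_name="all-mpnet-base-v2"):
    model = SentenceTransformer(model_name)
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)
    example_vecs = model.encode(EXAMPLES, normalize_embeddings=True)
    # With normalized embeddings, the dot product equals cosine similarity;
    # score each chunk by its best match over the example sentences.
    scores = (chunk_vecs @ example_vecs.T).max(axis=1)
    n_keep = max(1, int(len(chunks) * top_percent / 100))
    keep_idx = np.argsort(scores)[::-1][:n_keep]
    return [chunks[i] for i in sorted(keep_idx)]
```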

To optimize the embedding model for semantic filtering, we developed a triplet dataset inspired by the sentence-BERT approach.28 The dataset consists of three components designed to train the model to discern fine-grained distinctions between similar and dissimilar text chunks: anchors (text segments from EMR documents containing PE-related keywords), positive examples (chunks with similar PEAE-related keywords), and negative examples (either chunks from non-PE patient EMRs or negated versions of the anchors generated using Llama3).

We evaluated several embedding models, including BERT,29 BGE,30 BioClinicalBERT,31 E5,32 GTE,33 MPNet,34 and UAE.35 The triplet dataset comprised 9505 triplets, of which 1811 were reserved for evaluation and 7694 were used to fine-tune the best-performing model. This fine-tuning improved the model's adaptation to real-world data patterns.
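A minimal sketch of the triplet fine-tuning step with sentence-transformers is shown below; the checkpoint identifier, file path, and column layout are assumptions, and the training hyperparameters are illustrative.

```python
import csv
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

def load_triplets(path):
    """Read (anchor, positive, negative) rows from a tab-separated file (assumed layout)."""
    with open(path, newline="") as f:
        return [tuple(row[:3]) for row in csv.reader(f, delimiter="\t")]

model = SentenceTransformer("WhereIsAI/UAE-Large-V1")  # assumed UAE checkpoint
train_examples = [InputExample(texts=list(t)) for t in load_triplets("train_triplets.tsv")]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
# Triplet loss pulls anchors toward positives and away from negatives in embedding space.
train_loss = losses.TripletLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=100)
```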

Discharge information extraction module

The discharge information extraction module addresses the need to systematically process discharge summaries, which contain comprehensive information about each patient's entire hospital stay, including details that might be closely related to PEAE. While these summaries provide valuable clinical information, their extensive nature poses challenges for efficient processing.

Our extraction module focused on six sections within discharge summaries. These were identified by analyzing our EMR data structure and known clinical practices.36–39 These components include the reason for hospitalization, significant findings for the current admission, procedures and treatments provided, medical history, discharge condition, and patient and family instructions. We employed a straightforward prompt instructing the LLM to parse the discharge summary to extract these details. The prompt was designed to return results in a structured JSON format.
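A minimal sketch of such a prompt and its post-processing is shown below; the six section names follow the text, but the instruction wording and the snake_case keys are illustrative.

```python
import json

SECTIONS = [
    "reason_for_hospitalization", "significant_findings", "procedures_and_treatments",
    "medical_history", "discharge_condition", "patient_and_family_instructions",
]

def build_extraction_prompt(discharge_summary: str) -> str:
    keys = ", ".join(f'"{s}"' for s in SECTIONS)
    return (
        "Parse the discharge summary below and extract the following details. "
        f"Return a JSON object with exactly these keys: {keys}.\n\n"
        f"Discharge summary:\n{discharge_summary}"
    )

def parse_extraction(llm_output: str) -> dict:
    """Post-processing: recover the JSON object, tolerating extra prose around it."""
    start, end = llm_output.find("{"), llm_output.rfind("}")
    if start == -1 or end <= start:
        return {}
    return json.loads(llm_output[start:end + 1])
```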

PEAE detection module

Following the extraction process, our framework employs a detection module to determine whether the given EMR chart indicates the presence of PEAE. This module aggregates multiple data points extracted from previous modules, including evidence of PE, discharge information, and temporal information. These elements are stored in individual files to standardize input for further processing and are converted into structured prompts using a custom rule-based method implemented in Python. The method accommodates variations in prompt structure required by different LLMs. For example, Phi-3 employs "<|im_start|>" and "<|im_end|>" tags to delineate different parts of the prompt, while Mistral-7B uses "[INST]" tags for instructions (see Supplementary Materials "Prompt Engineering" section).
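A minimal sketch of this rule-based assembly is shown below; the Phi-3 and Mistral tag conventions follow the text, while the fragment keys and prompt wording are illustrative.

```python
def format_prompt(model_name: str, system: str, user: str) -> str:
    """Wrap the same content in model-specific tags."""
    if model_name.startswith("phi3"):
        return (f"<|im_start|>system\n{system}<|im_end|>\n"
                f"<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n")
    if model_name.startswith("mistral"):
        return f"[INST] {system}\n\n{user} [/INST]"
    return f"{system}\n\n{user}"  # plain fallback for models without special tags

def build_peae_prompt(model_name: str, fragments: dict) -> str:
    user = (f"Admission date: {fragments['admission_date']}\n"
            f"Evidence of PE: {fragments['pe_evidence']}\n"
            f"Discharge information: {fragments['discharge_info']}\n"
            "Did the patient develop a pulmonary embolism as a hospital-acquired "
            "adverse event? Answer yes or no, then justify.")
    system = "You are reviewing medical notes for hospital adverse events."
    return format_prompt(model_name, system, user)
```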

The PEAE detection module contains three main components: PEAE detection prompt construction, LLM inference, and post-processing classification. In the prompt construction phase, we transform the aggregated information fragments into structured prompts using a rule-based approach. Each prompt integrates information fragments according to specific LLM requirements (Figure 2). Then, we assessed four publicly available open-source LLMs for the PEAE identification task. These models include Llama317 (Meta, Meta-Llama-3-8B-Instruct), Mistral 7B18 (Mistral AI, Mistral-7B-Instruct-v0.3), Gemma19 (Google, Gemma-7B-It), and Phi-320 (Phi, Phi-3-Medium-128K-Instruct). Each model’s sequential response begins with a binary classification—”yes” or “no”—followed by a justification.

Figure 2. An example prompt template for detecting pulmonary embolism as an adverse event. Key sections include the task description (instructions for medical note review), temporal information (admission date and timeline references), source references (diagnostic imaging, progress notes, triage reports), evidence of pulmonary embolism (extracted clinical findings), discharge summary (hospital course and discharge details), and response instructions (guidance for determining whether the PE was hospital-acquired).

To derive a final classification, we apply a simple heuristic to the output: if the first sentence contains "yes," we label the case as positive (1); otherwise, we label it as negative (0). In doing so, the detection module provides a comprehensive, automated assessment of each patient's condition and the potential presence of PEAE.
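This heuristic amounts to a few lines of string handling, sketched below with two toy responses as a usage check.

```python
import re

def classify_response(response: str) -> int:
    """Label 1 if the first sentence of the model response contains 'yes', else 0."""
    first_sentence = re.split(r"(?<=[.!?])\s", response.strip(), maxsplit=1)[0]
    return 1 if "yes" in first_sentence.lower() else 0

assert classify_response("Yes. The PE developed on day 5 after surgery.") == 1
assert classify_response("No, the embolism was present on admission.") == 0
```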

Experiment settings

All models for processing large-scale narrative EMRs were implemented within a secure computing environment using Python 3.11,40 PyTorch 2.3.0,41 and the Hugging Face Transformers 2.1 library.42

The experiments were conducted in a secure, health authority-approved computing environment with an NVIDIA H100 GPU cluster running CUDA 12.4. This infrastructure offered the necessary computational capacity to process and apply LLMs to the EMR data efficiently.

Statistical analysis

We assessed the performance and validity of our proposed framework by comparing its results to those obtained through chart review. Evaluation metrics included positive predictive value (PPV), sensitivity, negative predictive value, specificity, and F1-score, each reported with a 95% confidence interval (CI). To compute these metrics, we compared the framework-generated PEAE outcome to the binary outcomes from the chart review. The 95% CI was estimated using 10 000 bootstrap resamples, with metrics calculated for each resample and summarized based on their distribution.
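A minimal sketch of this bootstrap procedure is shown below, assuming aligned binary arrays of chart review labels and framework predictions; sensitivity is used as the example metric.

```python
import numpy as np

def sensitivity(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn) if (tp + fn) else 0.0

def bootstrap_ci(y_true, y_pred, metric, n_resamples=10_000, seed=0):
    """Resample patients with replacement and summarize the metric's distribution."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(y_true), len(y_true))
        stats.append(metric(y_true[idx], y_pred[idx]))
    lower, upper = np.percentile(stats, [2.5, 97.5])
    return float(np.mean(stats)), (float(lower), float(upper))
```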

We further extended the analysis to assess the potential of our model for population-based surveillance of adverse events. Specifically, we evaluated the model's ability to track PEAE incidence at different time granularities: year, year-quarter (Q), and year-month. For each granularity, we calculated the mean squared error, R², and Pearson correlation coefficient to evaluate how well the predicted counts aligned with the actual counts.
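A minimal sketch of this evaluation is shown below, assuming per-patient admission dates; counts are aggregated per period and compared with the three metrics. The "YE"/"QE"/"ME" period aliases assume pandas 2.2 or later.

```python
import pandas as pd
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, r2_score

def surveillance_metrics(dates, y_true, y_pred, freq="YE"):
    """freq: 'YE' (year), 'QE' (quarter), or 'ME' (month)."""
    df = pd.DataFrame({"date": pd.to_datetime(dates), "true": y_true, "pred": y_pred})
    counts = df.set_index("date").resample(freq).sum()  # event counts per period
    r, _ = pearsonr(counts["true"], counts["pred"])
    return {"correlation": float(r),
            "mse": mean_squared_error(counts["true"], counts["pred"]),
            "r2": r2_score(counts["true"], counts["pred"])}
```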

Results

Characteristics of the datasets

Registered nurses conducted a comprehensive chart review of patient records (n = 10 066), identifying 40 cases of PEAE. Table 1 presents the demographic and clinical characteristics of the study population, comparing patients with confirmed PEAE to those without (non-PEAE). The analysis includes key variables such as age distribution, sex, death in hospital, number of comorbidities, and length of hospital stay. Statistical significance was assessed using appropriate tests, and P-values are reported for each comparison.

Table 1.

The study cohort characteristics.

| Variable | All (n = 10 066) | Confirmed PEAE (n = 40) | Confirmed non-PEAE (n = 10 026) | P-value |
| --- | --- | --- | --- | --- |
| Age (%) | | | | |
| <50 | 3022 (30.0) | 11 (27.5) | 3011 (30.0) | .931 |
| 50-64 | 2493 (24.8) | 9 (22.5) | 2484 (24.8) | |
| 65-74 | 1921 (19.1) | 9 (22.5) | 1912 (19.1) | |
| ≥75 | 2630 (26.1) | 11 (27.5) | 2619 (26.1) | |
| Sex (%) | | | | |
| Female | 5246 (52.1) | 16 (40.0) | 5230 (52.2) | .153 |
| Male | 4820 (47.9) | 24 (60.0) | 4796 (47.8) | |
| Death in hospital (%) | 717 (7.1) | 5 (12.5) | 712 (7.1) | .204 |
| Number of comorbidities (%) | | | | |
| 0 | 2416 (24.0) | 6 (15.0) | 2410 (24.0) | .352 |
| 1 | 2340 (23.2) | 9 (22.5) | 2331 (23.2) | |
| ≥2 | 5310 (52.8) | 25 (62.5) | 5285 (52.7) | |
| Length of stay (%) | | | | |
| 1-4 days | 4881 (48.5) | 4 (10.0) | 4877 (48.6) | <.001 |
| ≥5 days | 5185 (51.5) | 36 (90.0) | 5149 (51.4) | |

We evaluated two distinct filtering approaches to identify content relevant to PEAE. The first approach, a keyword-based filter, excluded chunks that lacked predefined PE-related keywords, reducing the total chunks across the dataset from 1 972 188 to 623 081 (Table 2). This filtering step decreased the median chunks per patient from 789 to 310.

Table 2.

Chunk and word counts under the different filtering strategies.

| Filtering step | Total chunks across the dataset | Median number of chunks per patient (IQR) | Median total word count per patient (IQR) |
| --- | --- | --- | --- |
| Raw text (before filter) | 1 972 188 | 789.5 (428.8-1569.5) | 17189.0 (9123.8-34822.0) |
| Keyword-based filtered | 623 081 | 310.0 (195.0-505.0) | 6854.5 (4189.75-11372.0) |
| Semantic filtered | 22 749 | 9.0 (2.0-19.3) | 216.5 (53.0-506.0) |

The second approach applied a semantic similarity-based filter, using cosine similarity to compare chunks against ten example sentences indicating PE conditions. The results for the embedding models evaluated for semantic filtering are presented in Supplementary Materials (see Table S1). The fine-tuned UAE embedding was used for all final evaluations. The reported triplet loss and accuracy values correspond to the validation set to ensure an unbiased assessment of the model's performance. This semantic filtering method refined the dataset to 22 749 chunks, with a median of 9 chunks per patient and a median word count of 216 per patient. Both filtering strategies effectively reduced the dataset size while preserving clinically relevant PEAE information.

PEAE detection results

We conducted a series of experiments to evaluate the performance of four LLMs under different conditions. These models were assessed based on two primary factors: (1) the inclusion or exclusion of discharge information and (2) keyword-based (KW) versus semantic similarity-based (SS) retrieval methods. Table 3 presents the performance metrics of each model, benchmarked against the manual chart review, which served as the gold standard. Additionally, we compared the LLM-based approach to a rule-based method using International Classification of Diseases, 10th Revision (ICD-10) codes as a secondary baseline.

Table 3.

PEAE detection performance of each model, % (95% confidence interval).

| Model | Retrieval | DCSM | PPV | Sensitivity | Specificity | NPV | F1 score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma | KW | Yes | 16.75 (11.85-21.92) | 87.50 (76.32-97.14) | 98.40 (98.16-98.63) | 99.95 (99.91-99.99) | 28.11 (20.91-35.48) |
| Gemma | KW | No | 11.14 (7.87-14.53) | 95.0 (86.96-100.00) | 97.22 (96.9-97.52) | 99.98 (99.95-100.0) | 19.95 (14.53-25.41) |
| Gemma | SS | Yes | 12.46 (8.86-16.16) | 95.00 (86.96-100.00) | 97.55 (97.25-97.83) | 99.98 (99.95-100.0) | 22.03 (16.28-27.83) |
| Gemma | SS | No | 11.08 (7.83-14.49) | 95.00 (87.18-100.0) | 97.20 (96.88-97.5) | 99.98 (99.95-100.0) | 19.84 (14.55-25.13) |
| Llama3 | KW | Yes | 14.34 (10.04-18.88) | 87.50 (76.19-97.22) | 98.08 (97.81-98.34) | 99.95 (99.91-99.99) | 24.65 (18.12-31.27) |
| Llama3 | KW | No | 8.85 (6.33-11.57) | 100.00 (100.0-100.0) | 96.21 (95.86-96.56) | 100.0 (100.0-100.0) | 16.26 (11.84-20.75) |
| Llama3 | SS | Yes | 11.34 (8.12-14.88) | 97.50 (91.49-100.0) | 97.2 (96.88-97.51) | 99.99 (99.97-100.0) | 20.31 (15.05-25.77) |
| Llama3 | SS | No | 8.52 (6.07-11.31) | 97.50 (91.67-100.0) | 96.15 (95.78-96.5) | 99.99 (99.97-100.0) | 15.66 (11.32-19.97) |
| Mistral | KW | Yes | 14.04 (10.08-18.21) | 100.0 (100.0-100.0) | 97.75 (97.46-98.02) | 100.0 (100.0-100.0) | 24.62 (18.36-30.94) |
| Mistral | KW | No | 11.14 (7.97-14.55) | 97.50 (91.67-100.0) | 97.14 (96.83-97.45) | 99.99 (99.97-100.0) | 20.0 (14.73-25.29) |
| Mistral | SS | Yes | 11.63 (8.31-15.15) | 100.0 (100.0-100.0) | 97.21 (96.9-97.52) | 100.0 (100.0-100.0) | 20.83 (15.38-26.25) |
| Mistral | SS | No | 11.08 (7.9-14.44) | 97.50 (91.67-100.0) | 97.12 (96.8-97.43) | 99.99 (99.97-100.0) | 19.9 (14.58-25.13) |
| Phi3 | KW | Yes | 15.27 (9.38-21.54) | 50.00 (34.21-65.71) | 98.98 (98.79-99.16) | 99.81 (99.73-99.89) | 23.39 (14.81-31.85) |
| Phi3 | KW | No | 12.02 (7.47-16.95) | 55.00 (39.47-70.59) | 98.52 (98.29-98.74) | 99.83 (99.75-99.91) | 19.73 (12.77-26.72) |
| Phi3 | SS | Yes | 9.32 (6.68-12.11) | 100.00 (100.0-100.0) | 96.42 (96.08-96.77) | 100.0 (100.0-100.0) | 17.06 (12.47-21.65) |
| Phi3 | SS | No | 6.76 (4.83-8.88) | 100.00 (100.0-100.0) | 94.93 (94.5-95.33) | 100.0 (100.0-100.0) | 12.66 (9.16-16.18) |
| ICD-based method (baseline) | | | 37.5 (19.1-58.3) | 14.52 (6.5-23.7) | 99.85 (99.8-99.9) | 99.48 (99.3-99.6) | 20.93 (9.9-31.9) |

This table summarizes the performance metrics of the four models using different methods. The results were based on 10 000 bootstrap resamples to estimate performance variability, with metrics reported as the mean and 95% confidence intervals. The baseline is a rule-based method using ICD codes from a Discharge Abstract Database.

Abbreviations used: DCSM, inclusion of discharge information; ICD, International Classification of Diseases; KW, keyword-based retrieval; NPV, negative predictive value; PPV, positive predictive value; SS, semantic similarity-based retrieval.

Overall, none of the LLMs fully replicated the results of manual chart review, but they demonstrated varying levels of agreement with it across different retrieval conditions. Among the tested models, Gemma (keyword-based retrieval with discharge summaries) achieved the highest F1-score (28.11%), with a sensitivity of 87.50% and specificity of 98.40%. While this represents an improvement over other LLM configurations, the PPV remained relatively low across all models (ranging from 6.76% to 16.75%), suggesting that LLMs still generate a substantial number of false positives compared to manual abstraction.

The inclusion of discharge summaries generally improved F1 scores across models, likely due to the presence of more structured clinical narratives. For instance, for the Mistral model with KW retrieval, the F1 score increased from 20.00% (without discharge information) to 24.62% (with discharge information). Similarly, in Llama3, the F1 score improved from 16.26% to 24.65% when discharge summaries were included. Additionally, we compared the performance of KW retrieval with SS retrieval. The KW retrieval method generally outperformed SS retrieval across different models, particularly in F1 score and PPV. However, SS retrieval often achieved comparable or higher sensitivity, indicating that it captures more potential cases but at the cost of increased false positives.

Compared to the ICD-10-based method, LLMs achieved higher sensitivity (50% to 100%) but had lower PPV. The ICD-10-based approach had a higher PPV (37.5%) but much lower sensitivity (14.52%), missing many true cases. This tradeoff highlights the strengths and limitations of each method.

We evaluated the models' ability to track PEAE incidence over time at yearly, quarterly, and monthly granularities, reflecting their effectiveness in population-based surveillance. Table 4 summarizes the performance of the four LLMs at the yearly granularity, at which the models achieved their highest correlations and lowest errors.

Table 4.

Model performance for population-level surveillance at year granularity.

| Model | Correlation | Mean squared error | R² |
| --- | --- | --- | --- |
| Llama3 | 0.988 | 0.003 | 0.975 |
| Phi3 | 0.978 | 0.005 | 0.953 |
| Mistral | 0.959 | 0.009 | 0.918 |
| Gemma | 0.921 | 0.019 | 0.828 |

Discussion

The results of our method for PEAE detection using EMR notes demonstrated that all tested LLMs achieved high sensitivity and specificity, indicating a strong ability to distinguish PEAE cases from non-PEAE cases. Additionally, the high sensitivity and NPV across all four models underscore their effectiveness as filtering tools, retaining as many positive cases as possible while confidently excluding negative cases.

When comparing the KW chunk selection method with the SS method, we found that the KW approach consistently achieved higher F1-scores and PPV. Our analysis showed that domain-specific keywords were highly effective in selecting relevant chunks of EMR notes for PEAE detection. For instance, KW chunk filtering achieved a higher F1 score for the Gemma model than SS filtering (28.11% vs 22.03%). These results underscore the importance of incorporating domain knowledge into the chunk selection process for optimal performance.

We also observed that including the discharge information further enhanced the metrics, particularly the F1 score. It provided additional context that better captured relevant information and reduced misclassification.

The population-based analysis underscores the challenges of detecting trends in low-incidence-rate conditions like PEAE. While all LLM models demonstrated strong performance at the yearly granularity, finer granularities such as monthly trends revealed greater variability in predictions. This variability is driven by the sparsity of PEAE cases in the dataset, which amplifies the impact of small fluctuations on model predictions. Larger aggregates at the yearly level reduce this variability, resulting in closer alignment between predicted and actual trends.

These findings suggest that the framework is well suited for identifying long-term trends in population-level surveillance, even in settings with imbalanced datasets. Future research could explore further refinements to enhance the framework’s performance for finer-grained analysis.

Limitations

Almost all errors were false positives. We further investigated the errors generated by our LLM-based PEAE detection framework and identified several primary error types. These errors highlight the limitations faced by the proposed framework and the tested LLMs in accurately interpreting complex medical information.

Contextual misinterpretation

The most common errors stemmed from the LLMs’ difficulty in accurately interpreting clinical context. This issue appeared in several ways. Temporal confusion led LLMs to misinterpret historical medical events as current conditions. Diagnostic uncertainty resulted in suspected conditions being frequently misclassified as confirmed diagnoses.

Limited capacity in comprehending terminology

We identified errors that reflected the framework's medical knowledge and terminology limitations. Similar medical terms were occasionally conflated, such as misinterpreting pulmonary edema as PE. In addition, the LLMs sometimes failed to interpret specialized sections of medical notes correctly. For example, in the "History of Present Illness" section, a past diagnosis of deep vein thrombosis was misinterpreted as evidence of current PE, despite contextual cues indicating successful treatment and no recent clot formation. Furthermore, certain treatments were overgeneralized as indicators of specific conditions without accounting for their use in multiple medical contexts, such as interpreting anticoagulant use for PE prophylaxis as evidence of a confirmed diagnosis.

Near miss and implicit cases

Chart review involves reading the clinician's notes to identify diagnoses and confirm the presence or absence of PEAE without making assumptions. Our chart reviewers determined PEAE using the information available in the chart; however, the chart sometimes contained insufficient information to determine whether a PEAE occurred. There were cases where patients received empiric therapy for suspected PE, but the diagnosis was not confirmed through documented follow-up diagnostics. Such cases were labeled as negative by the chart reviewers, who relied solely on explicitly documented information and could not infer a diagnosis without confirmation explicitly noted in the chart.

Preprocessing and discharge information extraction

There are technical limitations in the preprocessing of medical data. Specifically, both keyword and semantic chunk search algorithms fail to preserve semantically consistent segments, leading to incorrect grouping of information. Additionally, the discharge information extraction module occasionally overfocused on primary diagnoses while omitting crucial secondary information, distorting the overall clinical picture.

Prompt engineering

The prompting strategies used in this study were designed to elicit binary outputs followed by brief explanations. While effective, future work could explore advanced techniques such as chain-of-thought prompting,43 which structures the explanation before the final response, potentially improving the interpretability and performance of the outputs. Additionally, few-shot prompting44,45 may enhance the ability of LLMs to generalize across diverse scenarios. These adaptations could further optimize performance and robustness in detecting PEAE and other adverse events.

Conclusion

Our proposed method demonstrates the strong potential of LLMs for detecting PEAE, with high sensitivity and specificity indicating effective discrimination between PEAE and non-PEAE cases. Among the four LLMs evaluated, Gemma achieved the highest F1 score, suggesting the best overall balance between PPV and sensitivity in identifying PEAE cases from narrative EMR data. While all four models achieved high specificity and sensitivity, the low prevalence of PEAE contributed to a relatively high number of false positives, leading to a lower PPV. This suggests that these models cannot be relied upon solely to predict or select PEAE cases. Further chart review is necessary to confirm PEAE after initial LLM screening. Looking ahead, refining LLMs to enhance precision in detecting PEAE and other adverse events from EMRs will be essential as EMRs continue to be widely implemented worldwide.

Author contributions

Cheligeer Cheligeer (Writing – original draft, Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization), Danille A. Southern (Funding acquisition, Project administration, Validation, Writing – review & editing), Jun Yan (Methodology, Validation, Writing – review & editing), Jie Pan (Methodology, Writing – review & editing), Seungwon Lee (Data curation, Writing – review & editing), Elliot Martin (Data curation, Writing – review & editing), Hamed Jafarpour (Methodology, Writing – review & editing), Cathy Eastwood (Funding acquisition, Project administration, Resources, Writing – review & editing), Yong Zeng (Conceptualization, Funding acquisition, Methodology, Resources, Supervision, Writing – review & editing), and Hude Quan (Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing – review & editing)

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

This work was supported by the Canadian Institutes of Health Research (grant number DC0190GP).

Conflicts of interest

The authors declare no competing interests.

Data availability

Owing to data-sharing policies of the data custodians, the dataset cannot be made publicly available. The complete code for this study is available from the corresponding author upon academically reasonable request.

Ethics approval and consent to participate

This study was approved by the Conjoint Health Research Ethics Board at the University of Calgary (REB21-0416). Patient consent was waived as part of the ethics board review process.

Consent for publication

Not applicable as this manuscript does not contain any individual person’s data in any form that could be used to identify them.

References

1. Ford E, Carroll JA, Smith HE, Scott D, Cassell JA. Extracting information from the text of electronic medical records to improve case detection: a systematic review. J Am Med Inform Assoc. 2016;23:1007-1015.
2. Goldstein SL, Kirkendall E, Nguyen H, et al. Electronic health record identification of nephrotoxin exposure and associated acute kidney injury. Pediatrics. 2013;132:e756-e767.
3. Cheligeer C, Wu G, Lee S, et al. BERT-based neural network for inpatient fall detection from electronic medical records: retrospective cohort study. JMIR Med Inform. 2024;12:e48995.
4. Zhao B, Zhang R, Chen D, et al. A machine-learning-based approach for identifying diagnostic errors in electronic medical records. IEEE Trans Rel. 2024;73:1172-1186.
5. Quan H, Khan N, Hemmelgarn BR, et al.; Hypertension Outcome and Surveillance Team of the Canadian Hypertension Education Programs. Validation of a case definition to define hypertension using administrative data. Hypertension. 2009;54:1423-1428.
6. Xu Y, Lee S, Martin E, et al. Enhancing ICD-code-based case definition for heart failure using electronic medical record data. J Card Fail. 2020;26:610-617.
7. Fleischmann-Struzek C, Thomas-Rüddel DO, Schettler A, et al. Comparing the validity of different ICD coding abstraction strategies for sepsis case identification in German claims data. PLoS One. 2018;13:e0198847.
8. Jolley RJ, Quan H, Jetté N, et al. Validation and optimisation of an ICD-10-coded case definition for sepsis using administrative health data. BMJ Open. 2015;5:e009487.
9. McBrien KA, Souri S, Symonds NE, et al. Identification of validated case definitions for medical conditions used in primary care electronic medical record databases: a systematic review. J Am Med Inform Assoc. 2018;25:1567-1578.
10. Jette N, Reid AY, Quan H, Hill MD, Wiebe S. How accurate is ICD coding for epilepsy? Epilepsia. 2010;51:62-69.
11. Sheikhalishahi S, Miotto R, Dudley JT, et al. Natural language processing of clinical notes on chronic diseases: systematic review. JMIR Med Inform. 2019;7:e12239.
12. Koleck TA, Dreisbach C, Bourne PE, Bakken S. Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review. J Am Med Inform Assoc. 2019;26:364-379.
13. Sandhu N, Krusina A, Quan H, et al. Automated extraction of weight, height, and obesity in electronic medical records are highly valid. Obes Sci Pract. 2024;10:e705.
14. Tapson VF. Acute pulmonary embolism. N Engl J Med. 2008;358:1037-1052.
15. Thrombosis Canada. Pulmonary Embolism: Diagnosis and Management. Thrombosis Canada; 2013. https://thrombosiscanada.ca/guides/pdfs/PE.pdf
16. Melton GB, Hripcsak G. Automated detection of adverse events using natural language processing of discharge summaries. J Am Med Inform Assoc. 2005;12:448-457.
17. Llama Team, AI @ Meta. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. 2024. https://arxiv.org/abs/2407.21783
18. Jiang AQ, Sablayrolles A, Mensch A, et al. Mistral 7B. arXiv preprint arXiv:2310.06825. 2023. https://arxiv.org/abs/2310.06825
19. Mesnard T, Hardin C, Dadashi R, et al. Gemma: open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295. 2024. https://arxiv.org/abs/2403.08295
20. Abdin M, Aneja J, Awadalla H, et al. Phi-3 technical report: a highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219. 2024. https://arxiv.org/abs/2404.14219
21. Wu G, Eastwood C, Sapiro N, et al. Achieving high inter-rater reliability in establishing data labels: a retrospective chart review study. BMJ Open Qual. 2024;13:e002722.
22. Wu G, Eastwood C, Zeng Y, et al. Developing EMR-based algorithms to identify hospital adverse events for health system performance evaluation and improvement: study protocol. PLoS One. 2022;17:e0275250.
23. Friedman SM, Provan D, Moore S, Hanneman K. Errors, near misses and adverse events in the emergency department: what can patients tell us? CJEM. 2008;10:421-427.
24. Heit JA, Melton LJ, Lohse CM, et al. Incidence of venous thromboembolism in hospitalized patients vs community residents. Mayo Clin Proc. 2001;76:1102-1110.
25. Gee E. The National VTE Exemplar Centres Network response to implementation of updated NICE guidance: venous thromboembolism in over 16s: reducing the risk of hospital-acquired deep vein thrombosis or pulmonary embolism (NG89). Br J Haematol. 2019;186:792-793.
26. Li HL, Zhang H, Chan YC, Cheng SW. Prevalence and risk factors of hospital acquired venous thromboembolism. Phlebology. 2024:2683555241297566.
27. Levy M, Jacoby A, Goldberg Y. Same task, more tokens: the impact of input length on the reasoning performance of large language models. In: Ku L-W, Martins A, Srikumar V, eds. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2024 Aug; Bangkok, Thailand. Association for Computational Linguistics; 2024:15339-15353.
28. Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019 Nov; Hong Kong, China. Association for Computational Linguistics; 2019:3982-3992.
29. Devlin J, Chang M-W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein J, Doran C, Solorio T, eds. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); 2019 Jun; Minneapolis, MN. Association for Computational Linguistics; 2019:4171-4186.
30. Luo K, Liu Z, Xiao S, et al. Landmark embedding: a chunking-free embedding method for retrieval augmented long-context large language models. In: Ku L-W, Martins A, Srikumar V, eds. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2024 Aug; Bangkok, Thailand. Association for Computational Linguistics; 2024:3268-3281.
31. Alsentzer E, Murphy J, Boag W, et al. Publicly available clinical BERT embeddings. In: Rumshisky A, Roberts K, Bethard S, Naumann T, eds. Proceedings of the 2nd Clinical Natural Language Processing Workshop; 2019 Jun; Minneapolis, MN. Association for Computational Linguistics; 2019:72-78.
32. Wang L, Yang N, Huang X, et al. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533. 2022. https://arxiv.org/abs/2212.03533
33. Li Z, Zhang X, Zhang Y, et al. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281. 2023. https://arxiv.org/abs/2308.03281
34. Song K, Tan X, Qin T, Lu J, Liu T-Y. MPNet: masked and permuted pre-training for language understanding. Adv Neural Inform Process Syst. 2020;33:16857-16867.
35. Li X, Li J. AnglE-optimized text embeddings. arXiv preprint arXiv:2309.12871. 2023. https://arxiv.org/abs/2309.12871
36. Wimsett J, Harper A, Jones P. Review article: components of a good quality discharge summary: a systematic review. Emerg Med Australas. 2014;26:430-438.
37. Kind AJH, Smith MA. Documentation of mandated discharge summary components in transitions from acute to subacute care. In: Henriksen K, Battles JB, Keyes MA, Grady ML, eds. Advances in Patient Safety: New Directions and Alternative Approaches. Vol 2: Culture and Redesign. Rockville, MD: Agency for Healthcare Research and Quality; 2008:188-197.
38. Dean SM, Gilmore-Bykovskyi A, Buchanan J, Ehlenfeldt B, Kind AJ. Design and hospital-wide implementation of a standardized discharge summary in an electronic health record. Jt Comm J Qual Patient Saf. 2016;42:555-561.
39. Weetman K, Spencer R, Dale J, Scott E, Schnurr S. What makes a "successful" or "unsuccessful" discharge letter? Hospital clinician and General Practitioner assessments of the quality of discharge letters. BMC Health Serv Res. 2021;21:349.
40. Python Software Foundation. Python 3.11.0. Wilmington, DE: Python Software Foundation; 2022. https://www.python.org/downloads/release/python-3110/
41. Paszke A, Gross S, Massa F, et al. PyTorch: an imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R, eds. Advances in Neural Information Processing Systems 32 (NeurIPS 2019). Curran Associates, Inc.; 2019:8024-8035.
42. Wolf T, Debut L, Sanh V, et al. Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations; 2020 Oct; Online. Association for Computational Linguistics; 2020:38-45. https://aclanthology.org/2020.emnlp-demos.6/
43. Wei J, Wang X, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models. In: Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Advances in Neural Information Processing Systems 35 (NeurIPS 2022); 2022 Dec; New Orleans, LA. Curran Associates, Inc.; 2022:24824-24837.
44. Agrawal M, Hegselmann S, Lang H, Kim Y, Sontag D. Large language models are few-shot clinical information extractors. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; 2022 Dec; Abu Dhabi, United Arab Emirates. Association for Computational Linguistics; 2022:1998-2022.
45. Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems 33 (NeurIPS 2020); 2020 Dec; Vancouver, Canada. Curran Associates, Inc.; 2020:1877-1901.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
