Abstract

Objective

We have previously developed a natural language processing pipeline using clinical notes written by epilepsy specialists to extract seizure freedom, seizure frequency text, and date of last seizure text for patients with epilepsy. It is important to understand how our methods generalize to new care contexts.

Materials and methods

We evaluated our pipeline on unseen notes from nonepilepsy-specialist neurologists and non-neurologists without any additional algorithm training. We tested the pipeline out-of-institution using epilepsy specialist notes from an outside medical center with only minor preprocessing adaptations. We examined reasons for discrepancies in performance in new contexts by measuring physical and semantic similarities between documents.

Results

Our ability to classify patient seizure freedom decreased by at least 0.12 agreement when moving from epilepsy specialists to nonspecialists or other institutions. On notes from our institution, textual overlap between the extracted outcomes and the gold standard annotations obtained from manual chart review decreased by at least 0.11 F1 when an answer existed but did not change when no answer existed; on these extraction tasks, our models generalized well to notes from the outside institution, losing at most 0.02 agreement. We analyzed textual differences and found that syntactic and semantic differences in both clinically relevant sentences and surrounding contexts significantly influenced model performance.

Discussion and conclusion

Model generalization performance decreased on notes from nonspecialists; out-of-institution generalization on epilepsy specialist notes required small changes to preprocessing but was especially good for seizure frequency text and date of last seizure text, opening opportunities for multicenter collaborations using these outcomes.

Lay Summary

We previously made a computer program that understands doctors’ notes about epilepsy patients. It can figure out whether a patient had seizures, how often they occurred, and when the last seizure happened. We wanted to know how well our program works in new situations, so we tested it on notes from our hospital written by nonepilepsy doctors, and on notes from epilepsy doctors at another hospital system. We found that our tool’s ability to say whether a patient was seizure free decreased on notes from nonepilepsy doctors at our hospital and from epilepsy doctors at the other hospital. However, when it came to finding the frequency of seizures and the date of the last seizure, the program’s performance varied depending on the situation in notes from our hospital but did not change in notes from the other hospital system. We also looked into why our tool’s performance decreased and found that it was related to differences in how doctors wrote their notes and what information they included in them. Our findings show that our models can be used in different hospitals to get seizure frequency and date of last seizure from notes.

INTRODUCTION

The electronic health record (EHR) holds a wealth of clinical information with both breadth, covering the potentially hundreds of thousands of patients that visit a health system, and depth, tracking individual patients over their time within the system. Thus, the EHR is especially well suited for retrospective clinical studies: Clinicians can use the EHR to follow patient cohorts backwards through time, often at less cost and time than prospective clinical studies.1–4 Furthermore, EHRs from multiple institutions can be pooled together to drive large-scale multicenter research.5

Much of the information in the EHR is stored in the raw text of clinic notes written by healthcare providers during and following an office visit or procedure. Natural language processing (NLP), a field of machine learning dedicated to training machines to understand human language, can be used to mine information from these clinic notes. However, text is inherently heterogeneous—each provider has their own writing style, each discipline their own vocabulary and knowledge, and each institution their own format and biases.6,7 These differences necessitate methods capable of adapting to previously unseen care settings with minimal loss of accuracy and preclude rules-based and traditional machine learning-based NLP approaches.8 In contrast, Transformer language models, including OpenAI’s ChatGPT, have shown remarkable generalizability across multiple tasks and domains, often with minimal manual intervention.9–12 We have previously developed and validated an NLP pipeline that finetuned and applied BERT-like transformer models to extract 3 critical outcome measures—seizure freedom, seizure frequency text, and time since last seizure text—from clinic notes describing patients with epilepsy,13,14 and have used it to better characterize long-term epilepsy outcome dynamics.15 Although our methods have achieved up to human-like performance within internal test data, we had yet to evaluate our methods’ ability to generalize to new contexts.

Here, we expand upon our previous study and investigate how well our pipeline extracts key epilepsy outcome measures from unseen data. We first tested within-institution generalizability by evaluating performance on clinic notes from nonepilepsy departments without any additional algorithm finetuning or modification; we expected these new notes to have similar formatting as the previously studied epilepsy notes, but potentially new styles, vocabulary, and context. We then tested out-of-institution generalizability by evaluating performance on epileptologist notes from the University of Michigan, again without any additional model finetuning, but with modifications to preprocessing. We expected these out-of-institution notes to have similar vocabulary and context as our epilepsy notes, but potentially different styles and formats. We discuss potential future improvements and possibilities for use in multicenter research.

METHODS

Penn document curation and annotation

Our source dataset included 78 844 progress notes for patients with epilepsy who visited the University of Pennsylvania Health System between January 1, 2015, and December 31, 2018. In this manuscript, we refer to neurologists who were specialized in the management of epilepsy (epilepsy specialists) as “Epileptologists,” neurologists who were not epilepsy specialists as “Neurologists,” and all other clinical providers as “Non-neurologists.” As previously described,13 we had annotated 1000 notes written by epileptologists for 3 epilepsy outcome measures: seizure freedom, seizure frequency, and date of last seizure. To summarize the annotation protocol, annotators marked the first span of text that suggested a patient was seizure free or having recent seizures, and all spans of text that described the patient’s seizure frequency or date of last seizure. Documents were annotated in triplicate, and merged through majority voting with manual adjudication of disagreements. We used 700 annotated notes as training data and 300 notes as test data. Here, we created 2 new datasets to evaluate generalization performance within our institution. We identified 100 notes written by neurologists, and 100 notes written by non-neurologists, that contained the word “seizure” within them. Two authors, KX and CAE, each manually annotated these 200 notes for the same 3 epilepsy outcome measures, using the same annotation protocol as in the previous study. As before, we merged annotations and resolved disagreements through manual adjudication. The number of notes chosen was based upon our prior results indicating that model performance plateaued when trained on 100 sample notes.

Penn pipeline preprocessing and evaluation

Our NLP pipeline was previously described in detail.13,14 Briefly, we preprocessed the notes by isolating the History of Present Illness and/or Interval History sections from the full note text. We then truncated these sections to 1506 characters to fit within the token limit of our BERT models and formatted them for use with the Huggingface Python library.16 We ran these preprocessed paragraphs through our NLP pipeline with 5 random seeds, and extracted 5 sets of seizure freedom, seizure frequency text, and date of last seizure text from each datum. Our pipeline treated seizure freedom as a trinary classification task, where the patient was either seizure free, having seizures, or no classification could be made with the given information. We classified visits as “seizure free” if the patient had not had a seizure in the past year or since their last visit, whichever was more recent. We extracted seizure frequency text and date of last seizure text through a text extraction task, where the models attempted to find the span of text where the outcome measure was described (eg, “patient had 3 seizures in the last 6 months”). We defined a “seizure frequency” as any statement from which one could compute a positive quantity representing the number of seizures per unit time (eg, “3 seizures in the last 6 months”); we defined a “date of last seizure” as any statement to which a date could be assigned (eg, “last seizure was in 2015”).
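
To make the preprocessing step concrete, the sketch below shows one way the section isolation and truncation described above could be implemented. The regular expression, function name, and toy note are illustrative assumptions, not the pipeline's actual code.

```python
# Minimal sketch of the note preprocessing described above; the header pattern and
# helper name are illustrative, not the authors' actual implementation.
import re

SECTION_PATTERN = re.compile(
    r"(history of present illness|interval history)\s*:?", re.IGNORECASE
)
MAX_CHARS = 1506  # truncation length that fits within the BERT token limit


def isolate_history_section(note_text: str) -> str:
    """Return up to MAX_CHARS starting at the HPI/Interval History header,
    or the start of the note if no header is found."""
    match = SECTION_PATTERN.search(note_text)
    start = match.start() if match else 0
    return note_text[start:start + MAX_CHARS]


# Example usage with a toy note
note = "History of Present Illness: Patient reports 3 seizures in the last 6 months..."
print(isolate_history_section(note))
```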

Contributors to generalization performance

We next tried to identify contributors to generalization performance. We hypothesized that changes in performance were due to 2 factors. First, there may be differences between how epileptologists and nonepileptologists present relevant information, specifically in how they write it. For example, an epileptologist may write “patient reports focal impaired awareness seizure last Sunday, no convulsion,” while a non-neurologist may write “patient reports a seizure on Sunday,” thus missing the specific seizure type. In this study, we called the sentences containing such germane information the “Kernel” of the passage; the Kernel is also the target of our text extraction tasks.

Second, there may be differences in how the contextual information surrounding the Kernel is presented. An epileptologist’s note largely contains information related to epilepsy and a patient’s epilepsy history. However, a neurologist note may mention a patient’s epilepsy but may also discuss other neurological disorders; a non-neurologist’s note may briefly mention epilepsy or seizures, before proceeding to other topics, like cardiovascular problems for a cardiologist. We call this surrounding contextual information the “Surrounding Context.”

We quantified these differences using cosine similarity and Levenshtein similarity.17 Cosine similarity, when combined with sentence embeddings, measures how similar in meaning 2 sentences are to each other—a semantic measure that estimates how close 2 sentences are in context and meaning. Sentence embeddings use neural networks to map sentences into high-dimensional spaces such that 2 similar concepts are located near each other.18 Levenshtein similarity is based on how many insertions, deletions, or substitutions are required to change one sentence into another—a “physical” measure that estimates how differently written 2 sentences are.
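
As a concrete illustration, the sketch below computes both metrics for the example kernels given earlier. The use of the python-Levenshtein `ratio` function and the specific sentence-transformers checkpoint ("all-MiniLM-L6-v2") are assumptions for illustration; the exact functions and pretrained model are not restated here.

```python
# Illustrative computation of the 2 similarity measures on a pair of kernels.
import Levenshtein
from sentence_transformers import SentenceTransformer, util

kernel_a = "Patient reports focal impaired awareness seizure last Sunday, no convulsion."
kernel_b = "Patient reports a seizure on Sunday."

# Levenshtein similarity: normalized edit distance ("physical" difference in writing)
lev_sim = Levenshtein.ratio(kernel_a, kernel_b)

# Cosine similarity of sentence embeddings (semantic closeness in meaning)
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
emb_a, emb_b = encoder.encode([kernel_a, kernel_b])
cos_sim = util.cos_sim(emb_a, emb_b).item()

print(f"Levenshtein similarity: {lev_sim:.2f}, cosine similarity: {cos_sim:.2f}")
```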

Using these metrics, we computed the pairwise similarity of each note in our 3 categories of within-institution test sets (300 epileptologist, 100 neurologist, and 100 non-neurologist notes) to each of the 700 epileptologist notes in the training set. For each pair of notes, we computed 4 separate similarity measures: Levenshtein-kernel, Levenshtein-context, cosine-kernel, and cosine-context. We generated the sentence embeddings for the cosine similarity metric using pretrained sentence-transformers.19 As a reference for comparison, we also computed the similarity of each pair of notes within the training set, using the same methods.
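
The pairwise comparison can be sketched as follows for the Levenshtein-kernel measure; the function and variable names are illustrative, and the other 3 measures follow by swapping in the corresponding similarity function.

```python
# Sketch of the test-vs-training pairwise comparison for one similarity measure.
import numpy as np
import Levenshtein


def pairwise_levenshtein(test_kernels, train_kernels):
    """Return a (len(test_kernels), len(train_kernels)) matrix of Levenshtein similarities."""
    matrix = np.zeros((len(test_kernels), len(train_kernels)))
    for i, test_kernel in enumerate(test_kernels):
        for j, train_kernel in enumerate(train_kernels):
            matrix[i, j] = Levenshtein.ratio(test_kernel, train_kernel)
    return matrix


# Downstream analyses (see "Statistical analysis") summarize each test note by the
# mean of its row, ie, its mean similarity to the 700 training notes.
```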

Michigan document curation, preprocessing, and evaluation

We gathered 100 random outpatient neurology visit notes, written by an epileptologist, of adult patients with epilepsy who visited the University of Michigan medical center between 2015 and 2019.20 These notes were formatted differently than Penn notes, including, for example, notes that had no explicit sections for “History of Present Illness” and “Interval History.” We modified our preprocessing steps to search for the following keywords and phrases indicative of the beginning of epilepsy history: “the patient was last seen,” “interval history,” “interval events,” “hpi,” and “history of present illness.” We took 1506 characters, beginning with the sentence that contained the identified keywords or phrases. If no keywords were found, we took the first 1506 characters of the document. We formatted the extracted paragraphs for Huggingface and ran them through our pipeline. All code and analysis that used these notes from the University of Michigan were run from within that institution in a federated manner; we did not transfer data or merge datasets between institutions to protect patient health information. We did not perform textual similarity analysis on these notes for the same reason.
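
A minimal sketch of this modified preprocessing is shown below, assuming a case-insensitive keyword search and a sentence boundary marked by the preceding period; the helper name and these details are illustrative rather than the exact implementation.

```python
# Sketch of the keyword-based section isolation used for Michigan-style notes.
KEYWORDS = [
    "the patient was last seen",
    "interval history",
    "interval events",
    "hpi",
    "history of present illness",
]
MAX_CHARS = 1506


def isolate_michigan_history(note_text: str) -> str:
    """Take MAX_CHARS starting at the sentence containing the earliest keyword,
    or the first MAX_CHARS of the note if no keyword is found."""
    lowered = note_text.lower()
    hits = [lowered.find(kw) for kw in KEYWORDS if kw in lowered]
    if hits:
        # back up to the start of the sentence containing the earliest keyword
        start = note_text.rfind(".", 0, min(hits)) + 1
    else:
        start = 0
    return note_text[start:start + MAX_CHARS].strip()
```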

Statistical analysis

To assess inter-rater reliability of human annotations for the 100 neurologist and 100 non-neurologist notes, we used Cohen’s Kappa21 for seizure freedom annotations, and the text overlap F1 score for seizure frequency text and date of last seizure text annotations. The Cohen’s Kappa (κ) score measures the agreement between annotators offset by their expected chance agreement; a κ greater than 0.61 indicates “substantial” agreement, and greater than 0.81 indicates “near perfect” agreement.22 The F1 score measures text overlap, accounting for both the presence and order of characters and/or words, such that an F1 of 1.0 indicates identical spans of text. We calculated F1 in 2 ways: a paired F1, which measures overlap when 2 annotators agree on some amount of text, and an overall F1, which includes all annotated spans of text of an outcome measure in each note, including cases where one annotator marked a span of text but the other did not. We compared these results to the same metrics of inter-rater reliability from our annotations of the epileptologist notes, which were previously reported.13
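
For reference, the sketch below shows one way these 2 reliability metrics can be computed. The bag-of-tokens (SQuAD-style) formulation of the overlap F1 is an assumption about the exact overlap definition, and both helper names are illustrative.

```python
# Illustrative implementations of the 2 inter-rater metrics.
from collections import Counter


def overlap_f1(span_a: str, span_b: str) -> float:
    """Token-overlap F1 between 2 annotated spans (SQuAD-style, assumed formulation)."""
    tokens_a, tokens_b = span_a.lower().split(), span_b.lower().split()
    common = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if common == 0:
        return 0.0
    precision = common / len(tokens_a)
    recall = common / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for 2 annotators' categorical labels."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_observed - p_expected) / (1 - p_expected)


# Toy usage
print(overlap_f1("last seizure was in 2015", "seizure in 2015"))
print(cohens_kappa(["seizure-free", "having-seizures", "unclassified", "seizure-free"],
                   ["seizure-free", "having-seizures", "seizure-free", "seizure-free"]))
```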

Next, we matched the pipeline’s predictions against the gold-standard annotations and used percent agreement to evaluate classification of seizure freedom, and text overlap F1 to evaluate text extraction performance. We defined agreement as the number of times the pipeline correctly captured the relevant information in the ground truth, divided by the total number of examples; in classification tasks, agreement is equivalent to accuracy. However, we use agreement even in classification tasks to maintain a single concise metric of model performance across all our evaluations. If multiple correct answers existed in the gold standard, we took the best-matching answer. We then tested for differences between pipeline performances on the original 300 epileptologist validation notes and the new 100 neurologist and 100 non-neurologist notes using 2-sided Mann-Whitney U-tests with the null hypothesis that the performances on the 3 datasets were identical (came from the same distribution). As a reference, on our epileptologist validation notes, our human annotators previously attained a median classification agreement of 0.92 (range 0.65–0.98), and median text extraction F1 of 0.89 for seizure frequency text (range 0.73–0.96) and 0.78 for date of last seizure text (range 0.60–0.92).13
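
The evaluation and comparison steps can be sketched as follows. The per-seed scores are made-up placeholders rather than the study's results, and the helper name is illustrative.

```python
# Sketch of agreement scoring and the 2-sided Mann-Whitney U-test comparison.
import numpy as np
from scipy.stats import mannwhitneyu


def agreement(predictions, gold_standard) -> float:
    """Fraction of notes where the prediction captures the ground-truth information."""
    matches = [pred == gold for pred, gold in zip(predictions, gold_standard)]
    return float(np.mean(matches))


# Hypothetical per-seed agreement scores on 2 note sets (5 random seeds each)
epileptologist_scores = [0.86, 0.84, 0.85, 0.87, 0.85]
neurologist_scores = [0.70, 0.68, 0.71, 0.69, 0.67]

statistic, p_value = mannwhitneyu(
    epileptologist_scores, neurologist_scores, alternative="two-sided"
)
print(f"U = {statistic}, P = {p_value:.4f}")
```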

To explore how similarity measures influenced model performance, we tested for univariate associations between these measures and classification correctness. For each document in a testing set, we calculated the mean of its pairwise similarity to each of the 700 training notes. We did this separately for each of the 4 similarity measures (Levenshtein-kernel, Levenshtein-context, cosine-kernel, cosine-context), such that each note in the 3 test sets had 4 separate mean similarity scores. We then split these measures by whether the model made correct seizure freedom predictions on the testing documents and used 2-sided Mann-Whitney U-tests to compare the similarity to the training set in the correct versus incorrect distributions. We performed a similar analysis on the text extraction tasks, comparing the similarity of testing documents to the training set in cases where the F1 score was at least 0.50 versus less than 0.50.

Next, we compared the distributions of pairwise similarity to the training set for our 3 test sets (epileptologist, neurologist, non-neurologist) and for each of the 4 similarity metrics. We used Q-Q plots to compare each distribution against the reference distribution of pairwise similarity scores within the training set. As suggested in Lin et al, we forwent statistical tests and P-values when comparing these distributions, as P-values can become arbitrarily small with large sample sizes (ie, there are 700² pairwise comparisons within the training set alone).23
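
The Q-Q comparison can be sketched as below with simulated data standing in for the similarity distributions; the use of matplotlib and the beta-distributed placeholders are assumptions for illustration only.

```python
# Sketch of a Q-Q comparison of a test set's similarity scores against the
# within-training-set reference distribution, using simulated stand-in data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
reference = rng.beta(5, 2, size=10000)  # stand-in for within-training-set similarities
test_set = rng.beta(4, 2, size=1000)    # stand-in for one test set's similarities to training

quantiles = np.linspace(0.01, 0.99, 99)
ref_q = np.quantile(reference, quantiles)
test_q = np.quantile(test_set, quantiles)

plt.plot(ref_q, test_q, marker=".", linestyle="none", label="test set vs reference")
plt.plot(ref_q, ref_q, color="black", label="identity (y = x)")
plt.xlabel("Reference (training set) similarity quantiles")
plt.ylabel("Test set similarity quantiles")
plt.legend()
plt.show()
```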

To evaluate pipeline performance on notes from Michigan, we calculated percent agreement scores for all 3 outcome measures. We used agreement instead of the text overlap F1 score for evaluating the extraction of seizure frequency and date of last seizure texts because the notes from Michigan lacked annotations. Instead, author SWT manually reviewed the pipeline’s output for agreement with the corresponding relevant information in each note under our definitions of seizure freedom, frequency, and date of last seizure. We used a single seed instead of 5 to minimize the amount of manual review.

We used Python with the Huggingface Transformers, Huggingface Datasets, Sentence-Transformers, Numpy, Pandas, Scipy, Statsmodels, and Levenshtein packages. Our models and code are on the Huggingface Hub and Github at https://huggingface.co/CNT-UPenn, and https://github.com/penn-cnt/generalization_of_nlp_in_clinical_contexts, respectively. We do not share our data to protect patient privacy, as our data are not deidentified.

RESULTS

Inter-rater reliability and gold-standard annotations

Inter-rater agreement of the 2 human annotators on notes from epileptologists, neurologists, and non-neurologists is shown in Table 1. Overall F1 scores were lower than paired F1 scores, indicating that in the text extraction tasks some spans were annotated by one annotator but not the other, and emphasizing the importance of our manual annotation adjudication. More critically, there was no major difference in human annotators’ interpretation of these categories of notes. The underlying statistics of the adjudicated annotations differed between the epileptologist, neurologist, and non-neurologist notes (Table 2). Overall, documentation of seizure freedom, frequency, and date of last seizure became sparser as one stepped away from epileptology.

Table 1.

Annotator agreement

Metric | Epileptologist notes (mean)a | Neurologist notes | Non-neurologist notes
Seizure Freedom Classification Agreement (κ) | 0.82 | 0.82 | 0.74
Overall Extraction Agreement (F1) | 0.44 | 0.47 | 0.42
Paired Extraction Agreement (F1) | 0.79 | 0.85 | 0.93

a From our previous study.13

Table 2.

Statistics of notes from the University of Pennsylvania

Gold standard annotations | Epileptologist (n = 1000, 700 training, 300 validation)a | Neurologist (n = 100) | Non-neurologist (n = 100)
Classification
 Seizure-free | 30% | 33% | 30%
 Not seizure-free | 62% | 47% | 35%
 Unclassified | 8% | 20% | 35%
Note contained seizure frequency text | 36% | 14% | 7%
Note contained date of last seizure text | 50% | 48% | 36%

a From our previous study.13

Pipeline performance across author specialties

This analysis compared notes from epileptologists, neurologists, and non-neurologists within a single institution. For the classification of seizure freedom, agreement between algorithm predictions and human annotations decreased by 0.17 between epileptologists and neurologists, and decreased further by 0.12 between neurologists and non-neurologists. For extraction of seizure frequency text, model performance (F1) increased by 0.12 between epileptologists and non-neurologists, and by 0.09 between neurologists and non-neurologists. For extraction of date of last seizure text, model performance (F1) decreased by 0.07 between epileptologists and neurologists, and decreased by 0.09 between epileptologists and non-neurologists (Figure 1A).

Figure 1. Generalization of the NLP pipeline on classifying patient seizure freedom and extracting seizure frequency text and date of last seizure text. (A) Classification agreement with ground truth annotations and text overlap F1 scores on respective outcome measures on epileptologist (red dots), neurologist (blue dots), and non-neurologist (yellow dots) notes using 5 random seeds. Significant differences with P < .0167 were found using 2-sided Mann-Whitney U-tests and are marked with an asterisk. (B) Pipeline text overlap F1 scores, separated by whether an outcome measure was present in the note, combining the seizure frequency and date of last seizure text extraction tasks.

Performance of these text extraction tasks varied according to whether an answer was present in the note (Figure 1B). The algorithm detected that no answer was present equally well across specialties. However, when an answer did exist, the pipeline’s ability to match the answer’s text significantly decreased for neurologists and non-neurologists compared to epileptologists. Therefore, the apparent increase in overall model performance for seizure frequency text was likely driven by the higher proportion of nonspecialist notes that contained no answer (Table 2).

Model performance and similarity to training set notes

We hypothesized that notes on which the pipeline performed better were more similar to the training set than notes on which it performed worse. For the classification of seizure freedom task, this hypothesis was supported for 3 of 4 similarity measures (Figure 2A): Levenshtein similarity of kernels (P = .012), Levenshtein similarity of contexts (P = 3×10−8), and cosine similarity of contexts (P = 1×10−6). Meanwhile, for our 2 text extraction tasks, this hypothesis was also supported for 3 of 4 similarity measures (Figure 2B): Levenshtein similarity of kernels (P = 2×10−7), cosine similarity of kernels (P = 6×10−6), and Levenshtein similarity of contexts (P = .007). These findings suggest that the pipeline performed better when notes were more similar to the training set. Specifically, for the classification task, this effect was stronger for the surrounding context than for the kernel (target) text; in the text extraction tasks, the opposite was seen—the effect was stronger for the kernel text than for the surrounding context.

Figure 2. Comparison of text similarity measures. (A) Text similarity of the Kernels and Contexts, stratified by whether the model made a correct seizure freedom classification. (B) Text similarity, stratified by whether the model made a prediction with at least 0.5 text overlap F1 with the ground truth annotation.

Next, we examined the distributions of similarity to the training set for our 3 test sets: epileptologists, neurologists, and non-neurologists. We used Q-Q plots to compare quantiles of each test set to the similarity within the training set (Figure 3). The pattern of results varied across the different similarity measures. Levenshtein similarities (a measure of how similarly notes are written, regardless of semantic content) for the neurologist and non-neurologist notes diverged from the training set distribution in the upper quantiles, indicating lower maximum similarity values; this was true for both the kernels and the contexts. Cosine similarities, a measure of semantic similarity, showed that kernels had slightly higher similarity values around the median for all 3 test sets compared to the training set, but similar values in the tails of all distributions. In contrast, cosine similarity of contexts showed clear shifts of the entire distributions, with the magnitude of shift increasing stepwise from epileptologists to neurologists to non-neurologists. This indicates that the most consistent textual difference across note categories was in the semantic content of the context surrounding the target kernels.

Figure 3. Q-Q plots of the textual similarity in the reference epileptologist training set (x-axis) against the epileptologist validation set (blue), neurologist set (yellow), and non-neurologist set (green) on the y-axis. The black line indicates the identity line y = x. The dashed lines indicate the medians of each distribution: similarity within the reference epileptologist training set (red), epileptologist validation set (blue), neurologist set (yellow), and non-neurologist set (green). Panels: (A) Levenshtein similarity of kernels. (B) Cosine similarity of kernels. (C) Levenshtein similarity of surrounding contexts. (D) Cosine similarity of surrounding contexts.

Evaluating out-of-institution generalization performance

Epilepsy notes from the University of Michigan had a similar distribution of annotations to the Penn epileptologist notes: 23% were seizure free, 63% had recent seizures, and 14% were unclassifiable. Thirty-three percent of notes contained seizure frequency text and 51% contained date of last seizure text. On manual review, 85% of the passages isolated by our preprocessing contained all of the correct information. Overall, our pipeline’s classification performance decreased by 0.12 agreement when making predictions on Michigan epilepsy notes (Figure 4). Our pipeline’s ability to extract seizure frequency and date of last seizure texts remained constant between institutions, achieving 0.89 and 0.87 agreement on Michigan notes, respectively, versus 0.90 and 0.88 agreement on Penn epileptologist notes.

Figure 4. Generalization of our NLP pipeline on epileptologist, neurologist, and non-neurologist notes from the University of Pennsylvania, and on epileptologist notes from the University of Michigan, measured as agreement between algorithm predictions and human-annotated ground truth.

DISCUSSION

In this study, we evaluated the generalization performance of our NLP pipeline. We tested its ability to accurately extract seizure outcome measures in new datasets without additional finetuning. We found that it was less accurate in classifying seizure freedom in notes written by non-epilepsy specialists. However, it was able to extract seizure frequency text and date of last seizure text from epilepsy notes from the University of Michigan with agreement to the ground truth comparable to epileptologist notes from Penn. Declines in performance in notes written by nonspecialists in epilepsy were associated with different distributions of ground-truth answers across note categories, and with increasing semantic and syntactic distance from the training set.

Overall, these findings indicate that our models, which were trained and validated on clinical notes from a particular medical subspecialty, performed less well on notes written by providers outside that subspecialty, even when those notes addressed the same medical condition. The observation that our models generalized relatively well to subspecialist notes from a different institution suggests that the major driver of performance was the content of notes written by providers from different specialties. These changes in performance are consistent with previous studies. Machine learning algorithms perform worse in new contexts due to distribution shifts, when the underlying distribution of their training data becomes too different from the testing distribution,24–27 including in clinical domains.28–32 Pretrained Transformer models are more robust to these shifts, but still show generalization gaps and variable reactions to previously unseen data.33,34 Adding to these prior works, our study is among the first to evaluate the generalizability of such algorithms in the clinical domain using multiple sets of expert-annotated real-world data with tangible and impactful applications,15 rather than data prepared for competitions or benchmarking.

We analyzed textual similarity on 2 fronts using the Levenshtein and cosine similarities, applying these measures separately to the target text that the pipeline should focus on (the kernel) and to the surrounding content. Decreases in these similarities across provider types were associated with changes in model performance. Other groups have reported similar findings, but have primarily used only the cosine distance on documents as a whole to measure semantic similarity; furthermore, few groups have applied such methods to clinical problems, and none to epilepsy. For example, Khambete et al35 found that model performance in classifying clinical diagnosis sentiment had a significant negative linear correlation with median semantic cosine distance. More studies have exploited this notion of cosine similarity to improve model performance or in novel applications: Chang et al36 used Explicit Semantic Analysis37 with Wikipedia to generate semantic representations of labels and documents for dataless classification. Similarly, Haj-Yahia et al38 and Schopf et al39 semantically matched text to classification labels for unsupervised text classification. Meanwhile, Kongwudhikunakorn et al40 used word embeddings and the Word Mover’s Distance41 to accurately cluster documents.

Methods to improve generalizability are numerous, but have only sparsely been applied in the clinical domain. One method to combat distribution shift is to expand the training distribution by incorporating multiple training datasets;42,43 Hendrycks et al33 reasoned that pretrained transformers are already more robust to distribution shifts than previous neural models because their pretraining already covers diverse data. Santus et al28 found that models trained on reports from 4 hospitals outperformed models trained on only one. Similarly, Khambete et al35 found that models trained with sentences from multiple medical specialties outperformed models trained on just one. Other studies reframe the problem, improving generalization performance even in the absence of training data. For example, Yin et al44 proposed text classification as a textual entailment problem, where the question is converted into a statement (eg, “Does the patient have recent seizures?” becomes “The patient is having recent seizures”) and the model is tasked with identifying whether that statement is implied by or contradicted by the main paragraph. Similarly, Halder et al45 redesigned generic text classification as repeated true/false classification, where a model receives each label separately and in sequence, and must determine if that label is true given the paragraph. Alcoforado et al46 used topic modeling for text classification by finding what topics were in a given paragraph, and modeling how those topics entail specific labels.

Our study has some limitations. Our approach is agnostic to the type of seizure and provides only one of each outcome measure per note, potentially missing or confounding information from patients with multiple semiologies. From a clinical perspective, seizures also have varying severities, necessitating different responses; for example, bilateral tonic-clonic convulsions may require trips to the hospital, whereas a focal aware seizure may not. Our notes and models were affected by copy-forwarded information, where a note author copies previous notes into the current note, potentially introducing outdated or directly contradictory information. Our evaluation on notes from the University of Michigan also used a single seed to minimize the need for manual review and therefore lacks the statistical rigor of our other analyses. Furthermore, we required some additional tuning of the preprocessing steps when adapting our pipeline to the notes from the University of Michigan, which may limit the ease or speed with which it can be deployed at other institutions whose note format differs significantly from ours. Our investigation was focused on the immediate clinical utility of our pipeline in new care contexts and at new institutions, and as such we did not perform a more in-depth technical analysis comparing our methods with other few-shot and zero-shot models and approaches.

CONCLUSIONS

In conclusion, on nonepileptologist within-institution notes, our NLP pipeline showed decreased performance when classifying seizure freedom, and variable performance when extracting seizure frequency text and date of last seizure text on certain subsets. Our text extraction algorithm required some adaptation to identify the relevant text in out-of-institution epileptologist notes. However, after the correct text was identified, the decrease in classification performance was more modest, and text extraction performance did not change on this dataset. Generalization performance was affected by differences between each dataset and the original training and validation datasets. We plan to improve our overall generalization performance in the future to allow for large-scale multicenter studies.

FUNDING

This research was funded by the National Institute of Neurological Disorders and Stroke (DP1NS122038); by the National Institutes of Health (R01NS125137); by the Mirowski Family Foundation; by contributions from Neil and Barbara Smit; and by contributions from Jonathan and Bonnie Rothberg. R.S.G. was supported by the National Institute of Neurological Disorders and Stroke (T32NS091006). C.A.E. was supported by the National Institute of Neurological Disorders and Stroke of the National Institutes of Health (K23NS121520); by the American Academy of Neurology Susan S. Spencer Clinical Research Training Scholarship; and by the Mirowski Family Foundation. D.R.’s work was partially funded by the Office of Naval Research (Contract N00014-19-1-2620).

AUTHOR CONTRIBUTIONS

K.X., R.S.G., B.L., D.R., and C.A.E. designed the study. K.X., S.W.T., B.L., D.R., and C.A.E. implemented the study. R.S.G., C.E.H., and K.A.D. provided feedback on the methods, design, and manuscript of the study. All authors were involved in the drafting, editing, and approval of the final manuscript.

CONFLICTS OF INTEREST

None declared.

DATA AVAILABILITY

Our code is available at https://github.com/penn-cnt/generalization_of_nlp_in_clinical_contexts. Our models are available in Huggingface Hub at https://huggingface.co/CNT-UPenn.

REFERENCES

1. Cowie MR, Blomster JI, Curtis LH, et al. Electronic health records to facilitate clinical research. Clin Res Cardiol 2017;106(1):1–9.

2. Casey JA, Schwartz BS, Stewart WF, et al. Using electronic health records for population health research: a review of methods and applications. Annu Rev Public Health 2016;37:61–81.

3. Lee S, Xu Y, D’Souza AG, et al. Unlocking the potential of electronic health records for health research. Int J Popul Data Sci 2020;5(1):1123.

4. Toledano MB, Smith RB, Brook JP, et al. How to establish and follow up a large prospective cohort study in the 21st century—lessons from UK COSMOS. PLoS One 2015;10(7):e0131521.

5. Hripcsak G, Duke JD, Shah NH, et al. Observational health data sciences and informatics (OHDSI): opportunities for observational researchers. Stud Health Technol Inform 2015;216:574–8.

6. Patterson O, Hurdle JF. Document clustering of clinical narratives: a systematic study of clinical sublanguages. AMIA Annu Symp Proc 2011;2011:1099–107.

7. Sohn S, Wang Y, Wi C-I, et al. Clinical documentation variations and NLP system portability: a case study in asthma birth cohorts across institutions. J Am Med Inform Assoc 2018;25(3):353–9.

8. Deng L, Liu Y. A joint introduction to natural language processing and to deep learning. In: Deng L, Liu Y, eds. Deep Learning in Natural Language Processing. Springer; 2018:1–22.

9. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2017. https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html. Accessed April 12, 2022.

10. Alsentzer E, Murphy J, Boag W, et al. Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. Association for Computational Linguistics; 2019:72–8.

11. Devlin J, Chang M-W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Vol. 1 (Long and Short Papers). Association for Computational Linguistics; 2019:4171–86.

12. Liu Y, Ott M, Goyal N, et al. RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692 [cs]. Published online July 26, 2019. http://arxiv.org/abs/1907.11692. Accessed October 22, 2021.

13. Xie K, Gallagher RS, Conrad EC, et al. Extracting seizure frequency from epilepsy clinic notes: a machine reading approach to natural language processing. J Am Med Inform Assoc 2022;29(5):873–81.

14. Xie K, Litt B, Roth D, et al. Quantifying clinical outcome measures in patients with epilepsy using the electronic health record. In: Proceedings of the 21st Workshop on Biomedical Language Processing. Association for Computational Linguistics; 2022:369–75.

15. Xie K, Gallagher RS, Shinohara RT, et al. Long term epilepsy outcome dynamics revealed by natural language processing of clinic notes. Epilepsia 2023;64(7):1900–9.

16. Wolf T, Debut L, Sanh V, et al. Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics; 2020:38–45.

17. Levenshtein VI. Binary codes capable of correcting deletions, insertions and reversals. Sov Phys Dokl 1966;10:707.

18. Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space. 2013. https://doi.org/10.48550/arXiv.1301.3781. Accessed February 27, 2023.

19. Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics; 2019:3982–92.

20. Decker BM, Turco A, Xu J, et al. Development of a natural language processing algorithm to extract seizure types and frequencies from the electronic health record. Seizure 2022;101:48–51.

21. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20(1):37–46.

22. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med 2012;22:276–82.

23. Lin M, Lucas HC, Shmueli G. Research commentary: too big to fail: large samples and the p-value problem. Inf Syst Res 2013;24(4):906–17.

24. Geirhos R, Jacobsen J-H, Michaelis C, et al. Shortcut learning in deep neural networks. Nat Mach Intell 2020;2(11):665–73.

25. Kumar A, Ma T, Liang P. Understanding self-training for gradual domain adaptation. In: Proceedings of the 37th International Conference on Machine Learning. JMLR.org; 2020:5468–79.

26. Miller J, Krauth K, Recht B, et al. The effect of natural distribution shift on question answering models. In: Proceedings of the 37th International Conference on Machine Learning. PMLR; 2020:6905–16. https://proceedings.mlr.press/v119/miller20a.html. Accessed March 6, 2023.

27. Ben-David S, Blitzer J, Crammer K, et al. A theory of learning from different domains. Mach Learn 2010;79(1–2):151–75.

28. Santus E, Li C, Yala A, et al. Do neural information extraction algorithms generalize across institutions? JCO Clin Cancer Inform 2019;3:1–8.

29. Guo LL, Pfohl SR, Fries J, et al. Evaluation of domain generalization and adaptation on improving model robustness to temporal dataset shift in clinical medicine. Sci Rep 2022;12(1):2726.

30. Subbaswamy A, Saria S. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics 2020;21(2):345–52.

31. Pooch EHP, Ballester P, Barros RC. Can we trust deep learning based diagnosis? The impact of domain shift in chest radiograph classification. In: Thoracic Image Analysis: Second International Workshop, TIA 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 8, 2020, Proceedings. Springer-Verlag; 2020:74–83.

32. Zhang H, Dullerud N, Seyyed-Kalantari L, et al. An empirical framework for domain generalization in clinical settings. In: Proceedings of the Conference on Health, Inference, and Learning. Association for Computing Machinery; 2021:279–90.

33. Hendrycks D, Liu X, Wallace E, et al. Pretrained transformers improve out-of-distribution robustness. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2020:2744–51.

34. McCoy RT, Min J, Linzen T. BERTs of a feather do not generalize together: large variability in generalization across models with similar test set performance. In: Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics; 2020:217–27.

35. Khambete MP, Su W, Garcia JC, et al. Quantification of BERT diagnosis generalizability across medical specialties using semantic dataset distance. AMIA Jt Summits Transl Sci Proc 2021;2021:345–54.

36. Chang M-W, Ratinov L, Roth D, et al. Importance of semantic representation: dataless classification. In: Proceedings of the 23rd National Conference on Artificial Intelligence. Vol. 2. AAAI Press; 2008:830–5.

37. Gabrilovich E, Markovitch S. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence. Morgan Kaufmann Publishers Inc.; 2007:1606–11.

38. Haj-Yahia Z, Sieg A, Deleris LA. Towards unsupervised text classification leveraging experts and word embeddings. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2019:371–9.

39. Schopf T, Braun D, Matthes F. Evaluating unsupervised text classification: zero-shot and similarity-based approaches. 2023. https://doi.org/10.48550/arXiv.2211.16285. Accessed March 6, 2023.

40. Kongwudhikunakorn S, Waiyamai K. Combining distributed word representation and document distance for short text document clustering. J Inf Process Syst 2020;16:277–300.

41. Kusner M, Sun Y, Kolkin N, et al. From word embeddings to document distances. In: Proceedings of the 32nd International Conference on Machine Learning. PMLR; 2015:957–66. https://proceedings.mlr.press/v37/kusnerb15.html. Accessed March 4, 2023.

42. Laparra E, Bethard S, Miller TA. Rethinking domain adaptation for machine learning over clinical language. JAMIA Open 2020;3(2):146–50.

43. Soni S, Roberts K. Evaluation of dataset selection for pre-training and fine-tuning transformer language models for clinical question answering. In: Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association; 2020:5532–8. https://aclanthology.org/2020.lrec-1.679. Accessed October 22, 2021.

44. Yin W, Hay J, Roth D. Benchmarking zero-shot text classification: datasets, evaluation and entailment approach. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics; 2019:3914–23.

45. Halder K, Akbik A, Krapac J, et al. Task-aware representation of sentences for generic text classification. In: Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics; 2020:3202–13.

46. Alcoforado A, Ferraz TP, Gerber R, et al. ZeroBERTo: leveraging zero-shot text classification by topic modeling. In: Pinheiro V, Gamallo P, Amaro R, et al., eds. Computational Processing of the Portuguese Language. Springer International Publishing; 2022:125–36.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.