Abstract

Objectives

Clinical text processing offers a promising avenue for improving multiple aspects of healthcare, though operational deployment remains a substantial challenge. This case report details the implementation of a national clinical text processing infrastructure within the Department of Veterans Affairs (VA).

Methods

Two foundational use cases (cancer case management, and suicide and overdose prevention) illustrate how text processing can be practically implemented at scale for diverse clinical applications using shared services.

Results

Insights from these use cases underline both commonalities and differences, providing a replicable model for future text processing applications.

Conclusions

This project enables more efficient initiation, testing, and future deployment of text processing models, streamlining the integration of these use cases into healthcare operations. The implementation is in a large integrated health delivery system in the United States, but we expect the lessons learned to be relevant to any health system, including smaller local and regional systems.

Introduction

Clinical text processing has the potential to improve the delivery, quality, and safety of healthcare. There are many research applications, including the extraction of clinical concepts and other information from clinical progress notes that may be undercoded or absent from structured data1–3; however, examples of health systems operationalizing these tools for real-world clinical decision-making, population health management, and quality measurement are limited.4,5 This represents a missed opportunity to turn rich, unstructured data into actionable information that can be used for quality and performance initiatives within a health system.4–6

The Department of Veterans Affairs (VA) has a long history of electronic health record (EHR) innovation. The VA’s on-prem Corporate Data Warehouse (CDW) has registered over 24 million patients and contains EHR data that dates to the year 2000. The next step in VA’s data journey is the Health Data and Analytics Platform (HDAP), a scalable, cloud-native enterprise data analytics platform with the goal of enabling big data projects and lowering the barrier to developing, deploying, and operationalizing projects, such as text processing models, for clinical decision support.

Our primary project objective is to leverage this new platform for the development of text processing models and, crucially, to simplify the transition from prototyping and research to deploying models at scale and integrating them into clinical workflows. A secondary objective is to reduce silos and project duplication by facilitating the discovery and utilization of existing text processing models and their structured text outputs across different program offices. Here, we present 2 initial use cases focused on 2 of VA's top clinical priorities: cancer care and the prevention of suicide and overdose.

Methods

Case study: improving care and safety in the VHA using the Coordinated Care Tracking System

The Coordinated Care Tracking System (CCTS) ensures timely detection, triage, and longitudinal follow-up of cancer.7 It is a web-based, electronic medical record (EMR)-linked tracking system for care coordination and management of patients with existing or suspected complex disease states, including lung cancer. The CCTS was migrated to HDAP to allow broader implementation and enhanced clinical decision support. It is available nationally for all 130 VA VistA EHR installations and is currently adopted at 60 VA Medical Centers. The goal of CCTS text processing is to identify lung nodules from free text in radiology reports. Reports identified as containing potentially relevant lung nodules that are not already flagged for follow-up are then surfaced within the CCTS application for human review to determine whether additional follow-up is necessary.

Case study: improving overdose and suicide prediction with clinical text processing

Despite well-established evidence for the role of social determinants of health (SDoH) as risk factors for suicide and overdose, state-of-the-art algorithms draw almost exclusively from clinical risk factors. To address this gap in clinical risk assessment, the HDAP pipeline was designed to extract SDoH, such as loneliness and access to lethal means, as well as stressful life events such as sexual trauma. The SDoH concepts are based on standards established through the Gravity Project's work on advancing SDoH and health equity data interoperability, customized by VA subject matter experts for the nuances of VA data and the care needs specific to veterans.8 These standards support the consistent use of SDoH data across organizations, providers, and caregivers, and help to facilitate social risk data collection through intervention activities such as referrals, counseling, and care coordination.

The SDoH extracted in this project are being piloted as part of patient risk review dashboards to support mental health care delivery and will be used as features in the next iteration of predictive models used to target prevention interventions to patients with elevated suicide and overdose risk. One of these models is the Stratification Tool for Opioid Risk Mitigation (STORM), which estimates the risk of opioid-related serious adverse events in the next year for VHA patients.9 STORM is used as a point-of-care decision support system for all VHA patients, facilitating risk-benefit evaluation for patients considering opioid therapy and management of patients with behavioral health needs. Another model estimates the risk of suicide death in the next month for all VHA patients. It is used by the Recovery Engagement and Coordination for Health - Veterans Enhanced Treatment (REACH VET) suicide prevention program to target case review and outreach to patients estimated to be within the top 0.1% of risk.10

Importance of codesigning with clinical and technical experts

A core feature of our approach is codesigning with clinical experts and future users, as well as the software and machine learning experts responsible for algorithm and pipeline development (Figure 1). A major barrier to operationalizing clinical text processing algorithms is that they often do not start with a core clinical need, nor do they plan for clinical workflow integration.4 In this project, we ensured these critical factors were addressed from the outset. For CCTS, user needs center on daily identification of potentially missed cases of lung nodules within the software application to enable timely intervention by care coordinators. To help users judge accuracy, it was essential to highlight the specific sections of text identified by the text processing algorithm, enabling providers to easily locate and evaluate the results. These needs led the technical team to prioritize frequent model runs, establish data pipelines to the necessary locations, and focus on named entity recognition (NER) models, which allow highlighting of the relevant text, rather than text classification models, which do not (see the sketch below). In the SDoH use case, there were multiple specific downstream clinical needs. The future integration of the extracted concepts into 2 risk algorithms already in production dictated the data location and output format needed for the text processing models. Furthermore, clinical experts actively serving VHA patients provided input on which SDoH concepts may be most extractable and impactful for improving downstream risk assessment, as well as on their validation for use within national clinical decision support tools.
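To make this design choice concrete, the sketch below shows the span-level output of a token-classification (NER) pipeline; the character offsets it returns are what a user interface needs to highlight text, whereas a whole-note classifier returns only a single label. This is a minimal illustration using a public, general-domain Hugging Face checkpoint, not a VA production model, so the entity types shown are generic rather than nodule- or SDoH-specific.

```python
# Minimal sketch: token-classification (NER) output carries character
# offsets an application can use to highlight spans of a note; a document
# classifier would return only one label for the whole note. The checkpoint
# is a public general-domain NER model used purely to show the output shape.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",       # illustrative public checkpoint
    aggregation_strategy="simple",
)

text = "Patient seen at the Palo Alto VA Medical Center."
for entity in ner(text):
    # Each entity dict includes `start` and `end` offsets into `text`.
    print(entity["start"], entity["end"], entity["word"], entity["entity_group"])
```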

Figure 1. Steps of the developed clinical text processing framework.

Importance of developing a system that works across projects

Another core feature of our approach is the design of frameworks that promote interoperability across projects, reducing the effort, cost, and variability between text processing implementations at VA. Traditionally, such projects have operated in isolation, with disparate teams creating custom implementations that often lack consideration for production requirements. By establishing a centralized infrastructure with daily access to cleaned, deidentified clinical notes, annotation tools, and production pipelines, we provide a common starting place for text processing initiatives. This centralization not only reduces computational and maintenance costs but also streamlines the development process. In the future, we expect this approach to improve discoverability and usability of validated text processing extractions across projects.

HDAP text processing infrastructure

We built an end-to-end system for developing and deploying text processing models at VA, with lung nodule identification and SDoH as the initial use cases. Flexible custom code supports both projects with minimal changes, creating a text processing solution that can be easily modified for future use cases. The main differences between the 2 projects are the data and the initial entity extraction.

This system is composed of 5 main steps (Table 1, Figure 2): data centralization, data pre-processing, data annotation, model training, and productionizing.

Figure 2. Use case development process.

Table 1. Steps of the developed clinical text processing framework.

Data centralization
  • Daily update of free-text notes into the data platform

Data pre-processing
  • Joining relevant tables
  • Special character removal
  • PII masking

Data annotation
  • eHOST software
  • Pre-highlighting of relevant terms

Model training
  • Pre-written scripts and workflows using open-source frameworks such as PyTorch,11 NVIDIA NeMo,12 Azure Machine Learning,13 Hugging Face Transformers,14 Apache Spark,15 and Bidirectional Encoder Representations from Transformers (BERT)-based models16

Productionizing
  • Daily model runs
  • Pipelines connecting text processing output to operational applications
  • Future: promoting data products to be discoverable across projects

Step 1: Data centralization

We created a central repository in HDAP where free-text notes are regularly updated and analyzed. We set up a workflow using Azure workflows to deposit a daily update of EMR notes: radiology reports, pathology reports, and Text Integration Utility (TIU) notes (all other clinical progress notes).17 VA generates 1.5-2 million TIU notes per day, and the total size of all notes is 9 terabytes compressed, so a big data platform like HDAP was crucial for the extraction and manipulation of these data.
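As an illustration of this step, a daily incremental load might look like the following sketch, assuming a Spark environment on the platform; the file paths and the entry_date column are hypothetical stand-ins for the real note sources, and the production workflow is orchestrated by Azure rather than run ad hoc.

```python
# A minimal sketch of a daily incremental note load into the central
# repository. Paths and column names are hypothetical placeholders.
from datetime import date, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-note-ingest").getOrCreate()

# Pull only the notes entered since the previous day's run.
cutoff = (date.today() - timedelta(days=1)).isoformat()
new_notes = (
    spark.read.parquet("/landing/tiu_notes/")         # hypothetical path
         .where(f"entry_date >= '{cutoff}'")          # hypothetical column
)

# Append the increment to the centralized free-text note store.
new_notes.write.mode("append").parquet("/warehouse/clinical_notes/tiu/")
```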

Step 2: Data pre-processing

For both use cases, we extracted about 150 million free-text notes. Data were cleaned by removing special characters and masking personally identifiable information (PII) via a custom regex-based method. This pre-processed dataset is available to all projects, kickstarting the initial data gathering step and reducing the overall computation cost for projects using these datasets.
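A simplified sketch of this cleaning step is below; the patterns shown are illustrative, and the production method presumably covers many more PII forms than these three.

```python
# Simplified sketch of regex-based note cleaning: strip special characters,
# then mask PII with placeholder tokens. Patterns are illustrative only.
import re

PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),             # SSN-like
    (re.compile(r"\b\d{3}[ .-]?\d{3}[ .-]?\d{4}\b"), "[PHONE]"), # phone-like
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),     # email-like
]
# Anything outside printable ASCII (keeping newlines and tabs) counts as a
# "special character" here and is replaced with a space.
SPECIAL_CHARS = re.compile(r"[^\x20-\x7E\n\t]")

def clean_note(text: str) -> str:
    """Remove special characters, then mask PII with placeholder tokens."""
    text = SPECIAL_CHARS.sub(" ", text)
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

print(clean_note("Pt called from 555-123-4567 re: f/u \u2014 SSN 123-45-6789"))
# -> Pt called from [PHONE] re: f/u   SSN [SSN]
```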

Step 3: Data annotation

After pre-processing, the projects differ in their initial entity extraction. For SDoH, a clinical concept recognition tool, CLEVER, extracts free-text snippets containing SDoH terms from TIU notes to create the training and testing data.18,19 For CCTS, VA-based radiology subject matter experts identified terms for lung nodules and flags for reviewing follow-up cases. We also integrated the data annotation software eHOST for expert annotation to improve upon the initial labels for the models.20 We are currently working with subject matter experts to annotate terms for both projects.
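As a sketch of how pre-highlighting candidates can be generated for annotators, the snippet below finds character-offset spans of expert-curated terms in a report; the lexicon is a hypothetical fragment, and the conversion of these spans into eHOST's annotation schema is not shown.

```python
# Sketch: generate candidate spans from an expert term list so annotators
# can accept, reject, or re-highlight them. The lexicon is a hypothetical
# fragment of the radiology term list.
import re

LEXICON = ["nodule", "ground-glass opacity", "mass"]
# Longest terms first so multi-word terms win over their sub-terms.
PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(LEXICON, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

def candidate_spans(note_text: str):
    """Return (start, end, matched_term) character offsets for annotators."""
    return [(m.start(), m.end(), m.group(0)) for m in PATTERN.finditer(note_text)]

report = "IMPRESSION: 6 mm nodule in the right upper lobe; no new mass."
for start, end, term in candidate_spans(report):
    print(f"{start}-{end}: {term}")
# 17-23: nodule
# 56-60: mass
```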

Step 4: Model training

Models were fine-tuned from a base NVIDIA BioMegatron large language model, with training and inference run on Azure distributed GPU cluster compute.
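For illustration, the sketch below fine-tunes a BERT-style token-classification model with Hugging Face Transformers; a public checkpoint stands in for BioMegatron, and the toy two-note dataset, tag set, and hyperparameters are placeholders rather than the project's actual configuration.

```python
# Minimal NER fine-tuning sketch with Hugging Face Transformers. A public
# BERT checkpoint stands in for NVIDIA BioMegatron; the toy dataset and
# hypothetical NODULE tag set are placeholders for the eHOST-derived data.
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

LABELS = ["O", "B-NODULE", "I-NODULE"]  # hypothetical tag set

# Toy word-level examples standing in for the annotated training data.
raw = {
    "tokens": [["A", "4mm", "nodule", "in", "the", "left", "lobe", "."],
               ["No", "acute", "findings", "."]],
    "ner_tags": [[0, 1, 2, 0, 0, 0, 0, 0],
                 [0, 0, 0, 0]],
}
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_and_align(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    # Label the first sub-token of each word; mask the rest with -100.
    labels, prev = [], None
    for wid in enc.word_ids():
        labels.append(-100 if wid is None or wid == prev
                      else example["ner_tags"][wid])
        prev = wid
    enc["labels"] = labels
    return enc

ds = Dataset.from_dict(raw).map(tokenize_and_align,
                                remove_columns=["tokens", "ner_tags"])
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nodule-ner", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```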

Step 5: Productionizing

We have started productionizing models by establishing daily prediction runs using Azure Machine Learning and Azure Data Factory, ensuring models run as soon as the day's data are available. We also set up pipelines to transfer results to downstream systems. At VA, operational clinical applications run in a variety of environments: CDW is one of the most common locations, and applications are increasingly being powered by HDAP. Since many applications run in CDW, including the mental health dashboards for the SDoH use case, we set up a daily ETL from HDAP to CDW in which text extractions are transferred and loaded into an on-prem CDW SQL database. For applications that run in HDAP, such as CCTS, extractions are loaded directly into a cloud SQL database. These 2 workflows allow text extractions to be integrated into a wide range of VA applications.
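A sketch of the HDAP-to-CDW transfer is below, assuming Spark with a SQL Server JDBC driver is available on the platform; the connection string, table name, and credential handling are hypothetical placeholders for the real ETL.

```python
# Minimal sketch of loading daily text extractions into a downstream SQL
# database over JDBC. URL, table, and credentials are placeholders; a real
# deployment would pull secrets from a managed secret store.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("extraction-etl").getOrCreate()

# One row per extracted concept, with note and patient provenance.
extractions = spark.read.parquet("/warehouse/nlp/sdoh_extractions/")  # hypothetical

(extractions.write
            .format("jdbc")
            .option("url", "jdbc:sqlserver://cdw-host;databaseName=NLP")  # placeholder
            .option("dbtable", "dbo.SDoHExtractions")                     # placeholder
            .option("user", "etl_service")                                # placeholder
            .option("password", "<from-secret-store>")
            .mode("append")
            .save())
```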

Onboarding new use cases

We developed a system for onboarding new use cases that can leverage this infrastructure, detailed in Table 2. We have established a data science team on HDAP that assists with training, onboarding new use cases, and other support activities.

Table 2. Pathways for new use cases.

Intake and onboarding
  Users submit a project intake request to be onboarded to HDAP.

Data access
  Currently available notes: Onboarded users have immediate access to the centralized and cleaned notes that we currently process (TIU, pathology, radiology) (from Step 1).
  New types of notes: HDAP already has daily updates of all major free-text notes in VA; however, new types of notes can also be uploaded, if external, or pulled into our daily ETL processes, if they already exist elsewhere in our systems.

Data annotation
  A supported data annotation environment (including eHOST) is available for use across projects, with set-up and access supported by our internal HDAP data science team. Scripts and workflows are also available for document prepping, content loading, and schema conversion, to send data to and from eHOST schemas. For the SDoH use case, a public website is used to support annotation and provides guidelines and concept definitions. This is the current link: http://ec2-18-206-230-88.compute-1.amazonaws.com/wordpress/?page_id=556

Model training
  We provide code and workflows for model training and fine-tuning from data annotation outputs, as well as a range of open-source models that have been brought into HDAP.

Productionizing new use cases
  Current downstream locations (HDAP and CDW): When text processing algorithms are ready for productionization, they can be included in the current daily model-running workflows and ETLs, with support from our HDAP data science team.
  New downstream operational locations: When necessary, additional ETLs to new locations can be established by the HDAP data science team.

Discussion

Despite the plethora of text processing research in healthcare, few projects have bridged the gap from research to operations. To address this challenge, we implemented a text processing infrastructure within the VA, creating a framework that simplifies prototyping, deploying, and integrating text processing projects into healthcare operations. This project also establishes a framework for sharing resources (eg, data, code, production pipelines) among groups to reduce silos and lower the barrier to starting future projects. Crucially, this project had software and machine learning experts working closely with clinical experts and their workflows, ensuring projects were grounded in clinical use cases and were developed with healthcare operations in mind from the beginning.

Overall, building text processing infrastructure that bridges use cases was a more efficient use of human and compute resources, since many steps are shared between use cases. However, we identified a few needed customizations between use cases, particularly in the term annotation process. To increase expert annotation speed, we pre-highlighted potential terms for annotators to accept, reject, or reject and highlight elsewhere, with the hypothesis that this pre-highlighting would save annotation time and allow more notes to be annotated. We found this approach worked well for the radiology use case, where the pre-highlighted terms were quite accurate and the term vocabulary was limited. However, for the SDoH use case, we found that auto-highlighting note text led annotators to review the notes too quickly, producing low-quality annotations. This was likely due to the relative complexity of annotating SDoH from unstructured EHR data compared with more established biomedical concepts, such as nodule size, character, or change in dimension. This demonstrates the importance of testing the annotation steps for each unique use case up front to identify the most accurate approach.

The next steps are to continue improving the text processing algorithms and infrastructure, including integrating use cases directly into healthcare operations and establishing a framework for model monitoring and evaluation of clinical benefit, while also enhancing the framework's usability and scalability. We are expanding the use cases on our platform, having onboarded 3 more use cases over the past few months. Furthermore, we are developing governance strategies for text extractions from EHR data to make them more usable and discoverable across VA.

Acknowledgments

We thank the VA Corporate Data Warehouse (CDW) team, the VA Health Data and Analytics Platform (HDAP) team, the Program Evaluation Resource Center (PERC) Platform Support team, and the VA Office of the CTO (OCTO) for their work in developing VA data platforms, as well as centralizing, maintaining, and processing VA EHR data. In particular, we thank Sruti Kanthan and Michael Slentz from the HDAP data science team. We thank Dr. Nigam Shah for comments on the manuscript.

Author contributions

K.F.M. and S.R.T. wrote the article. S.S. and F.L.L. developed the CCTS application. N.C. served as a radiology expert. S.F., E.M.F., and G.G. served as NLP technical experts. J.M.S. and A.S.S. developed machine learning models and code. J.S. and J.D. contributed to underlying data aggregation and management. K.F.M., T.H.T., J.E., J.A.T., and S.R.T. oversaw the work.

Funding

The project received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Conflicts of interest

None declared.

Data availability

There are no specific data to release related to this project.

References

1. Patra BG, Sharma MM, Vekaria V, et al. Extracting social determinants of health from electronic health records using natural language processing: a systematic review. J Am Med Inform Assoc. 2021;28(12):2716-2727.

2. Sy LW, Osorio C, Huang J. An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. Comput Methods Programs Biomed. 2019;177:141-153.

3. Li I, Pan J, Goldwasser J, et al. Neural natural language processing for unstructured data in electronic health records: a review. Comput Sci Rev. 2022;46:100511.

4. Lederman A, Lederman R, Verspoor K. Tasks as needs: reframing the paradigm of clinical natural language processing research for real-world decision support. J Am Med Inform Assoc. 2022;29(10):1810-1817.

5. Tamang S, Humbert-Droz M, Gianfrancesco M, et al. Considerations for developing clinical natural language processing systems for population health management and measurement. JMIR Med Inform. 2023;11(1):e37805.

6. Wen A, Fu S, Moon S, et al. Desiderata for delivering NLP to accelerate healthcare AI advancement and a Mayo Clinic NLP-as-a-service implementation. NPJ Digit Med. 2019;2(1):130.

7. Zhang Y, Mezzacappa C, Shen L, et al. Cancer tracking system improves timeliness of lung cancer care at a Veterans Hospital: a comparison of cohorts before and after implementation of an automated care coordination tool. PLoS Digit Health. 2022;1(8):e0000080.

8. The Gravity Project. Accessed October 28, 2023. https://thegravityproject.net

9. Oliva EM, Bowe T, Tavakoli S, et al. Development and applications of the Veterans Health Administration's Stratification Tool for Opioid Risk Mitigation (STORM) to improve opioid safety and prevent overdose and suicide. Psychol Serv. 2017;14(1):35.

10. McCarthy JF, Cooper SA, Dent KR, et al. Evaluation of the Recovery Engagement and Coordination for Health-Veterans Enhanced Treatment suicide risk modeling clinical program in the Veterans Health Administration. JAMA Netw Open. 2021;4(10):e2129900.

11. Paszke A, Gross S, Massa F, et al. PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst. 2019;32:1-12.

12. Kuchaiev O, Li J, Nguyen H, et al. NeMo: a toolkit for building AI applications using neural modules. arXiv, arXiv:1909.09577, 2019, preprint: not peer reviewed.

13. Azure Machine Learning. Accessed June 23, 2023. https://azure.microsoft.com/en-us/products/machine-learning

14. Wolf T, Debut L, Sanh V, et al. HuggingFace's Transformers: state-of-the-art natural language processing. arXiv, arXiv:1910.03771, 2019, preprint: not peer reviewed.

15. Zaharia M, Xin RS, Wendell P, et al. Apache Spark: a unified engine for big data processing. Commun ACM. 2016;59(11):56-65.

16. Devlin J, Chang MW, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN. Association for Computational Linguistics; 2019:4171-4186.

17. Department of Veterans Affairs Text Integration Utilities (TIU) Technical Manual, May 2023. Accessed July 15, 2023. https://www.va.gov/vdl/documents/Clinical/CPRS-Text_Integration_Utility_(TIU)/tiutm.pdf

18. Tamang SR, Hernandez-Boussard T, Ross E, et al. Enhanced quality measurement event detection: an application to physician reporting. EGEMS (Wash DC). 2017;5(1):5.

19. Ling AY, Kurian AW, Caswell-Jin JL, et al. Using natural language processing to construct a metastatic breast cancer cohort from linked cancer registry and electronic medical records data. JAMIA Open. 2019;2(4):528-537.

20. extensible Human Oracle Suite of Tools (eHOST). Accessed July 15, 2023. https://www.github.com/chrisleng/ehost
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.