Abstract

Background

Machine learning solutions offer tremendous promise for improving clinical and laboratory operations in pathology. Proof-of-concept descriptions of these approaches have become commonplace in the laboratory medicine literature, but only a scant few have been implemented within clinical laboratories, owing to the often substantial barriers to validating, implementing, and monitoring these applications in practice. This mini-review aims to highlight the key considerations in each of these steps.

Content

Effective and responsible applications of machine learning in clinical laboratories require robust validation prior to implementation. A comprehensive validation study involves a critical evaluation of study design, data engineering and interoperability, target label definition, metric selection, generalizability and applicability assessment, algorithmic fairness, and explainability. While the main text highlights these concepts in broad strokes, a supplementary code walk-through is also provided to facilitate a more practical understanding of these topics using a real-world classification task example, the detection of saline-contaminated chemistry panels.

Following validation, the laboratorian's role is far from over. Implementing machine learning solutions requires an interdisciplinary effort across several roles in an organization. We highlight the key roles, responsibilities, and terminologies for successfully deploying a validated solution into a live production environment. Finally, the implemented solution must be routinely monitored for signs of performance degradation and updated if necessary.

Summary

This mini-review aims to bridge the gap between theory and practice by highlighting the key concepts for validating, implementing, and monitoring machine learning solutions effectively and responsibly in the clinical laboratory.

Introduction

Machine learning (ML) is a discipline within artificial intelligence (AI) dedicated to the development of algorithms that improve their performance on a human task without requiring explicit instruction. These algorithms, or models, have demonstrated the potential to automate or augment human performance across a variety of relevant tasks within the clinical laboratory (1–5). However, the transition from proof-of-concept descriptions in primary literature to tangible clinical benefits requires extensive validation and carefully crafted implementation strategies.

Once a model is rigorously validated, implementation begins by integrating it into a “pipeline,” which ingests raw input data, processes it into a model-ready format, and outputs model predictions. This process requires substantial subject matter expertise, robust information technology (IT) infrastructure, and meticulously developed software engineering solutions—collectively termed MLOps. Once implemented, ML solutions must be continuously monitored for signs of performance degradation, as the real-world clinical environment in which they are used may change over time.

This mini-review will highlight the key considerations for validating, implementing, and monitoring ML solutions in the clinical laboratory. To facilitate learning, we provide a Supplemental Website with a practical walk-through that highlights these concepts through the lens of a real-world classification task, detecting saline-contaminated chemistry results (see Supplemental Data). Additionally, italicized terms are defined in the glossary provided in Table 1. Of note, this review will not discuss the best practices for developing ML solutions, described effectively elsewhere (6, 7), nor will it cover the unique considerations for image-based or generative AI applications.

Table 1.

Glossary of key terms.

Artificial intelligence (AI): A computer system that can simulate or perform human tasks.
Machine learning (ML): A discipline within artificial intelligence (AI) dedicated to the development of algorithms that improve their performance on a human task without requiring explicit instructions.
ML algorithm/model: A learned representation of the patterns inherent in the input data that can be used to generate predictions.
Pipeline: The full sequence of steps to convert input data into output predictions, typically involving loading, reformatting, and transforming data and predictions so that they can be integrated into real-time workflows.
Development Operations (DevOps): A philosophical framework combining best practices in information technology operations and software engineering to rapidly and robustly build and implement high-quality informatics solutions.
Machine Learning Operations (MLOps): A framework of technical best practices for deploying and maintaining machine learning applications efficiently and effectively.
Target/Ground truth: The gold standard definition to which ML pipeline predictions will be compared.
Discriminative performance: The degree to which predictions match the ground truth labels, often measured by sensitivity, specificity, positive predictive value, and negative predictive value.
Implementation efficacy: The degree to which the final implementation of the ML pipeline satisfies the original need for which it was built.
Receiver operating characteristic (ROC) curve: A visualization of the trade-off between sensitivity and specificity across the full breadth of possible decision thresholds for a continuous output.
Class imbalance: The degree to which the proportion of class labels is skewed towards one label or the other.
Precision-recall curve: A visualization of the trade-off between precision (positive predictive value) and recall (sensitivity) across the full breadth of possible decision thresholds for a continuous output.
F1 score: The harmonic mean of precision and recall.
Matthews correlation coefficient (MCC): The Pearson correlation coefficient for 2 binary variables.
Cost-sensitive learning: A learning approach where each classification error is assigned its own weight to fine-tune the model's predilection towards certain error types.
Equivocal zone: An interval of continuous output in which no binary class label is assigned.
Applicability assessment: The determination of whether a new input is similar enough to a model's training data for a reliable prediction.
Demographic parity: A fairness criterion used to assess whether the outputs of a predictive model are independent of demographic groups (e.g., race, gender, or age).
Predictive parity: A fairness criterion used to assess whether the outputs of a model have equal positive and negative predictive values across demographic groups (e.g., race, gender, or age).
Equalized odds: A fairness criterion used to assess whether the outputs of a model have equal true-positive rates (sensitivity) and false-positive rates (1 − specificity) across demographic groups (e.g., race, gender, or age).
Global explainability: The ability to estimate each feature's impact on model outputs aggregated across an entire data set or feature space.
Local explainability: The ability to estimate each feature's impact on model outputs for any given individual prediction.
Governance: The processes by which organizational responsibilities and decisions are divided, evaluated, and executed.
Deployment: Making a model or application accessible to other computers within a network.
Production environment: The software systems and infrastructure in which live applications are hosted and run for day-to-day operations.
Development environment: An isolated copy of the production environment where software changes can be tested without risk of impacting live operations.
Application-programming interface (API): A protocol or framework by which various software applications can communicate with each other and exchange data or predictions.
Human-in-the-loop: An implementation paradigm where model outputs are directed towards an expert user for incorporation into their decision-making before an action is taken.
Data drift: Divergence of the input data away from the initial model training data set.
Concept drift: Divergence away from the training data in the target labels or context in which predictions are to be made.
Continuous integration (CI): A DevOps principle in which changes to software are incorporated in small, manageable chunks continuously rather than as large overhauls.
Continuous deployment (CD): A DevOps principle in which the process of updating live software applications is undertaken without large periods of maintenance or downtime.

Validation: Effective Appraisal of ML Solutions

ML solutions offer immense potential for improving clinical and operational facets of laboratory medicine. These solutions must be implemented to address unmet needs as precisely and effectively as possible, without introducing new or exacerbating existing problems (8). Effective and responsible (9) implementation is predicated on comprehensive validation. Key considerations for validation include target label definition, metric selection, generalizability and applicability assessment, algorithmic fairness, and explainability (Fig. 1). Validation begins with robust study design.

Fig. 1. Successful validation of machine learning pipelines requires evaluating applicability, performance metric selection, explainability, ground truth labels, algorithmic fairness [adapted from Azimi and Zaydman (10)], and decision threshold optimization, among others.

Designing a Validation Study

Frameworks for validating ML pipelines (11–13) share many similarities with the assay validation studies laboratorians routinely perform. Ideally, a validation study would occur as a prospective trial of the full clinical work flow. However, such end-to-end validations require a substantial operational and technical investment in MLOps infrastructure. A practical approach involves a series of iterative validation cycles, each requiring incrementally larger investments. For example, a series of 3 iterations could focus on first validating model predictions offline and asynchronously, then the isolated technical pipeline within a development environment, and finally the full, end-to-end clinical implementation.

Prior to initiating a validation study, one must first critically evaluate the gold standard to which predictions will be compared, often called the target, or ground truth. Many ML applications are built using a supervised learning framework, where the models are explicitly trained to differentiate inputs using a series of class labels. Imperfect training labels can lead to misguided implementations, so it is crucial to address these issues upfront. In a typical validation study, the models—and their training labels—will be immutable. However, the baseline by which they are assessed is not. These baselines may include comparisons to prior gold standard methods, expert-defined rule-sets, clinical definitions, or even synthetic or simulated labels (14). The ground truth label used to assess predictions should be defined after thorough and collaborative analysis by data scientists, subject matter experts, and key stakeholders. The “Establishing Ground Truth” section of the Supplemental Website provides a practical approach to exploring various types of ground truth labeling strategies for saline contamination.
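
To make the choice of labeling strategy concrete, the brief Python sketch below contrasts a hypothetical rule-based label for saline contamination with expert-adjudicated labels; the column names and thresholds are illustrative assumptions for this sketch, not the definitions used in the Supplemental Website.

```python
import pandas as pd

# Illustrative only: a hypothetical rule-based label for saline contamination,
# compared against expert chart review. Analyte names, values, and thresholds
# are placeholders, not the criteria from the Supplemental Website.
results = pd.DataFrame({
    "sodium":       [140, 152, 138, 149],
    "chloride":     [102, 125, 100, 121],
    "glucose":      [95, 40, 110, 38],
    "expert_label": [0, 1, 0, 1],   # adjudicated ground truth from chart review
})

# Candidate rule-set label: contamination with normal saline tends to raise
# sodium and chloride while diluting other analytes such as glucose.
results["rule_label"] = (
    (results["chloride"] > 115) & (results["glucose"] < 50)
).astype(int)

# Quantify agreement between the candidate definition and the expert labels
print(pd.crosstab(results["rule_label"], results["expert_label"]))
```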

Effectively Measuring Performance

Performance assessment should capture both how well the model predictions match the ground truth labels (discriminative performance) and how well the full pipeline satisfies the original need (implementation efficacy). This includes how downstream stakeholders interact with the predictions and the likely consequences of incorrect predictions. Discriminative performance is assessed using metrics such as accuracy, sensitivity, and specificity. The clinical scenario in which testing is performed may favor one metric over the others. For example, reflex testing protocols used in newborn screening (15) or infectious disease testing (16, 17) start with a highly sensitive screening assay to minimize false negatives. This sensitivity often comes at the cost of an increased false-positive rate, so cases that screen positive must be confirmed by more specific follow-up tests.

Most ML models produce continuous outputs, which must be discretized into class labels. One tool for mapping continuous model outputs onto discriminative metrics (e.g., sensitivity and specificity) is the receiver operating characteristic (ROC) curve. The area under this curve (auROC) is a useful single value for describing the extent to which the output successfully discriminates between positive and negative cases. However, accuracy, sensitivity, specificity, and the auROC can be misleading when classes are imbalanced (7). Consider a scenario where positive classes are extremely rare, such as intravenous fluid contamination or wrong-blood-in-tube errors. A model that predicts all results as negative will have a near-perfect accuracy of 0.999 and a perfect specificity at an error rate of 1 in 1000 but will not provide any clinical utility.
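
A minimal sketch of this pitfall is shown below: with a simulated 1-in-1000 contamination rate, a trivial "model" that labels every specimen as negative earns near-perfect accuracy and perfect specificity while detecting nothing.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Simulate a 1-in-1000 positive rate and a model that predicts all negatives.
rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.001).astype(int)   # ~0.1% positive class
y_pred = np.zeros_like(y_true)                        # every specimen called clean

print("accuracy:   ", accuracy_score(y_true, y_pred))             # ~0.999
print("specificity:", recall_score(y_true, y_pred, pos_label=0))  # 1.0
print("sensitivity:", recall_score(y_true, y_pred, pos_label=1))  # 0.0, no utility
```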

Metrics that incorporate the true-negative rate (e.g., accuracy, specificity) will be inflated when negatives outnumber positives, so it behooves us to highlight metrics that avoid the true-negative rate, such as positive predictive value (PPV), sensitivity, the F1 score, and the precision-recall (PR) curve. Additionally, the Matthews correlation coefficient (MCC) offers a single summary metric that incorporates all 4 quadrants of the confusion matrix without misleading reviewers in settings of class imbalance (18, 19). These metrics provide more faithful representations of clinical utility and implementation efficacy and should be reported in lieu of (or in addition to) accuracy and auROC. It is imperative that the framework by which one measures pipeline performance be a collaborative, well-informed effort between data scientists and subject matter experts to avoid blind spots between validation and implementation. The “Selecting Effective Performance Metrics” section of the Supplemental Website explores the effects of class imbalance on various performance metrics and highlights the advantages of the PR curve and MCC.
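
The sketch below computes these metrics with scikit-learn on simulated, heavily imbalanced data; the simulated features, prevalence, and default 0.5 threshold are arbitrary assumptions for illustration only, but they show how an impressive auROC can coexist with much more modest imbalance-aware metrics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             f1_score, matthews_corrcoef)

# Simulate a rare positive class with weakly separable features.
rng = np.random.default_rng(1)
n, prevalence = 50_000, 0.005
y = (rng.random(n) < prevalence).astype(int)
X = rng.normal(size=(n, 3)) + y[:, None] * 1.5

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)[:, 1]
pred = (proba >= 0.5).astype(int)

print("auROC:", roc_auc_score(y, proba))            # often looks impressive
print("auPRC:", average_precision_score(y, proba))  # tempered by low prevalence
print("F1:   ", f1_score(y, pred))
print("MCC:  ", matthews_corrcoef(y, pred))
```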

Optimizing Decision Boundaries

Converting continuous outputs to discrete class labels requires defining class boundaries, or decision thresholds. Visual inspection of ROC and PR curves can be performed but more objective solutions exist. If sensitivity and specificity are equally important, this can be optimally achieved by maximizing the Youden J index (20)—the threshold value at which the positive and negative classes are maximally differentiated. However, clinical scenarios in which the consequences of falsely positive and negative results are equal are exceedingly rare. A more nuanced approach, cost-sensitive learning, involves defining separate penalties for false positives and negatives (21). When the relative significance of false positives and negatives can be accurately estimated, cost-sensitive learning is a powerful tool for optimizing real-world performance. However, these relative significances are often difficult to objectively define, and misattributing them may lead to poor downstream performance.
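
The sketch below illustrates both strategies over a validation set of scores and labels (assumed to exist as the arrays y_true and y_score): one helper maximizes the Youden J index, and the other minimizes an assumed 10:1 false-negative to false-positive cost ratio, an arbitrary example rather than a recommendation.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_score):
    """Threshold maximizing Youden's J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]

def cost_sensitive_threshold(y_true, y_score, fn_cost=10.0, fp_cost=1.0):
    """Threshold minimizing total misclassification cost.
    The 10:1 false-negative:false-positive weighting is an arbitrary example."""
    thresholds = np.unique(y_score)
    costs = [fn_cost * np.sum((y_score < t) & (y_true == 1)) +
             fp_cost * np.sum((y_score >= t) & (y_true == 0))
             for t in thresholds]
    return thresholds[int(np.argmin(costs))]
```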

Alternatives to a single decision boundary include reporting confidence intervals and/or post-test probabilities. These approaches have clear advantages but are limited by clinicians’ variable ability to interpret probabilistic concepts appropriately (22). Another practical approach to improving predictive performance is the use of an equivocal zone [e.g., decision limits for N-terminal brain natriuretic peptide in the work-up of congestive heart failure (23)]. In its most conservative form, one could aim for perfect positive and negative predictive values by assigning the lowest output from the positive class as the threshold for assigning a negative label, the highest output from the negative class as the threshold for assigning a positive label, and reporting everything between as “equivocal.” While this approach can improve the fidelity of the predictions when they are made, it comes at the cost of decreasing clinical utility as the proportion of inputs without a meaningful prediction increases. Exploring the nuances of this trade-off is another important opportunity for collaboration between data scientists and subject matter experts. The “Optimizing Decision Boundaries” section of the Supplemental Website explores the trade-offs between PPV, NPV, and prediction rate with and without equivocal zones.
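
A minimal sketch of an equivocal zone is shown below; the two thresholds are illustrative placeholders that, in practice, would be tuned against the PPV, NPV, and prediction-rate trade-offs described above.

```python
import numpy as np

def apply_equivocal_zone(y_score, lower=0.10, upper=0.90):
    """Assign "negative" below the lower threshold, "positive" above the upper
    threshold, and "equivocal" in between. Thresholds are illustrative only."""
    labels = np.full(y_score.shape, "equivocal", dtype=object)
    labels[y_score <= lower] = "negative"
    labels[y_score >= upper] = "positive"
    return labels

def reporting_rate(labels):
    """Fraction of specimens receiving a definitive (non-equivocal) call."""
    return float(np.mean(labels != "equivocal"))

def ppv_npv_on_reported(y_true, labels):
    """PPV and NPV computed only among the definitive calls."""
    pos, neg = labels == "positive", labels == "negative"
    ppv = float(np.mean(y_true[pos] == 1)) if pos.any() else float("nan")
    npv = float(np.mean(y_true[neg] == 0)) if neg.any() else float("nan")
    return ppv, npv
```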

Recognizing Anomalous Inputs

Regardless of how the predictions are reported, an effective ML pipeline must also be able to recognize when no prediction should be made in the first place. Applicability domain assessment, or applicability, refers to the identification of inputs that are too different from the training data for the pipeline to render a reliable prediction. Without explicit applicability assessment, any numeric input will yield a model output. For example, if the inputs to a model are being manually transcribed, and a typographical error leads to a calcium concentration of 120 mg/dL instead of 12.0 mg/dL, a prediction should not be made, as the model has likely never been trained on such extreme, nonphysiologic results.

In its simplest form, applicability assessment can take the form of upper and lower limits for each input, much like feasibility limits within a laboratory information system. However, a skilled data scientist can add robustness against subtler patterns of divergence across multiple analytes by applying approaches such as the Mahalanobis distance or principal component analysis. See the “Applicability Assessment” section of the Supplemental Website for more detail.
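
The sketch below outlines one such multivariate check, assuming the training inputs and a new specimen's results are available as numeric arrays; the chi-square cutoff is an illustrative choice rather than a recommended limit.

```python
import numpy as np
from scipy.stats import chi2

def fit_applicability(X_train):
    """Summarize the training inputs for later Mahalanobis-distance checks."""
    mu = X_train.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X_train, rowvar=False))
    cutoff = chi2.ppf(0.999, df=X_train.shape[1])   # permissive illustrative limit
    return mu, cov_inv, cutoff

def is_applicable(x_new, mu, cov_inv, cutoff):
    """Return True when the new input lies within the training distribution."""
    delta = np.asarray(x_new, dtype=float) - mu
    d2 = float(delta @ cov_inv @ delta)             # squared Mahalanobis distance
    return d2 <= cutoff
```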

Measuring Algorithmic Fairness

While ML pipelines represent an immense opportunity for laboratory medicine, they are also fraught with the potential to exacerbate existing healthcare disparities. Azimi and Zaydman (10) highlight the vulnerabilities present in each step of the ML lifecycle, from problem definition to performance monitoring. We will highlight one aspect particularly relevant to ML validation—fairness concepts and their associated metrics.

During validation, analyzing predictive performance with respect to clinical and demographic subgroups may help avert dangerously inequitable implementations (24). This performance can be contextualized within 3 concepts of group fairness: demographic parity, where predictions show no dependence on subgroup membership; predictive parity, where PPV and NPV are identical across subgroups; and equalized odds, where sensitivity and specificity are identical across subgroups (10). See the “Algorithmic Fairness” section of the Supplemental Website for more detail.

As no informative classifier can satisfy more than 2 of these in a situation where incidence varies across subgroups, it is crucial that the results of these validation studies are discussed with a multidisciplinary group of key stakeholders so that the ultimate decisions over equity trade-offs can be as informed and responsible as possible.
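
As a practical starting point for those discussions, the sketch below stratifies performance by subgroup; it assumes a data frame with one row per prediction containing the ground truth label, the predicted label, and a demographic grouping variable (the column names are illustrative).

```python
import pandas as pd

def subgroup_report(df, truth="y_true", pred="y_pred", group="subgroup"):
    """Tabulate the quantities underlying the three group-fairness criteria."""
    rows = []
    for name, g in df.groupby(group):
        tp = int(((g[pred] == 1) & (g[truth] == 1)).sum())
        tn = int(((g[pred] == 0) & (g[truth] == 0)).sum())
        fp = int(((g[pred] == 1) & (g[truth] == 0)).sum())
        fn = int(((g[pred] == 0) & (g[truth] == 1)).sum())
        rows.append({
            group: name,
            "positive_rate": (tp + fp) / len(g),                 # demographic parity
            "ppv": tp / (tp + fp) if tp + fp else None,          # predictive parity
            "npv": tn / (tn + fn) if tn + fn else None,
            "sensitivity": tp / (tp + fn) if tp + fn else None,  # equalized odds
            "specificity": tn / (tn + fp) if tn + fp else None,
        })
    return pd.DataFrame(rows)
```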

Explaining Models and Predictions

A pipeline whose predictions can be meaningfully interpreted or explained may be easier to validate, troubleshoot, and implement. While ML has long been saddled with the “black box” moniker, advances in explainability techniques have paved the way for granular interrogation of models and predictions. This takes 2 main forms: model-level, or global explainability, and prediction-level, or local explainability. While global explainability is helpful in development and validation, it is local, prediction-level explainability that may play a critical role in implementation, and how clinical applications of ML are regulated (25). The “Explaining Models and Predictions” section of the Supplemental Website dives deeper into these techniques and their common implementations.
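
As one concrete example of a global technique, the sketch below applies permutation importance to a model trained on synthetic data; this is illustrative only and is not necessarily the method used in the Supplemental Website.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Train a model on synthetic data purely for illustration.
X, y = make_classification(n_samples=2000, n_features=6, n_informative=3,
                           random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the drop in performance; larger
# drops imply greater aggregate (global) reliance on that feature.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature_{i}: {importance:.3f}")
```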

Explainability methods can instill confidence when they conform to our a priori hypotheses, but it is important to note that both the explanations and our own ad hoc estimates of feature importance are never perfect. Explainability techniques may shed light on associations that even expert observers would miss, as observed in models that can reliably predict sex from retinal scans (26). However, a biased model's explanations may lead to automation bias (27) and reduced overall performance (28) in human-in-the-loop approaches. As such, if explainability is a prerequisite for clinical implementation, it will be crucial to validate the explanations in parallel with the predictions. However, the optimal framework for validating and implementing explainability within ML pipelines remains a nascent and rapidly evolving field (29), especially within laboratory medicine applications.

Implementation: Deploying ML Pipelines within Production Environments

Once a pipeline has been robustly validated and the decision to implement is made, the laboratorian's role is far from over. Effective implementation requires tremendous collaboration across stakeholder groups with diverse backgrounds, skill sets, and priorities to ensure that clinical, operational, and technical needs are met. While the implementation details and the specifics of the laboratorian's role will vary by institution, we will highlight some near-universal terms and concepts.

Key Roles and Responsibilities

Figure 2 presents the common roles that an effective, multidisciplinary team may feature. It is important to highlight that these roles are not exclusive, nor are they an exhaustive list. Often, individuals will fill multiple roles at any given time. This is particularly true for institutions early in the maturation of their informatics infrastructure.

Fig. 2. Successful implementation of machine learning pipelines requires a collaborative, multidisciplinary effort within an organization. The key roles, responsibilities, and terminologies are described in this figure.

Governance refers to the processes by which organizational responsibilities and decisions are divided, evaluated, and executed. In the context of machine learning operations, governance involves the rules, practices, and frameworks that ensure responsible implementation. One tool for exploring governance structures is a RACI diagram (30), in which each stakeholder in a project is assigned one level of Responsible, Accountable, Consulted, or Informed.

“Deploying to Production”: Technical Terms and Concepts

The ultimate goal for implementation is to deploy a model into the institution's production environment. The collection of best practices for achieving this is referred to as Development Operations (DevOps), or MLOps in the context of ML solutions (31). Given the collaborative nature of these projects, laboratorians will find themselves in meetings where technical details are being discussed. Familiarizing ourselves with some key concepts in MLOps may help us make meaningful contributions when an implementation veers off course.

Deploying a model means making it accessible to other computers within a network. Deployment is often made much simpler by packaging the model alongside the instructions, tools, and code required to run it within an isolated, self-sufficient unit called a “container.” The containerized model can then be hosted on a server, that is, a dedicated computer that runs the model. Server hosting decisions typically aim to balance total costs with desirable features such as performance (how quickly predictions can be generated, measured as “latency”), reliability (what percentage of the time the model is available for predictions, measured as “uptime”), scalability (how quickly and easily the server can adapt to changes in request load), and security (how vulnerable the server is to breaches). Laboratorians are likely to have evaluated aspects of latency, uptime, and scalability when considering turnaround time, redundancy, and downtime procedures while evaluating new assays, automation lines, or instrumentation within the laboratory.

Once a server solution has been developed, it must be integrated within a final prediction pipeline. This typically involves an application-programming interface (API), which takes a series of inputs from a data stream and returns the corresponding model predictions alongside any associated metadata. If the model is being implemented in a fully autonomous fashion, these predictions can be routed directly to downstream processes, much like autoverification rules within the laboratory information system. However, with a human-in-the-loop approach, the user typically interacts with the model through a client, or user interface.
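
As a simplified illustration of such an API, the sketch below wraps a hypothetical serialized model in a FastAPI endpoint; the field names, file path, and feature order are assumptions for this example only. An endpoint like this would then be containerized and hosted subject to the latency, uptime, and scalability considerations described above.

```python
# A minimal sketch of a prediction API; "model.joblib" and the panel fields
# are hypothetical placeholders rather than a reference implementation.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")   # hypothetical serialized pipeline

class ChemistryPanel(BaseModel):
    sodium: float
    chloride: float
    glucose: float

@app.post("/predict")
def predict(panel: ChemistryPanel):
    features = [[panel.sodium, panel.chloride, panel.glucose]]
    probability = float(model.predict_proba(features)[0][1])
    return {"contamination_probability": probability}
```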

Best practices suggest performing this integration in an environment that is completely isolated from the live, clinical infrastructure (32). These stand-alone testing grounds are often referred to as development environments and serve the role of a risk-free “sandbox” in which any changes can be battle-tested before they are incorporated into the live “production” environment.

Monitoring the Deployed Model: Quality Assurance for ML Solutions

Much like the analyzers inside the laboratory, ML models typically degrade in performance over time. Moreover, the clinical context in which these predictions are being made often evolves over time. Together, these factors necessitate a robust framework for monitoring and updating ML pipelines, which can be contextualized through comparison to best practices for quality assurance and quality control (QA/QC) of laboratory results (33) (Fig. 3).

Fig. 3. Successful monitoring of machine learning pipelines requires a robust infrastructure for detecting drifting inputs (termed data drift) or drifting target labels (termed concept drift). When performance degradation is observed, the deployed model can be updated.

Causes of Performance Degradation

The DevOps correlate to conventional QC is called unit testing, in which known sets of inputs must produce a predictable output before a change can go live. However, unlike analytical QC, decreases in ML model performance are not due to any technical or mechanical degradation but occur when the real-world data and environment in which the model is used diverge—or drift—from those upon which it was trained. There are 2 primary types of model drift: data drift and concept drift.
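
A minimal sketch of such unit tests, written with pytest against a hypothetical pipeline entry point, is shown below; the function name, inputs, and expected outputs are illustrative assumptions.

```python
# Known inputs must produce predictable outputs before a change goes live.
# "my_pipeline.predict_panel" is a hypothetical entry point for illustration.
import pytest
from my_pipeline import predict_panel

def test_known_contaminated_specimen_flags_positive():
    assert predict_panel({"sodium": 155, "chloride": 128, "glucose": 35}) == 1

def test_known_clean_specimen_flags_negative():
    assert predict_panel({"sodium": 140, "chloride": 102, "glucose": 95}) == 0

def test_nonphysiologic_input_is_rejected():
    with pytest.raises(ValueError):
        predict_panel({"sodium": 140, "chloride": 102, "glucose": -10})
```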

Data drift occurs when the input data changes over time. This can be gradual (e.g., increasing bias or imprecision in an uncalibrated analyzer), or sudden (e.g., a new instrument being introduced). In cases where sudden shifts are known ahead of time, inputs can be “re-baselined”, or models can be retrained to minimize disruptions. Concept drift occurs when the outcome being predicted or the clinical context in which predictions are made changes over time. For example, a model to predict saline contamination in laboratory specimens may lose PPV if fewer samples are drawn directly from catheters, or if the hospital shifts its utilization towards a different fluid type.
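
As a simple illustration of data-drift surveillance, the sketch below compares a recent window of a single input analyte against a training-era reference distribution using a Kolmogorov-Smirnov test; the alerting threshold is an illustrative assumption, and multivariate or sequential methods may be preferable in practice.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_check(reference: np.ndarray, recent: np.ndarray, alpha: float = 0.01):
    """Flag a feature whose recent distribution differs from its training-era
    reference. The alpha cutoff is an illustrative choice, not a recommendation."""
    statistic, p_value = ks_2samp(reference, recent)
    return {"statistic": float(statistic), "p_value": float(p_value),
            "drift": p_value < alpha}
```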

Detecting and Correcting Performance Degradation

Detecting these drifts and correcting them in a safe and efficient manner requires a robust framework and infrastructure for monitoring real-time performance. An optimal approach requires establishing a closed-loop feedback system that calculates performance metrics in real-time and uses a set of predefined thresholds to trigger an alert or corrective action. However, this requires that a ground truth label be assigned for each prediction, which adds a significant operational burden if no workflow is already in place. An alternative is an approach that parallels laboratory moving averages and patient-based real-time quality control (PBRTQC) (34, 35), where instead of continuously assessing performance metrics such as accuracy, the predicted probabilities or positivity rates (36) are monitored for fluctuations beyond statistical noise.
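
The sketch below illustrates this label-free approach by tracking the moving positivity rate of pipeline predictions against control limits derived from a validation period; the window size and limits are illustrative assumptions rather than recommendations.

```python
import numpy as np

def moving_positivity_rate(predictions: np.ndarray, window: int = 500):
    """Rolling proportion of positive predictions over a fixed window."""
    kernel = np.ones(window) / window
    return np.convolve(predictions.astype(float), kernel, mode="valid")

def out_of_control(rates: np.ndarray, baseline: float, sd: float, k: float = 3.0):
    """Flag windows whose positivity rate falls outside baseline +/- k*SD,
    where the baseline and SD come from the validation period."""
    return np.abs(rates - baseline) > k * sd
```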

Detecting a degradation in performance is only helpful if an appropriate action plan can be implemented to correct it. Ideally, these action plans will have been defined upfront with input from several stakeholders across the entire workflow and will result in minimal disruption in the availability of pipeline predictions. Achieving a working state where updates occur seamlessly requires a robust infrastructure in place to continuously build, update, integrate, and deploy new models before issues arise. This paradigm is known in the software development life cycle as continuous integration and continuous deployment (CI/CD) (37).

One common paradigm for proactively updating models is the “champion–challenger” approach (31), in which the performance metrics from the current model (“champion”) are routinely compared to those from a series of prospective models (“challengers”). These challengers can range from minor pipeline modifications, such as “correcting” the input data so that it resembles the original training data set, to entirely new models retrained on more recent data. It is worth noting that the regulatory framework surrounding the modification and updating of ML pipelines is in a transitional period, with the United States Food and Drug Administration (FDA) issuing a draft guidance document on predetermined change control plans for AI-enabled medical devices (38).
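
A minimal sketch of such a comparison is shown below, scoring both models with the MCC on a recent labeled evaluation window; the promotion margin is an arbitrary safeguard against replacing models on the basis of statistical noise.

```python
from sklearn.metrics import matthews_corrcoef

def select_model(champion, challenger, X_recent, y_recent, margin=0.02):
    """Promote the challenger only if it beats the champion by a set margin."""
    champion_mcc = matthews_corrcoef(y_recent, champion.predict(X_recent))
    challenger_mcc = matthews_corrcoef(y_recent, challenger.predict(X_recent))
    return challenger if challenger_mcc > champion_mcc + margin else champion
```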

The Evolving Regulatory Framework for Laboratory-Developed ML Pipelines

The FDA's recently published final rule regarding laboratory-developed tests (39) adds uncertainty to the future of laboratory-developed models and pipelines. While these pipelines currently sit within a regulatory void, the European Union's Artificial Intelligence Act (40) and the FDA's statements on AI-enabled clinical decision support (CDS) and Software as a Medical Device (SaMD) have provided glimpses into what the future may hold (41). The AI-enabled device and SaMD pathways may represent insurmountable barriers for small, individual laboratories. However, treating these models as AI-enabled clinical decision support tools may allow individual laboratories an opportunity to have their approach evaluated by the FDA without incurring excessive costs. To conform to the definition of a clinical decision support tool (and not a medical device), a pipeline must avoid incorporating raw signaling data from instruments, must synthesize information that is already available to an adequately motivated clinician, and must serve to augment, but not replace, the decision-making process in an explainable manner. It is possible that a subset of ML pipeline applications could avoid medical device regulation through the CDS pathway, but it is likely that many valuable solutions will require submission through more stringent and resource-intensive regulatory pathways. Additionally, many institutions offer resources for facilitating regulatory compliance; these offices should be intimately involved in the planning stages of ML applications to minimize risk.

Regardless of the nuances of regulatory approval pathways, it is crucial that these pipelines be developed, validated, and implemented safely and effectively. This includes adhering to the principles of good machine learning practice (31) and maintaining a competent and resolute internal commitment to responsible ML. The regulatory framework for laboratory-developed ML will continue to evolve, and laboratorians will need to adapt accordingly. Responsible ML and its regulation by government agencies will be an area worthy of close attention as these approaches become more commonplace within laboratory work flows.

Conclusion

While ML applications hold significant promise for improving patient care and streamlining laboratory operations, their full potential remains largely untapped. The decreasing technical barrier for developing ML pipelines presents an opportunity for laboratorians to play a pivotal role in converting theoretical promise into practical utility, but to do so, laboratorians must gather the necessary expertise to make meaningful contributions to the effective and responsible validation, implementation, and monitoring of these solutions.

Supplemental Material

Supplemental Information is permanently archived at Zenodo (https://zenodo.org/doi/10.5281/zenodo.12785723) and available from GitHub (https://nspies13.github.io/) to provide more comprehensive and practical detail.

Nonstandard Abbreviations

ML, machine learning; AI, artificial intelligence; MLOps, machine learning operations; PPV, positive predictive value; NPV, negative predictive value.

Author Contributions

The corresponding author takes full responsibility that all authors on this publication have met the following required criteria of eligibility for authorship: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; (c) final approval of the published article; and (d) agreement to be accountable for all aspects of the article thus ensuring that questions related to the accuracy or integrity of any part of the article are appropriately investigated and resolved. Nobody who qualifies for authorship has been omitted from the list.

Authors’ Disclosures or Potential Conflicts of Interest

Upon manuscript submission, all authors completed the author disclosure form.

Research Funding

None declared.

Disclosures

C.W. Farnsworth has received consulting fees from Abbott, Werfen, and CytoVale, payment or honoraria from Roche, Abbott, Siemens, Beckman Coulter, Cepheid, and Biomerieux, and is the 2024–2025 SYCL Liaison for Clinical Chemistry.

References

1. Hu H, Xu W, Jiang T, Cheng Y, Tao X, Liu W, et al. Expert-level immunofixation electrophoresis image recognition based on explainable and generalizable deep learning. Clin Chem 2023;69:130-9.
2. Farrell C-J, Makuni C, Keenan A, Maeder E, Davies G, Giannoutsos J. A machine learning model for the routine detection of “wrong blood in complete blood count tube” errors. Clin Chem 2023;69:1031-7.
3. Yang HS, Pan W, Wang Y, Zaydman MA, Spies NC, Zhao Z, et al. Generalizability of a machine learning model for improving utilization of parathyroid hormone-related peptide testing across multiple clinical centers. Clin Chem 2023;69:1260-9.
4. Spies NC, Hubler Z, Azimi V, Zhang R, Jackups R, Gronowski AM, et al. Automating the detection of IV fluid contamination using unsupervised machine learning. Clin Chem 2024;70:444-52.
5. Klamrowski MM, Klein R, McCudden C, Green JR, Ramsay T, Rashidi B, et al. Short timeframe prediction of kidney failure among patients with advanced chronic kidney disease. Clin Chem 2023;69:1163-73.
6. Haymond S, Master SR. How can we ensure reproducibility and clinical translation of machine learning applications in laboratory medicine? Clin Chem 2022;68:392-5.
7. Master SR, Badrick TC, Bietenbeck A, Haymond S. Machine learning in laboratory medicine: recommendations of the IFCC working group. Clin Chem 2023;69:690-8.
8. Solomonides AE, Koski E, Atabaki SM, Weinberg S, McGreevey JD, Kannry JL, et al. Defining AMIA's artificial intelligence principles. J Am Med Inform Assoc 2022;29:585-91.
9. Siala H, Wang Y. SHIFTing artificial intelligence to be responsible in healthcare: a systematic review. Soc Sci Med 2022;296:114782.
10. Azimi V, Zaydman MA. Optimizing equity: working towards fair machine learning algorithms in laboratory medicine. J Appl Lab Med 2023;8:113-28.
11. Collins GS, Dhiman P, Ma J, Schlussel MM, Archer L, Van Calster B, et al. Evaluation of clinical prediction models (part 1): from development to external validation. BMJ 2024;384:e074819.
12. Van Der Vegt AH, Scott IA, Dermawan K, Schnetler RJ, Kalke VR, Lane PJ. Implementation frameworks for end-to-end clinical AI: derivation of the SALIENT framework. J Am Med Inform Assoc 2023;30:1503-15.
13. Reddy S, Rogers W, Makinen V-P, Coiera E, Brown P, Wenzel M, et al. Evaluation framework to guide implementation of AI systems into healthcare settings. BMJ Health Care Inform 2021;28:e100444.
14. Chen RJ, Lu MY, Chen TY, Williamson DFK, Mahmood F. Synthetic data in machine learning for medicine and healthcare. Nat Biomed Eng 2021;5:493-7.
15. Pitt JJ. Newborn screening. Clin Biochem Rev 2010;31:57-68.
16. Kaperak C, Eller D, Devlin SA, Hall A, Schmitt J, Friedman EE, et al. Reflex human immunodeficiency virus (HIV) type 1 RNA testing enables timely differentiation of false-positive results from acute HIV infection. Open Forum Infect Dis 2024;11:ofad629.
17. Elkhadrawi M, Stevens BA, Wheeler BJ, Akcakaya M, Wheeler S. Machine learning classification of false-positive human immunodeficiency virus screening results. J Pathol Inform 2021;12:46.
18. Chicco D, Tötsch N, Jurman G. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Min 2021;14:13.
19. Chicco D, Jurman G. The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification. BioData Min 2023;16:4.
20. Youden WJ. Index for rating diagnostic tests. Cancer 1950;3:32-5.
21. Elkan C. The foundations of cost-sensitive learning. In: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence; 2001; Seattle (WA): Association for the Advancement of Artificial Intelligence. p. 973-8.
22. Morgan DJ, Pineles L, Owczarzak J, Magder L, Scherer L, Brown JP, et al. Accuracy of practitioner estimates of probability of diagnosis before and after testing. JAMA Intern Med 2021;181:747-55.
23. Authors/Task Force Members; Dickstein K, Cohen-Solal A, Filippatos G, McMurray JJV, Ponikowski P, et al. ESC Guidelines for the diagnosis and treatment of acute and chronic heart failure 2008: the task force for the diagnosis and treatment of acute and chronic heart failure 2008 of the European Society of Cardiology. Developed in collaboration with the Heart Failure Association of the ESC (HFA) and endorsed by the European Society of Intensive Care Medicine (ESICM). Eur Heart J 2008;29:2388-442.
24. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019;366:447-53.
25. Jackups R Jr. FDA regulation of laboratory clinical decision support software: is it a medical device? Clin Chem 2023;69:327-9.
26. Korot E, Pontikos N, Liu X, Wagner SK, Faes L, Huemer J, et al. Predicting sex from retinal fundus photographs using automated deep learning. Sci Rep 2021;11:10286.
27. Jabbour S, Fouhey D, Shepard S, Valley TS, Kazerooni EA, Banovic N, et al. Measuring the impact of AI in the diagnosis of hospitalized patients: a randomized clinical vignette survey study. JAMA 2023;330:2275-84.
28. Farrell C-JL. Decision support or autonomous artificial intelligence? The case of wrong blood in tube errors. Clin Chem Lab Med 2022;60:1993-7.
29. Marcinkevičs R, Vogt JE. Interpretable and explainable machine learning: a methods-centric overview with concrete examples. WIREs Data Min Knowl Discov 2023;13:e1493.
30. Cabanillas C, Resinas M, Ruiz-Cortés A. A template-based approach for responsibility management in executable business processes. Enterp Inf Syst 2018;12:550-86.
31. Gift N, Deza A. Practical MLOps: operationalizing machine learning models. 1st Ed. Sebastopol (CA): O'Reilly Media Inc; 2021.
32. Harris S, Bonnici T, Keen T, Lilaonitkul W, White MJ, Swanepoel N. Clinical deployment environments: five pillars of translational machine learning for health. Front Digit Health 2022;4:939292.
33. Bartels R, Dudink J, Haitjema S, Oberski D, van 't Veen A. A perspective on a quality management system for AI/ML-based clinical decision support in hospital care. Front Digit Health 2022;4:942588.
34. Badrick T, Bietenbeck A, Cervinski MA, Katayev A, van Rossum HH, Loh TP. Patient-based real-time quality control: review and recommendations. Clin Chem 2019;65:962-71.
35. van Rossum HH. Moving average quality control: principles, practical application and future perspectives. Clin Chem Lab Med 2019;57:773-82.
36. Liu J, Tan CH, Badrick T, Loh TP. Moving sum of number of positive patient result as a quality control tool. Clin Chem Lab Med 2017;55:1709-14.
37. Vadapalli S. DevOps: continuous delivery, integration, and deployment with DevOps: dive into the core DevOps strategies. Birmingham (UK): Packt Publishing; 2018.
38. Center for Devices and Radiological Health. Marketing submission recommendations for a predetermined change control plan for artificial intelligence/machine learning (AI/ML)-enabled device software functions. Silver Spring (MD): U.S. Food and Drug Administration; 2023.
39. Food and Drug Administration, HHS. Medical devices; laboratory developed tests. 21 CFR 809. Silver Spring (MD): U.S. Food and Drug Administration; 2024.
40. European Parliament. P9_TA(2024)0138: Artificial Intelligence Act. https://www.europarl.europa.eu/doceo/document/TA-9-2024-0138_EN.pdf (Accessed August 2024).
41. Food and Drug Administration, HHS. Medical device classification regulations to conform to medical software provisions in the 21st Century Cures Act. 86 FR 20278. Silver Spring (MD): U.S. Food and Drug Administration; 2021.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.