-
PDF
- Split View
-
Views
-
Cite
Cite
Paul D Simonson, Yue Wu, David Wu, Jonathan R Fromm, Aaron Y Lee, De Novo Identification and Visualization of Important Cell Populations for Classic Hodgkin Lymphoma Using Flow Cytometry and Machine Learning, American Journal of Clinical Pathology, Volume 156, Issue 6, December 2021, Pages 1092–1102, https://doi.org/10.1093/ajcp/aqab076
- Share Icon Share
Abstract
Automated classification of flow cytometry data has the potential to reduce errors and accelerate flow cytometry interpretation. We desired a machine learning approach that is accurate, is intuitively easy to understand, and highlights the cells that are most important in the algorithm’s prediction for a given case.
We developed an ensemble of convolutional neural networks for classification and visualization of impactful cell populations in detecting classic Hodgkin lymphoma using two-dimensional (2D) histograms. Data from 977 and 245 clinical flow cytometry cases were used for training and testing, respectively. Seventy-eight nongated 2D histograms were created per flow cytometry file. Shapley additive explanation (SHAP) values were calculated to determine the most impactful 2D histograms and regions within histograms. SHAP values from all 78 histograms were then projected back to the original cell data for gating and visualization using standard flow cytometry software.
The algorithm achieved 67.7% recall (sensitivity), 82.4% precision, and 0.92 area under the receiver operating characteristic. Visualization of the important cell populations for individual predictions demonstrated correlations with known biology.
The method presented enables model explainability while highlighting important cell populations in individual flow cytometry specimens, with potential applications in both diagnosis and discovery of previously overlooked key cell populations.
• A new machine learning algorithm for flow cytometry interpretation is presented that focuses on algorithm explainability and visualization.
• Using new explainability tools for machine learning (ie, Shapley additive explanation values), we estimate the impact of individual cells on diagnosis prediction for individual cases and highlight the relevant cells.
• The presented methods make it possible for pathologists to better understand how flow cytometry algorithms perform in general and for individual cases, with visualization using standard flow software.
Classic Hodgkin lymphoma (cHL) is a B-cell lymphoma composed of neoplastic Hodgkin cells and multinucleated Reed-Sternberg cells in a background of nonneoplastic, reactive immune cells. The disease often affects younger individuals and commonly demonstrates mediastinal involvement.1,2 Given the sensitive location of involvement and high frequency of nonneoplastic causes of lymphadenopathy, minimally invasive, small-needle core biopsies and fine-needle aspirations are often the first biopsies performed. The small biopsy sizes and relatively sparse distribution of neoplastic cells reduce the sensitivity for detection of cHL by morphology.3,4 In addition to the unique diagnostic information the flow cytometry assay provides when the morphologic diagnosis is unclear, it is relatively inexpensive to perform and can be completed in a matter of hours, which can be particularly helpful in emergency situations such as superior vena cava syndrome. Fortunately, increased diagnostic sensitivity can be achieved by adding flow cytometry to the armamentarium, and flow cytometry methods for the diagnosis of cHL have been demonstrated in the clinical laboratory environment.5-7
Despite this, flow cytometry for cHL is not widely available, possibly due to various technical factors and the lack of test interpretation experience.8 The relative infrequency of positive samples and lack of interpretation experience make validation of the assay challenging for many laboratories. Application of machine learning to flow cytometry data has the potential to reduce data interpretation subjectivity and increase accuracy in the interpretation of cHL flow cytometry data.9,10 In a prior report,8 random forest and support vector machine classifiers were used to classify cHL flow cytometry data. Various approaches to explain the inner workings of the models were attempted, including random forest tree analysis, principal component analysis, and accuracy comparisons when data for various antibodies were excluded. These attempts at model explanation suggested that cell populations other than Hodgkin cells (including CD5-positive cells) were important in making the classification prediction. However, details regarding predictive cell populations were lacking. Furthermore, the approaches were incapable of determining which cell populations were most impactful for making predictions for individual flow cytometry cases. Visualization of impactful cell populations for individual cases was therefore also impossible.
Recent work in machine learning explainability has resulted in new tools for understanding the general features that are important for a model, as well as tools for identifying the impact of individual data elements on individual predictions.10-14 One such tool, the calculation of Shapley additive explanation (SHAP) values, allows the identification of impactful data features on predictions, both generally and for individual predictions, by estimating the additive impact of each input data point in arriving at a classification prediction.11,13
Building on prior work, we sought to leverage the spatial information in histograms and build a machine learning framework with the explicit goal of interpretability and identification of important classes of cells from flow cytometry data.
Materials and Methods
Patient Samples
In total, 1,222 samples were analyzed using a nine-color cHL-specific flow cytometry panel and used in accordance with the university’s institutional review board (IRB) approval. See Table 1 for sample sex and age distributions. Given that the retrospective use of the data is for clinical laboratory test quality and operations improvement and there is minimal risk for patient harm, patient written consent was deemed unnecessary and therefore waived by the IRB of the university. All methods were carried out in accordance with relevant guidelines, regulations, and approval by the IRB of the university. Cases from 2010 to 2019 were used, and there was no significant alteration to the antibody panel or protocol during that time frame. Cases were identified by searching for available raw flow cytometry data files (FCS files) that corresponded to the antibody panel of interest (“Hodgkin panel”). To maximize generalizability, no upfront exclusion criteria were applied. Corresponding interpretations of data in clinical reports were reviewed to create annotations to train a binary classifier. Cases that were reported as “consistent with,” “highly suggestive of,” or other similar language for cHL were annotated as positive (n = 149). Cases that were interpreted as “suspicious for” cHL or similar language were also annotated as positive (n = 175), although a further review of the medical record was performed for these cases to determine the predictive accuracy of the interpretation. Seventy cases were thus reviewed using concurrent/follow-up histology-based diagnostic results. In three cases, the result was negative for cHL, and those three cases were reannotated as negative for training purposes (leaving n = 172). A small number of cases were reported as consistent with the Hodgkin component of gray zone lymphoma; these were also annotated as positive for the purposes of our classifier. Total positive cases were then 321, and total negative cases were 901. Of the negative cases, 241 cases were interpreted as consistent with or suspicious for a neoplasm other than cHL, and 123 of the negative cases had some language included that suggested there was something suboptimal regarding the submitted sample (few cells, degradation, etc); these cases were not excluded from the training or test data sets so as to improve generalizability. Total training cases were 977 (259 positive, 718 negative), and testing cases were 245 (62 positive, 183 negative). Training cases were further subdivided into training set 1 (654 total, 168 positive, 486 negative) and training set 2 (323, 91 positive, 232 negative) Figure 1.
Sample Characteristics for All Samples, Positive Samples, and Negative Samples
Characteristic . | Value . |
---|---|
All samples (n = 1,222) | |
Sex, No. (%) | |
Male | 671 (55) |
Female | 528 (43) |
Unknown | 23 (2) |
Age, y | |
Mean, median, SD | 46, 44, 19 |
Range | 0 to 98 |
Age unknown | 1 sample |
Positive samples (n = 321) | |
Sex, No. (%) | |
Male | 175 (55) |
Female | 135 (42) |
Unknown | 11 (3) |
Age, y | |
Mean, median, SD | 41, 37, 18 |
Range | 11 to 93 |
Age unknown | No samples |
Negative samples (n = 901) | |
Sex, No. (%) | |
Male | 496 (55) |
Female | 393 (44) |
Unknown | 12 (1) |
Age, y | |
Mean, median, SD | 48, 47, 19 |
Range | 0 to 98 |
Age unknown | 1 sample |
Characteristic . | Value . |
---|---|
All samples (n = 1,222) | |
Sex, No. (%) | |
Male | 671 (55) |
Female | 528 (43) |
Unknown | 23 (2) |
Age, y | |
Mean, median, SD | 46, 44, 19 |
Range | 0 to 98 |
Age unknown | 1 sample |
Positive samples (n = 321) | |
Sex, No. (%) | |
Male | 175 (55) |
Female | 135 (42) |
Unknown | 11 (3) |
Age, y | |
Mean, median, SD | 41, 37, 18 |
Range | 11 to 93 |
Age unknown | No samples |
Negative samples (n = 901) | |
Sex, No. (%) | |
Male | 496 (55) |
Female | 393 (44) |
Unknown | 12 (1) |
Age, y | |
Mean, median, SD | 48, 47, 19 |
Range | 0 to 98 |
Age unknown | 1 sample |
Sample Characteristics for All Samples, Positive Samples, and Negative Samples
Characteristic . | Value . |
---|---|
All samples (n = 1,222) | |
Sex, No. (%) | |
Male | 671 (55) |
Female | 528 (43) |
Unknown | 23 (2) |
Age, y | |
Mean, median, SD | 46, 44, 19 |
Range | 0 to 98 |
Age unknown | 1 sample |
Positive samples (n = 321) | |
Sex, No. (%) | |
Male | 175 (55) |
Female | 135 (42) |
Unknown | 11 (3) |
Age, y | |
Mean, median, SD | 41, 37, 18 |
Range | 11 to 93 |
Age unknown | No samples |
Negative samples (n = 901) | |
Sex, No. (%) | |
Male | 496 (55) |
Female | 393 (44) |
Unknown | 12 (1) |
Age, y | |
Mean, median, SD | 48, 47, 19 |
Range | 0 to 98 |
Age unknown | 1 sample |
Characteristic . | Value . |
---|---|
All samples (n = 1,222) | |
Sex, No. (%) | |
Male | 671 (55) |
Female | 528 (43) |
Unknown | 23 (2) |
Age, y | |
Mean, median, SD | 46, 44, 19 |
Range | 0 to 98 |
Age unknown | 1 sample |
Positive samples (n = 321) | |
Sex, No. (%) | |
Male | 175 (55) |
Female | 135 (42) |
Unknown | 11 (3) |
Age, y | |
Mean, median, SD | 41, 37, 18 |
Range | 11 to 93 |
Age unknown | No samples |
Negative samples (n = 901) | |
Sex, No. (%) | |
Male | 496 (55) |
Female | 393 (44) |
Unknown | 12 (1) |
Age, y | |
Mean, median, SD | 48, 47, 19 |
Range | 0 to 98 |
Age unknown | 1 sample |

Identification of cases to be labeled as positive or negative. All identified cases from 2010 to 2019 with corresponding flow cytometry reports are used. cHL, classic Hodgkin lymphoma.
Flow Cytometry and Manual Classification
Using a method previously described by Fromm et al,5,6 cases were evaluated on a modified four-laser, 10-color Becton Dickinson LSRII flow cytometer using the following laser-fluorochrome combinations: (1) a 405-nm violet laser (one color) exciting Pacific blue (PB); (2) a 488-nm blue laser exciting fluorescein isothiocyanate (FITC), phycoerythrin (PE), PE-Texas red (ECD/PE-TR), PE-Cy5.5, and PE-Cy7; and (3) a 633-nm red laser (three colors) exciting allophycocyanin (APC), APC–Alexa Fluor 700 (APC-A700), and APC-Cy7. The specific fluorescently labeled antibodies consisted of CD95-PB, CD64-FITC, CD30-PE, CD5-ECD, CD40-PECy5.5, CD20-PECy7, CD15-APC, CD71-APC-A700, and CD45-APC-Cy7, which, in addition to four light-scatter properties (forward and side scatter, area, and height), generated a total of 13 dimensions of data. Typically, 0.5 to 1 million events were collected per case, although cases with as few as 20,000 cells were also seen, and sequential gating was used to select for discrete populations of Hodgkin/Reed-Sternberg (HRS) cells as previously described.5,6
Construction of Two-Dimensional Histograms
Software was written using Python 3.715 along with external Python modules that included fcsparser,16 sklearn,17 tensorflow18 (version 2.0, with included keras19 framework), matplotlib,20 seaborn,21 numpy,22 pandas,23 and shap.12,13 Flow cytometry FCS files were imported via fcsparser and compensated for fluorescence channel crosstalk. All values, except for forward scatter values, that were less than 1.0 were then converted to 1.0, and the logarithm base 10 was applied. These values were then divided by 12.5, which achieved approximate normalization. Forward scatter values were rescaled on a linear scale with modified offset and approximate normalization to make better use of the range [0,1]. Two-dimensional (2D) histograms were then calculated for each possible pairwise combination of parameters (9 fluorescence parameters and 4 scattering parameters, making 13 total parameters, with 78 possible pairwise combinations). The 2D histograms were modified by adding 1.0 to each value, then applying a logarithm (base 10) function. Each histogram was then scaled by its maximum bin value to normalize each histogram individually. The histogram binning resolution was 50 × 50 on the interval [0, 1]. The data were divided into training and test data sets (80:20 ratio).
EnsembleCNN Classifier
A convolutional neural network (CNN) classifier was created to correspond to each previously constructed 2D histogram (see above); 78 were created in total. Each CNN classifier was constructed as sequential layer models using the keras framework in tensorflow 2.0 Figure 2. The prediction probabilities of cHL positivity for each CNN were passed as an ensemble of parameters to a random forest classifier, which gave the final prediction for each flow cytometry case (Figure 2). Training set 1 (see Patient Samples above) was used to train the CNNs, and training set 2 was used for training the ensemble random forest classifier. To achieve approximately class-balanced data, the positive cases in training set 1 were duplicated twice (ie, 168 positive cases were duplicated twice to make a set of 504 positive cases, combined with the 486 negative cases, and shuffled to result in 990 cases to be used for training in each epoch). The CNNs were trained by using batch sizes of 200 from training set 1 for 200 epochs, with L2 regularization and parameter dropout layers to avoid overfitting. Evaluation for overfitting by CNNs was performed by using training set 2 as a validation set, and plots of training and validation data loss and accuracy vs epoch were inspected for overfitting.

Classification model architecture. A, Overview of the convolutional neural network (CNN) ensemble classifier. A flow cytometry data file is used to construct two-dimensional histograms, which are then passed as inputs to an ensemble of corresponding CNNs. The outputs are passed to an integrating random forest classifier, which produces a final prediction score. B, Architecture used for individual CNNs. Colored arrows indicate operations performed in the various layers.
The trained CNNs were applied to training set 2 to produce the inputs for training the random forest classifier. The ensemble random forest classifier was implemented using sklearn’s RandomForestClassifier class with n_estimators = 1,000, max_leaf_nodes = 32, and balanced class weighting.
Evaluation Metrics, Including SHAP Values
A confusion matrix was generated for the test predictions and predicted classes using sklearn.24 R,25 RStudio,26 and the precrec library27 were used to plot the receiver operating characteristic (ROC) curve and precision-recall curve seen in Figure 3. Confidence intervals were generated by pooling the training and test set data, then performing fivefold cross-validation, with the resulting predictions used to estimate the 95% confidence intervals using the precrec library.

Receiver operating characteristic (ROC) curves and precision-recall curves for the CNN ensemble classifier. A, ROC curve, constructed using class predictions for reserved test data. Area under the curve = 0.92. B, Precision-recall curve. Shaded areas represent estimated 95% confidence intervals.
SHAP values were calculated using the shap Python module13 to estimate the impact of CNN outputs on the final prediction produced by the random forest classifier. SHAP values were also calculated for individual bins within the 2D histograms. In addition, the SHAP values from the 2D histograms were projected back to the individual cells in the original FCS files by summing the SHAP values from every bin within which a particular cell of the given FCS file was found. Three tallies for each cell were calculated: total SHAP value, positive SHAP values (only add the bin’s SHAP value if it is positive), and negative SHAP values (only add the value if it is negative). A new FCS file was then created by adding the tallies as new parameters for each cell. The new FCS file was then used for gating and visualization purposes in clinical flow cytometry analysis software (WoodList 3.1.3, written by Brent L. Wood). See Figure 4 for example output.

Projection of two-dimensional (2D) histogram Shapley additive explanation (SHAP) values from all 78 2D histograms back to individual cells, plotted using clinical flow cytometry software. The highlighted cells indicate the cell populations that were most impactful in determining the convolutional neural network predictions. It is important to note that while some cells are highlighted as being most impactful, the classifier does use all of the cells in its prediction. A, B, Gating of cells by summed SHAP values. Cells highlighted in blue had increased summed positive SHAP values, cells in magenta had very increased summed positive SHAP values, and cells in orange had increased summed negative SHAP values. C, D, Projection of cells along forward-scatter and side-scatter axes, with (C) and without (D) superimposed negatively impacting cells highlighted. The cells along the left edge of the plot are consistent with nonviable cells. E, F, Gated viable cells, with (E) and without (F) negatively impacting cells highlighted. Most of the clusters of impactful cells are consistent with subpopulations of lymphocytes. Data are from an example case with predicted cHL+ probability more than 0.95. cHL, classic Hodgkin lymphoma.
Results
Our machine learning framework is composed of an ensemble of CNNs that each make independent predictions based on 2D histograms as inputs. The predictions are passed to a random forest classifier for final prediction. The ensemble of CNN classifiers (EnsembleCNN) approach is analogous to manual diagnostic approaches already used by hematopathologists and others and is therefore intuitive to understand. We use SHAP values to highlight the impactful 2D histograms and regions within 2D histograms.
EnsembleCNN Classifier Results
Training of the CNNs (Figure 2 and Supplementary Figure 1; all supplemental materials can be found at American Journal of Clinical Pathology online) demonstrated that the majority achieved more than 70% validation accuracy. Individual CNN predictions also generally correlated among themselves for a given flow cytometry specimen, as is well demonstrated in the hierarchical clustering dendrogram plot shown in Figure 5. These findings suggest that detectable, predictive information is in fact evident to the CNNs in most of the nongated 2D histograms (see below). The EnsembleCNN classifier resulted in a classification accuracy of 88.2%, a precision of 82.4%, a recall (sensitivity) of 67.7%, and an F1 score of 74.3%. (The full confusion matrix is available in Supplementary Table 1.) The area under the ROC curve was 0.92 (Figure 3). Subsequent fivefold cross-validation demonstrated similar results, with resulting generation of 95% confidence intervals in the ROC and precision-recall curves plotted in Figure 5.
![Hierarchical dendrogram of class predictions produced by convolutional neural networks (CNNs) applied to the test data set (245 cases). Rows correspond to individual flow cytometry cases, and each has 78 class predictions, one for each of the CNNs. Row labels (left side of plot) correspond to data labels (blue = positive for classic Hodgkin lymphoma [cHL], red = negative). The plot demonstrates that for a single row, the predictions of the various CNNs tend to correlate among themselves and also with the annotated data labels.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/ajcp/156/6/10.1093_ajcp_aqab076/2/m_aqab076_fig5.jpeg?Expires=1750074032&Signature=zTsb3Z0kqARXJdEc7HHrX4XnHceHaFN-PKqfD52n1Ckvvhj2WELCqcCz4q6x9MzzXL5IOnS5pKfE97vr3CF7-V6jcL~jfrD4KO0lfuvt1EjddH2WpmYhxpVdy5GnKJ8h-X7w-P7UzfSlTvDVOqnOJge6W0Jwvkl3p5INjcgG2Uqb0fjSX1xyND0nXWZgsbm~~E6akCuIjSdIS2HSuX15K4RUw-e7y0DCMySXFTHSKByeq9XPtqMcfyHsuMkyrcyE36oYHW2tShMksv2ywr1ZnqF27H4vT7jLpz7XaPbgTsC43y6O-uzrSAi-8A4Ftx~wFU7OBfVCg5Dyz4fwRL3omQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Hierarchical dendrogram of class predictions produced by convolutional neural networks (CNNs) applied to the test data set (245 cases). Rows correspond to individual flow cytometry cases, and each has 78 class predictions, one for each of the CNNs. Row labels (left side of plot) correspond to data labels (blue = positive for classic Hodgkin lymphoma [cHL], red = negative). The plot demonstrates that for a single row, the predictions of the various CNNs tend to correlate among themselves and also with the annotated data labels.
SHAP Values Identify the Most Impactful 2D Histograms
Given the modular framework of our EnsembleCNN classifier, we can calculate SHAP values for the various components to understand the contributions/impact of each part. Given that predictions are calculated for each 2D histogram and passed to an integrating random forest classifier (Figure 2), we calculated SHAP values for each prediction passed to the random forest classifier, with the purpose of identifying the most impactful 2D histograms. The following five were found to be the most impactful: CD40/CD15, CD20/CD71, SSC-H/CD20, CD5/CD15, and SSC-A/CD20. The SHAP values for the CNN outputs for the top 20 histograms are shown in Figure 6. The SHAP values not only allow the estimation of the overall importance of the different 2D histograms but also yield additional insights into the behavior of the classifier. For example, the FSC-A (forward scatter) vs CD30 histograms demonstrated little impact on the random forest prediction scores for low CNN prediction values. However, when the CNN predictions were high, there was a much larger impact on the overall prediction score. Two-dimensional dendrogram plots (Figure 5 and Supplementary Figures 2 and 3) also highlight patterns of correspondence between data classification, individual histogram prediction scores, and SHAP values.
![Shapley additive explanation (SHAP) values for top 20 most impactful features for the integrating random forest classifier. Each feature is a prediction probability score (range = [0, 1]), produced by the convolutional neural networks (CNNs) for the corresponding two-dimensional (2D) histograms using test set data. Dots correspond to individual flow cytometry cases in the test data set. Of interest is the strong positive impact on final prediction that the FSC-A vs CD30-PE prediction score has when it has a high value but little negative impact when it has a low value for the given cases. See also Supplementary Figure 2, which presents the SHAP values for the various 2D histograms as a 2D hierarchical dendrogram.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/ajcp/156/6/10.1093_ajcp_aqab076/2/m_aqab076_fig6.jpeg?Expires=1750074032&Signature=iaVRt8xdvF8UmoMrl9n3YNl8eS8mBulCc7rnJSx3k2CVgcntWrBdgdseTbEGriEb-AHfVkg4x6yBu-TMa~l7GK~Nc39qWSmPCmPhbKlFdWoGIv3L7fmiyQMKnPmJ1nW1gL8CkD7YUUCkQkA6B6~OPkPHICSeNsfzp8Bx1yLZ9cfqnEzfruuoLLe5r2tnulxPXabtd9WANpMOlpjekykhhnRCoiSTpUvQ5byKI~dRGPj-RkKHyNVWj2p~CU0nvy2NUgARWjotpbDA7s8oITVloNRdoUnt11E8uLlW7BjOrRvdbQykiPgyY36-hIoa06R2p0TZb0Fe2nvj1A7r5YK8fA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Shapley additive explanation (SHAP) values for top 20 most impactful features for the integrating random forest classifier. Each feature is a prediction probability score (range = [0, 1]), produced by the convolutional neural networks (CNNs) for the corresponding two-dimensional (2D) histograms using test set data. Dots correspond to individual flow cytometry cases in the test data set. Of interest is the strong positive impact on final prediction that the FSC-A vs CD30-PE prediction score has when it has a high value but little negative impact when it has a low value for the given cases. See also Supplementary Figure 2, which presents the SHAP values for the various 2D histograms as a 2D hierarchical dendrogram.
SHAP Values Highlight the Most Impactful Bins Within Individual Histograms
SHAP values were also calculated for individual 2D histogram bins to identify the regions of most impact within the 2D histograms. Example SHAP value plots for 2D histograms are shown in Figure 7 and Figure 8. In Figure 7, the SHAP values are averaged over many predictions to highlight the recurrently most important parts of the 2D histograms. In Figure 8, we plot the SHAP values specific to individual flow cytometry cases, which can be used to explain the models’ predictions on a case-by-case basis. The plotted SHAP values can be used to correlate directly with 2D histograms that pathologists and researchers are already familiar with, making for better model explainability. Most of the inspected 2D SHAP plots highlight multiple impactful regions, suggesting that in fact, the presence, absence, or altered ratios of multiple cell populations play a role in the classification of cHL for these CNNs.

Examples of averaged (n = 100) absolute values of Shapley additive explanation (SHAP) values for two-dimensional (2D) histograms. Axes are numbered in terms of 2D histogram bin numbers. Left panels, cHL–; center panels, cHL+; and right panels, average (SHAP values). A, CD40 vs CD15. B, CD20 vs CD71. C, SSC-H (side scatter) vs CD20. cHL, classic Hodgkin lymphoma.

Example Shapley additive explanation (SHAP) value plots for individual flow cytometry cases. Images on the left are the two-dimensional histograms for CD40 (x-axes) vs CD15 (y-axes). Images on the right show the corresponding SHAP values in red and blue, with red more predictive of Hodgkin lymphoma. A light-gray background that depicts the original histogram is also included in right panels for context in the images on the right. Examples were chosen from cases with predicted cHL+ probability more than 0.95. cHL, classic Hodgkin lymphoma.
Visualizing Impact of Single Cells on Single Predictions
Using the SHAP values from each histogram bin from all 78 2D histograms, we were able to integrate and project the values back to the individual cells in the original raw data files. New FCS28 files were created with summed SHAP values for each individual cell. These files could then be opened and analyzed using standard flow cytometry analysis software. This results in the ability to define in high detail, in combination with all of the available flow cytometry parameters in the original data and available software tools, the cells of most impact. The example shown in Figure 4 demonstrates the result, with the most impactful cells highlighted using gating in our usual clinical flow cytometry software. The integration of SHAP values from all 78 2D histograms resulted in sufficient resolution to highlight what appears most consistent with subpopulations of lymphocytes, as demonstrated in the side-scatter vs CD45 plot (Figures 4E and 4F). Evaluation of CD5 vs CD20 demonstrates the most impactful lymphocytes are primarily T cells. The impactful cells also include many that would normally be classified as nonviable cells and excluded from evaluation in the manual gating process; it is thus likely that the ratio of nonviable to viable cells plays a significant role in the model predictions. Approximately 20 Hodgkin cells were identified through our normal manual gating process, yet none of the Hodgkin cells were found to be as impactful in the classification as the lymphocyte subgroups and nonviable cells, per our summed SHAP values analysis. The evaluation of T-cell subpopulations is in fact a part of our normal manual gating analysis, notably with evaluation for overexpression of CD729; however, the necessary antibodies (CD7, in particular) are part of a separately run antibody panel. It is conceivable that these impactful features might eventually be learned and used by pathologists (eg, via nonviable cells and lymphocyte ratio scores). It is interesting to note that the SHAP summation approach presented here highlights lymphocyte subpopulations as being important, despite the lack of some antibodies that are more specific to lymphocytes and antibody choices primarily chosen to identify Hodgkin cells.
Discussion
We present a deep learning, ensemble-based approach to detect cHL using flow cytometry data. Currently, pathologists visually interpret 2D histograms, followed by various gating strategies to highlight cells of interest. We applied visual machine learning techniques—namely, CNNs—to simulate to some degree the process used by pathologists. CNNs were also chosen because they might better preserve local context with possible improvement in results compared with prior techniques.8 Our deep learning approach does perform well in comparison with prior results of random forest and support vector machine approaches.8 Despite our approach being agnostic to the important cell populations (namely, we use nongated 2D histograms), the algorithm is able to learn useful features, and some of the most impactful 2D histograms correlate with clinical experience in using the Hodgkin lymphoma flow cytometry assay. Perhaps of more interest, however, is the fact that the model is intuitively easy for pathologists, researchers, and, potentially, regulators to understand since its architecture and outputs mirror those of the standard manual method. Other flow cytometry machine learning techniques often include some sort of principal component analysis or clustering followed by a classifier.29-31 They are very successful,32 but identification of cells of interest in principal components space and other representations is not intuitive for most pathologists. In our EnsembleCNN approach, the architecture is amenable to explainability algorithms and impactfulness measures, including SHAP values.12,13 Calculation of SHAP values for bins within 2D histograms is potentially more enlightening to hematopathologists since the graphical output can be compared directly with histograms that hematopathologists and researchers are already accustomed to interpreting.
Perhaps of most interest, using SHAP values and a novel summation approach that integrates SHAP values from all of the component 2D histograms, we are able to visually highlight cell populations that are most impactful in making the classification predictions, as an average for many cases and for individual cases. We can visualize the thus highlighted cells using flow cytometry software (Figure 4). Doing so allows us to directly visualize which cell populations are most impactful in the EnsembleCNN classifier’s individual predictions. We can directly compare those cells with cell populations identified using standard manual gating approaches and hematopathologist interpretations. In the example presented in Figure 4, which was strongly predicted to be positive for cHL by the EnsembleCNN classifier (with concordant annotation label), the populations of most impact were nonviable cells and T lymphocytes. As increased numbers of apoptotic cells and altered T-cell populations are expected to be found in cHL, the findings are consistent with the currently understood biology.33-36 Given the overall agnostic approach of the combined CNN ensemble and SHAP value summation methods, this approach might be useful in the discovery of previously unrecognized cell subpopulations in flow cytometry applications, in cHL or otherwise.
Of note, in the example case (Figure 4), none of the ~20 Hodgkin cells identified by manual gating (of 53,591 total flow cytometer events) were identified as very impactful by summed SHAP scores, suggesting that the approach is insensitive to very small cell populations (although larger nonviable and lymphocyte populations were impactful). Use of 2D histograms of gated subpopulations (eg, viable cells, CD30+ cells) is expected to increase model accuracy and detection of these small populations as the signal-to-background ratio is increased. Due to the ensemble architecture, it is simple to add additional features, including gated histograms, to the classifier. The features used for prediction in this report are 2D histograms derived from a single antibody panel (targeted for cHL identification); however, in regular clinical practice, multiple antibody panels are applied to the same specimen. These additional data could easily be used to train additional CNNs that could be added to the ensemble, since each CNN in the ensemble is trained independently. Additional features might also include 3D histograms and other higher-dimensional data, as well as FlowSOM31 and UMAP29 data representations. Insights provided by SHAP values could also guide selection of new features/gated histograms for inclusion.
Limitations of the approach include the training time required to train all CNNs within the EnsembleCNN classifier, which is substantially more than that required for support vector machine and random forest approaches. Since the training time is a one-time cost, the impact on daily use is much less. In the method’s current implementation, the 2D histograms contain data for all cells rather than gated subpopulations; inclusion of gated populations will allow more direct comparison with regular clinical flow cytometry use and identification of known important cell populations with small numbers present (eg, rare Hodgkin cells). Introduction of gated plots that contain only data for specific subpopulations may require modification of our SHAP values summation approach since the gated plots will cause overrepresentation of those cells in comparison to the rest of the data, for de novo cell population discovery. Our SHAP values summation approach, as is, can only be applied to cells that are present within the same antibody panel. Overcoming these obstacles is the object of further work.
In summary, the EnsembleCNN classifier presented here is intuitively easy for hematopathologists to understand, can easily be extended to incorporate additional data features (eg, gated histograms and additional antibody panels), and is amenable to explainability algorithms and metrics, including SHAP values. Our method for projecting SHAP values back to the original single-cell data allows direct evaluation by, and comparison with, standard flow cytometry analysis. This provides further understanding of the important cell populations and how the model makes a prediction. Use of the approach can potentially save pathologists time by prepopulating clinical reports and prehighlighting the cells of most interest for pathologist review. This may also be useful in discovery of previously unrecognized important cell populations or other cell population alterations in disease pathogenesis.
Acknowledgments
We thank the University of Washington hematopathology staff for technical help and support of this work and Patrick Mathias (University of Washington) for review of the manuscript prior to submission.
This work was supported by the National Eye Institute (K23EY029246; A.Y.L.), an unrestricted grant from Research to Prevent Blindness (A.Y.L.), the University of Washington Department of Laboratory Medicine and Pathology, a Roger Moe Fellowship (P.D.S.), and the donation of a GPU by NVIDIA Corporation (P.D.S.).
References
Author notes
Senior authors.