Robert M Bossarte, Chris J Kennedy, Alex Luedtke, Matthew K Nock, Jordan W Smoller, Cara Stokes, Ronald C Kessler, Invited Commentary: New Directions in Machine Learning Analyses of Administrative Data to Prevent Suicide-Related Behaviors, American Journal of Epidemiology, Volume 190, Issue 12, December 2021, Pages 2528–2533, https://doi.org/10.1093/aje/kwab111
Abstract
This issue contains a thoughtful report by Gradus et al. (Am J Epidemiol. 2021;190(12):2517–2527) on a machine learning analysis of administrative variables to predict suicide attempts over 2 decades throughout Denmark. This is one of numerous recent studies that document strong concentration of risk of suicide-related behaviors among patients with high scores on machine learning models. The clear exposition of Gradus et al. provides an opportunity to review major challenges in developing, interpreting, and using such models: defining appropriate controls and time horizons, selecting comprehensive predictors, dealing with imbalanced outcomes, choosing classifiers, tuning hyperparameters, evaluating predictor variable importance, and evaluating operating characteristics. We close by calling for machine-learning research into suicide-related behaviors to move beyond merely demonstrating significant prediction—this is by now well-established—and to focus instead on using such models to target specific preventive interventions and to develop individualized treatment rules that can be used to help guide clinical decisions to address the growing problems of suicide attempts, suicide deaths, and other injuries and deaths in the same spectrum.
Abbreviations
- ITR: individualized treatment rules
- ML: machine learning
- RF: random forest
- SA: suicide attempt
- SRB: suicide-related behavior
Editor’s note: The opinions expressed in this article are those of the authors and do not necessarily reflect the views of the American Journal of Epidemiology.
Suicide remains a devastating and persistent global public health challenge. While advances in biomedicine have led to falling rates of many other leading causes of death, little progress has been made in reducing suicides. Indeed, the suicide rate has increased in many parts of the world over the past 20 years (1). Although our ability to predict suicide-related behaviors (SRBs) has been limited (2), efforts in recent years have leveraged large-scale data resources and powerful computational methods to overcome this problem. The report by Gradus et al. (3) is an excellent example of this line of research, using machine learning (ML) methods to predict administratively recorded nonfatal suicide attempts (SAs) throughout Denmark (1995–2015) from electronic health records and other administrative data. The report advances the field by leveraging nationwide registry data containing many predictors (n = 1,458) and suicide attempts (n = 22,974) to develop sex-specific ML models that achieve high accuracy (area under the curve = 0.87–0.91) and, consistent with other recent studies (4–7), substantial concentrations of SAs in the high-risk tier (e.g., ≥40% of incident SAs among the 5% of people with highest predicted risk) (3). These results add to growing evidence that computational models from administrative data can stratify SRB risk at levels well beyond those achievable based exclusively on clinical evaluations.
At the same time, challenges exist in making methodologic choices in SRB prediction modeling (8, 9). The clear and thoughtful exposition by Gradus et al. provides an opportunity to review best practices in these choices. We do that here, focusing on defining controls and risk horizons, selecting predictors, dealing with imbalanced outcomes, choosing classifiers, tuning hyperparameters, evaluating predictor importance, and examining operating characteristics. We close by arguing that the time has come to begin developing clinically useful models tied to interventions and matching risk profiles with intervention options.
DEFINING CASES AND CONTROLS
Gradus et al. define cases as administratively recorded incident nonfatal SAs throughout Denmark in 1995–2015. These are distinct from suicide deaths, which the authors reported on separately (10). In the present Gradus et al. study, suicide deaths, along with all other deaths, are treated as competing risks that result in patients being censored. It is noteworthy in this regard that previous research has shown that the predictors of suicide deaths are quite different from the predictors of nonfatal suicide attempts (11). For example, the suicide death rate is much higher among men than among women, whereas the nonfatal suicide attempt rate is much higher among women than men (12). Most of the methodologic points we make below, however, apply equally to all SRBs, whether fatal or nonfatal.
Gradus et al. define controls as randomly selected single months with no SA history as of the start date for a probability sample of “individuals living in Denmark on January 1, 1995.” The inclusion of post-1994 immigrants as cases but not as controls introduces bias, given that immigrants to Denmark (8% of the population) have a much higher SA rate than nonimmigrants (13, 14). This bias could have been removed either by requiring cases, and not only controls, to have been living in Denmark on January 1, 1995, or by removing this restriction from controls. The decision by Gradus et al. to select a single random month for controls, rather than sampling person-months proportional to cases in a case-crossover design with appropriate censoring, could have introduced additional biases. As reviewed elsewhere (8), similar issues in selecting appropriate controls are common in ML SRB studies.
RISK HORIZON
Gradus et al. create nested retrospective time intervals (from 0–6 to 0–48 months before SA) to aggregate electronic-health-record predictors involving visits, prescriptions, and procedures. This allows predictor effects (slopes) to decay over longer lags (15). However, by including a 0–6-month retrospective interval, the risk horizon (i.e., the survival period from the last predictor assessment) is implicitly set at 0 months; that is, information from earlier in the same month as the SA is included among the predictors. This is not the optimal risk horizon for many applications, which would generally be prospective rather than cross-sectional. For example, a primary care provider seeing a patient for an annual physical might be concerned with SRB risk over the next 12 months, whereas an emergency department doctor deciding whether to hospitalize a suicidal patient might be more concerned with imminent risk (16). Only models with appropriate risk horizons can address these concerns. Because important predictors change across risk horizons, some researchers develop separate models for different horizons using a fixed-point design rather than a retrospective case-control design. The retrospective creation of the predictor variables further introduces a common form of temporal bias between the cases and the controls, which has recently been described (17). The differential time horizons of cases compared with controls have the effect of inflating estimated discriminative performance relative to designs in which cases and controls have equivalent index events, such as the date of a medical visit or precursor diagnosis. Careful thought is consequently needed about the clinical decision the analysis is trying to mimic so as to make sure the temporal sampling in the design is appropriate for that purpose. Gradus et al. are not explicit about this critical issue.
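To make this design point concrete, the following minimal sketch (assuming Python with pandas; the column names person_id, event_date, and outcome_date are illustrative, not those of the Danish registries) shows how lagged predictor windows can be anchored to an index date that precedes the outcome assessment by an explicit risk horizon, so that no predictor information falls inside the horizon.

```python
# A minimal sketch, assuming pandas; all column names are hypothetical.
import pandas as pd

def lagged_event_counts(events, cohort, lag_start, lag_end, horizon_months=12):
    """Count each person's events in the window lag_start-lag_end months before
    an index date that itself precedes the outcome assessment date by
    horizon_months, so predictors never overlap the risk horizon."""
    cohort = cohort.copy()
    cohort["index_date"] = cohort["outcome_date"] - pd.DateOffset(months=horizon_months)
    merged = events.merge(cohort[["person_id", "index_date"]], on="person_id")
    months_before = (merged["index_date"] - merged["event_date"]).dt.days / 30.44
    in_window = months_before.between(lag_start, lag_end)
    col = f"events_{lag_start}_{lag_end}m"
    counts = merged.loc[in_window].groupby("person_id").size().reset_index(name=col)
    out = cohort.merge(counts, on="person_id", how="left")
    out[col] = out[col].fillna(0)
    return out
```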
PREDICTORS
Gradus et al. assembled an enviable set of predictors from high-quality Danish registries. Exploiting available predictor information in this way is critical for optimizing ML SRB models. Many investigators fail to do this. But some key opportunities along these lines were not exploited by Gradus et al. Some examples follow.
It sometimes is useful to flip nested retrospective diagnostic codes across time intervals to learn about recency of first onset rather than most recent visit. For example, SRB risk associated with cancer is highest in the months after initial diagnosis and subsequently declines differentially by cancer type (18).
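An illustrative sketch of this kind of feature (again assuming pandas, with hypothetical column names) might compute months since first recorded diagnosis, capturing recency of onset, alongside months since the most recent contact:

```python
# Illustrative sketch: dx has one row per (person_id, dx_code, dx_date);
# cohort has person_id and index_date. Names are hypothetical.
import pandas as pd

def onset_and_recency(dx, cohort):
    """Derive months since first recorded diagnosis (recency of onset) alongside
    months since the most recent contact, per person and diagnosis code."""
    merged = dx.merge(cohort[["person_id", "index_date"]], on="person_id")
    merged = merged[merged["dx_date"] <= merged["index_date"]]
    agg = (merged.groupby(["person_id", "dx_code"])
                 .agg(first_dx=("dx_date", "min"),
                      last_dx=("dx_date", "max"),
                      index_date=("index_date", "first"))
                 .reset_index())
    agg["months_since_first_dx"] = (agg["index_date"] - agg["first_dx"]).dt.days / 30.44
    agg["months_since_last_dx"] = (agg["index_date"] - agg["last_dx"]).dt.days / 30.44
    return agg[["person_id", "dx_code", "months_since_first_dx", "months_since_last_dx"]]
```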
Along with aggregating 2-digit International Classification of Diseases, Tenth Revision (ICD-10) codes, it would have been useful to distinguish some disorder classes in a more fine-grained way given the existence of evidence about differences in SRB risk based on such distinctions (19). In addition, some cross-code composites, such as for chronic pain conditions (20) and the conjunction of primary and secondary site codes to identify metastasized cancer originating at a different site (21), predict SRBs. In predicting all SAs rather than incident SAs, information about prior SAs is critically important given that SA history is one of the strongest predictors of future SAs (22). Importantly, the predictors of incident SAs differ from the predictors of repeat SAs (23).
Psychiatric treatment sector is also important, because psychiatric hospitalizations and emergency department visits are strong SRB predictors. In addition, ICD-10 S, T, and X codes capture information about external causes of injuries associated with exposure to violence, abuse, and maltreatment that predict SRBs (24), whereas ICD-10 Z codes capture information about social determinants of health (e.g., homelessness, psychosocial needs, family stressors) that predict SRBs (25).
It is sometimes also possible to access and analyze clinical notes using natural language processing methods to elicit other information predicting SRBs (26).
Medication codes collapsed across therapeutic classes can be refined to distinguish medications within classes that protect against SRBs, such as lithium for bipolar disorder (27) and clozapine for schizophrenia (28), and to distinguish medications with suicide risk warnings or that pharmacoepidemiologic studies find predict SRBs (29).
Finally, in analyses spanning long time periods, as in the Gradus et al. study, it can be useful to include historical time as a predictor to investigate time trends in main effects and interactions. Gradus et al. found such an interaction involving sex that led to estimating sex-specific models. But other, perhaps more subtle interactions with time might exist that could have been examined if time had been included as a predictor and allowed to interact with other variables in the RF analysis.
IMBALANCED DATA
Most algorithms produce suboptimal results when predicting rare dichotomous outcomes unless adjustments are made (30). Numerous methods developed for this purpose differentially weight or sample cases versus controls (data-based methods) or, like Gradus et al., weight the algorithm loss function to penalize false-negative results more than false-positive results (classifier-based methods (31)). Because model performance can vary based on this decision, it is useful to explore a range of penalties (32).
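As one illustration of a classifier-based adjustment (a sketch assuming scikit-learn, not the implementation used by Gradus et al.), the loss can be reweighted to penalize false negatives more heavily, and a range of penalties explored:

```python
# Minimal sketch, assuming scikit-learn; penalties shown are arbitrary.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def explore_class_weights(X, y, penalties=(10, 50, 100, 500)):
    results = {}
    for w in penalties:
        clf = RandomForestClassifier(
            n_estimators=500,
            class_weight={0: 1, 1: w},  # penalize false negatives w times more
            n_jobs=-1,
            random_state=0,
        )
        # Average precision (area under the precision-recall curve) is a more
        # informative target than accuracy for a rare outcome.
        results[w] = cross_val_score(clf, X, y, cv=5, scoring="average_precision").mean()
    return results
```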
CHOOSING CLASSIFIERS
No single ML classifier is optimal for all prediction problems. Kaggle competitions typically find that random forest (RF), the classifier Gradus et al. used, various kinds of gradient boosting (XGBoost, LightGBM, CatBoost, or a stacked ensemble of more than one of these options), and Bayesian additive regression trees outperform other methods in predicting structured data like those in ML SRB models (33, 34). Relative algorithm performance varies across applications, though, and promising new algorithms are constantly emerging (35). Some researchers address this by replicating analyses with several different algorithms in a training sample before picking the best classifier for their final model. As noted by Gradus et al., however, this is unnecessary, because an ensemble method exists that uses cross-validation to create an optimal weighted combination of predicted outcome scores across multiple classifiers that is guaranteed in expectation to perform at least as well as the best component algorithm (36, 37).
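An illustrative sketch of such a cross-validated ensemble, using scikit-learn's StackingClassifier as a stand-in for the super learner described in references 36 and 37 (the component learners and settings shown are arbitrary choices):

```python
# Illustrative sketch only; not the super learner implementation itself.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)),
    ("gbm", GradientBoostingClassifier(random_state=0)),
    ("logit", LogisticRegression(max_iter=1000)),
]

# The meta-learner combines cross-validated predicted probabilities from the
# base learners, so the ensemble is expected to do at least as well as its
# best component.
ensemble = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
    cv=5,
)
# Usage: ensemble.fit(X_train, y_train); ensemble.predict_proba(X_test)
```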
TUNING HYPERPARAMETERS
Prediction models typically require some model characteristics to be fixed (hyperparameters) before estimation. Changing (tuning) these hyperparameters often leads to substantial prediction improvement (38, 39). Gradus et al. followed standard practice for fixed RF hyperparameter values, but a data-driven method to optimize hyperparameter selection should be used when development is for implementation rather than demonstration (40).
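For example, a data-driven search over RF hyperparameters might look like the following sketch (assuming scikit-learn; the search space shown is arbitrary and not that of Gradus et al.):

```python
# Minimal sketch of randomized hyperparameter search with cross-validation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [250, 500, 1000],
    "max_features": ["sqrt", 0.1, 0.3],
    "min_samples_leaf": [1, 5, 25, 100],
    "max_depth": [None, 10, 20],
}

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", n_jobs=-1, random_state=0),
    param_distributions=param_distributions,
    n_iter=25,
    scoring="average_precision",  # appropriate for a rare outcome
    cv=5,
    random_state=0,
)
# Usage: search.fit(X_train, y_train); search.best_params_, search.best_score_
```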
EVALUATING PREDICTOR IMPORTANCE
Gradus et al. focus on predictor variable importance. This is understandable; researchers and clinicians are interested in the predictors that drive model performance. It is critical to recognize, though, that ML models prioritize prediction accuracy over variable interpretation, resulting in ML importance metrics having limited value. For example, the several different importance metrics in RF typically produce distinct variable importance rankings, all of which are: 1) biased by favoring predictors with many values and low correlations with other predictors; and 2) internally inconsistent (i.e., estimated variable importance can go down when the model is changed to rely more on that variable (41, 42)).
A recently developed procedure called TreeExplainer resolves some of these problems (43) and introduces procedures to discover critical interactions and differences in RF variable importance across individuals (44). Nonetheless, caution is still needed in making substantive interpretations of such results. Prediction importance should not be confused with causal importance. Instead, predictor importance analysis is most useful for pruning predictors to increase out-of-sample model performance. If researchers have a serious interest in investigating variable importance for targeting interventions, a better approach is to assign a predicted probability of SRB to each observation based on the ML model, apply precision-recall curves or net benefit curves (see next section) to select an intervention decision threshold, and inspect clustering of modifiable predictors that distinguish above versus below that threshold for targeted learning analyses (45) to evaluate potential implications of preventive interventions.
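As an illustration, TreeExplainer-style attributions can be obtained from a fitted tree ensemble with the shap package; the sketch below assumes a fitted model and validation data and is not the analysis pipeline of Gradus et al.:

```python
# Sketch assuming the shap package (TreeExplainer, reference 43).
import numpy as np
import shap

explainer = shap.TreeExplainer(fitted_rf_model)         # fitted_rf_model: a trained tree ensemble
shap_values = explainer.shap_values(X_validation)       # one attribution per person per predictor
interaction_values = explainer.shap_interaction_values(X_validation)

# For a binary classifier, shap_values may be returned per class; the
# positive-class array is typically the one of interest.
pos_class_shap = shap_values[1] if isinstance(shap_values, list) else shap_values

# Global importance as mean absolute attribution; the individual rows show how
# importance differs across people.
global_importance = np.abs(pos_class_shap).mean(axis=0)
```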
OPERATING CHARACTERISTICS
The Gradus et al. RF models have excellent operating characteristics. For example, 44.7% of incident SAs among men occurred among the 5% of men with highest cross-validation risk (3). However, given the rarity of SA (87/100,000 male person-years), precision-recall curves should also be estimated to examine associations between positive predictive value (“precision”) and sensitivity (“recall”). This would allow us to see, for example, that 6-month positive predictive value is only about 0.4% among the 5% of men with highest predicted SA risk, a fact that was not emphasized by Gradus et al. Just as the area under the curve–receiver operating characteristic summarizes the association between sensitivity and 1 minus specificity in a receiver operating characteristic curve, the area under the curve–precision-recall summarizes the association between positive predictive value and sensitivity in a precision-recall curve. This becomes especially important when ML is used to guide clinical decision-making, because joint understanding of risk above a decision threshold in the absence of intervention and likely intervention effectiveness are needed to convey information about clinical implications of an intervention (46). When intervention costs and benefits are known, it is also useful to calculate a net benefit curve to arrive at a principled decision threshold (47). As a best practice, it is also useful to examine the calibration of the final model—how similar the model’s probability predictions are to the observed probabilities of SA (48). Replicability and transparency can be enhanced further by following model reporting guidelines (e.g., Collins et al. (49)).
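These quantities are straightforward to compute. The sketch below (assuming scikit-learn, with y_true and y_prob as placeholder numpy arrays of observed outcomes and predicted probabilities) covers the precision-recall curve and its area, a calibration check, and a simple net benefit calculation at a candidate decision threshold:

```python
# Minimal sketch of operating-characteristic checks; inputs are placeholders.
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.calibration import calibration_curve

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
auc_pr = average_precision_score(y_true, y_prob)  # area under the precision-recall curve

# Calibration: observed event fraction within quantile bins of predicted probability.
observed, predicted = calibration_curve(y_true, y_prob, n_bins=10, strategy="quantile")

def net_benefit(y_true, y_prob, p_t):
    """Net benefit at threshold p_t (decision-curve analysis, reference 47):
    true positives credited, false positives penalized by the threshold odds."""
    n = len(y_true)
    flagged = y_prob >= p_t
    tp = (flagged & (y_true == 1)).sum()
    fp = (flagged & (y_true == 0)).sum()
    return tp / n - (fp / n) * (p_t / (1 - p_t))
```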
WHERE SHOULD WE GO FROM HERE?
In addition to the above suggestions for optimizing prediction and interpretation, it is noteworthy that other types of data are increasingly being used to improve ML SRB models (e.g., patient-report surveys, digital devices, biomarkers (50–52)). Because these can be burdensome and expensive, though, careful consideration is needed of incremental benefits. A tiered approach can do this by beginning with passively collected administrative data like those used by Gradus et al., with a focus on ruling out low-risk patients from further assessments (8). Subsequent steps can then include inexpensive self-report surveys followed by additional ML analyses to target successively more refined subsets of patients for more intensive and/or expensive assessments, culminating in in-depth clinical evaluations based on all data (9).
Future ML SRB research also needs to be more focused on why ML SRB models are being developed. Gradus et al. note that such models have two purposes: to provide insights into important SRB predictors and to pinpoint high-risk patients for targeted interventions. We already noted the limitations of ML importance measures in achieving the first purpose. Two issues can also be raised about the second purpose. One was already noted: that models designed to target preventive interventions should select samples and time horizons appropriate to those interventions (53). This is not always done in existing applications (9, 17).
The other is that intervention assignment should be based ideally on an understanding of not only SRB risk but also comparative effects of intervention options for specific patients. For example, controlled trials show that several outpatient therapies reduce SRBs among patients reporting suicidality (54). Conventional ML SRB prediction models might do an excellent job determining which patients are most in need of these interventions. However, conventional ML SRB prediction models do not help determine which intervention is best for which patient. Individualized treatment rules (ITRs) providing this information would be of great value.
ML models to develop ITRs are different from models to predict SRB risk because patients at highest risk are not necessarily those most likely to be helped by available interventions. ITR models instead evaluate interactions between presumed prescriptive predictors (i.e., predictors of greater response to one intervention than to others) and intervention type received (55). Although ITR models should ideally be developed in comparative effectiveness trials, such trials can sometimes be emulated in observational samples (56). Such adjusted databases can yield results close to those of randomized clinical trials (57). Extensions exist to develop ITRs that estimate the subset of patients that would benefit from intervention (58).
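One simple way to operationalize this idea (an illustrative sketch of a so-called T-learner, not necessarily the approach in references 55–58) is to fit separate outcome models within each intervention arm of a real or emulated trial and then assign each new patient the intervention with the lowest predicted SRB risk:

```python
# Illustrative ITR sketch; data, model choice, and arm labels are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_itr(X, y, treatment):
    """X: covariates; y: SRB outcome (1 = SRB); treatment: arm label per patient.
    Fits one outcome model per intervention arm."""
    models = {}
    for arm in np.unique(treatment):
        mask = treatment == arm
        models[arm] = GradientBoostingClassifier(random_state=0).fit(X[mask], y[mask])
    return models

def recommend(models, X_new):
    """For each new patient, return the arm with the lowest predicted SRB risk."""
    arms = list(models)
    risks = np.column_stack([models[a].predict_proba(X_new)[:, 1] for a in arms])
    return np.array(arms)[risks.argmin(axis=1)]
```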
CONCLUSIONS
We do not want to leave the impression that the Gradus et al. study is an inferior example of an ML SRB study. It is not. The problems we described exist throughout the ML SRB literature and more generally in epidemiologic studies focused narrowly on prediction rather than broader considerations of impact or use (59). It is by now well known that ML methods can be used to predict SRBs. We consequently need to take the next step of applying thoughtful versions of ML SRB models to help target and evaluate carefully selected interventions in appropriate samples and with risk horizons appropriate to these interventions. This should be followed by subsequent ML analyses to develop ITRs for comparative effectiveness of alternative interventions aimed at optimizing interventions to address the growing problems of SAs, suicides, and other external causes of injury and death in the same spectrum (60).
ACKNOWLEDGMENTS
Author affiliations: Department of Behavioral Medicine and Psychiatry, West Virginia University School of Medicine, Morgantown, West Virginia, United States (Robert M. Bossarte, Cara Stokes); US Department of Veterans Affairs Center of Excellence for Suicide Prevention, Canandaigua, New York, United States (Robert M. Bossarte, Cara Stokes); Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States (Chris J. Kennedy); Department of Statistics, University of Washington, Seattle, Washington, United States (Alex Luedtke); Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States (Alex Luedtke); Department of Psychology, Harvard University, Cambridge, Massachusetts, United States (Matthew K. Nock); Department of Psychiatry, Massachusetts General Hospital, Boston, Massachusetts, United States (Jordan W. Smoller); and Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, United States (Ronald C. Kessler).
This work was supported, in part, by the Department of Veterans Affairs Center of Excellence for Suicide Prevention (R.M.B.) and the National Institute of Mental Health of the National Institutes of Health (grant R01MH121478, R.C.K.).
The contents are solely the responsibility of the authors and do not necessarily represent the views of the funding organizations.
In the past 3 years, R.C.K. was a consultant for Datastat Inc., Sage Pharmaceuticals, and Takeda. The other authors report no conflicts.