Joshua Feldman, Andrea Thomas-Bachli, Jack Forsyth, Zaki Hasnain Patel, Kamran Khan, Development of a global infectious disease activity database using natural language processing, machine learning, and human expertise, Journal of the American Medical Informatics Association, Volume 26, Issue 11, November 2019, Pages 1355–1359, https://doi.org/10.1093/jamia/ocz112
Abstract
We assessed whether machine learning can be used to efficiently extract infectious disease activity information from online media reports.
We curated a data set of labeled media reports (n = 8322) indicating which articles contain updates about disease activity, and trained a classifier on this data set. To validate our system, we used a held-out test set and compared our articles to the World Health Organization (WHO) Disease Outbreak News reports.
Our classifier achieved a recall of 88.8% and a precision of 86.1%. Of the WHO-identified outbreaks that were covered by online media (89%), the overall surveillance system detected 94%, and did so 43.4 days (IQR: 9.5–61) earlier on average.
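As a reminder of how the two classifier metrics above are defined, the following minimal sketch computes precision and recall from binary labels on a held-out set. The function name `precision_recall` and the toy label lists are hypothetical and used only for illustration; they are not the study's data or code, and the encoding 1 = "article reports disease activity" is an assumption.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of true positives that the classifier found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels (assumption: 1 = article reports disease activity).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In a surveillance setting, recall reflects how many relevant reports are captured, while precision reflects how much irrelevant material human reviewers would still need to filter out.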
We constructed a global real-time disease activity database surveilling 114 illnesses and syndromes. We must further assess our system for bias, representativeness, granularity, and accuracy.
Machine learning, natural language processing, and human expertise can be used to efficiently identify disease activity from digital media reports.