Abstract

Objective

We assessed whether machine learning can be utilized to allow efficient extraction of infectious disease activity information from online media reports.

Materials and Methods

We curated a data set of labeled media reports (n = 8322) indicating which articles contain updates about disease activity. We trained a classifier on this data set. To validate our system, we used a held out test set and compared our articles to the World Health Organization Disease Outbreak News reports.

Results

Our classifier achieved a recall and precision of 88.8% and 86.1%, respectively. The overall surveillance system detected 94% of the outbreaks identified by the WHO covered by online media (89%) and did so 43.4 (IQR: 9.5–61) days earlier on average.

Discussion

We constructed a global real-time disease activity database surveilling 114 illnesses and syndromes. We must further assess our system for bias, representativeness, granularity, and accuracy.

Conclusion

Machine learning, natural language processing, and human expertise can be used to efficiently identify disease activity from digital media reports.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
You do not currently have access to this article.