Abstract

Objective

The aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to address the natural language processing and machine learning challenges inherent to utilizing social media data for health informatics. In this paper, we present the annotated corpora, a technical summary of participants’ systems, and the performance results.

Methods

The eighth iteration of the #SMM4H shared tasks was hosted at the AMIA 2023 Annual Symposium and consisted of 5 tasks that represented various social media platforms (Twitter and Reddit), languages (English and Spanish), methods (binary classification, multi-class classification, extraction, and normalization), and topics (COVID-19, therapies, social anxiety disorder, and adverse drug events).

Results

In total, 29 teams registered, representing 17 countries. In general, the top-performing systems used deep neural network architectures based on pre-trained transformer models. In particular, the top-performing systems for the classification tasks were based on single models that were pre-trained on social media corpora.

Conclusion

To facilitate future work, the datasets—a total of 61 353 posts—will remain available by request, and the CodaLab sites will remain active for a post-evaluation phase.

Background

With more than 70% of adults in the United States1 and nearly 60% of people worldwide2 using social media, the aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to address the natural language processing and machine learning challenges inherent to utilizing the vast amount of data on social media for health informatics. The eighth iteration of the #SMM4H shared tasks was hosted at the American Medical Informatics Association (AMIA) 2023 Annual Symposium and consisted of 5 tasks. Task 1 was a binary classification task to distinguish English-language tweets that self-report a COVID-19 diagnosis from those that do not.3 Task 2 was a multi-class classification task to categorize users’ sentiment toward therapies in English-language tweets as positive, negative, or neutral.4 Task 3 was a sequence labeling task to extract COVID-19 symptoms in tweets written in Latin American Spanish. Task 4 was a binary classification task to distinguish English-language Reddit posts that self-report a social anxiety disorder diagnosis from those that do not. Task 5 was a normalization task to map adverse drug events (ADEs) in English-language tweets to their standard concept ID in the MedDRA vocabulary.5

Teams could register for a single task or multiple tasks. In total, 29 teams registered, representing 17 countries. Teams were provided with gold standard annotated training and validation sets to develop their systems and, subsequently, a blind test set for the final evaluation. After receiving the test set, teams were given 5 days to submit the predictions of their systems to CodaLab6–10—a platform that facilitates data science competitions—for automatic evaluation, promoting the systematic comparison of performance. Among the 29 teams that registered, 17 teams submitted at least one set of predictions: 8 teams for Task 1, 6 teams for Task 2, 2 teams for Task 3, 8 teams for Task 4, and 3 teams for Task 5. Teams were then invited to submit a short manuscript describing their system, and 12 of the 17 teams did. Each of these 12 system descriptions was peer-reviewed by at least 2 reviewers. In this article, we present the annotated corpora, a technical summary of the systems, and the performance results, providing insights into state-of-the-art methods for mining social media data for health informatics.

Methods

Data collection

We collected a total of 61 353 social media posts for the 5 tasks. For Task 1, the dataset included 10 000 English-language tweets that mentioned a personal reference to the user and keywords related to both COVID-19 and a positive test, diagnosis, or hospitalization.3 For Task 2, the dataset included 5364 English-language tweets that mentioned a total of 32 therapies, including medication, behavioral, and physical therapies.4 These tweets were posted by a cohort of users who self-reported chronic pain on Twitter,11 making it likely that the sentiments toward the therapies were expressed by patients who were actually experiencing them. For Task 3, the dataset included 8861 tweets that were written in Latin American Spanish and reported COVID-19 symptoms of the user or someone known to the user, building on a previous iteration of this task involving the multi-class classification of Spanish-language tweets that mentioned COVID-19 symptoms.12 For Task 4, the dataset included 5140 English-language Reddit posts in the r/socialanxiety subreddit, written by users aged 13-25 years. For Task 5, the dataset included 29 449 English-language tweets that mentioned medications, with many from previous iterations of this task.12–17

Annotation

For all 5 tasks, at least a subset of the social media posts was annotated by multiple annotators. Table 1 presents inter-annotator agreement (IAA) and the distribution of the posts in the training, validation, and test sets. For Task 1, 1728 (17.3%) of the tweets were labeled as self-reporting a COVID-19 diagnosis—a positive test, clinical diagnosis, or hospitalization—and 8272 (82.7%) as not.3 For Task 2, 998 (18.6%) of the tweets were labeled as positive sentiment toward a therapy, 619 (11.5%) as negative sentiment, and 3747 (69.9%) as neutral sentiment.4 For Task 3, among the 8861 tweets, 10 145 spans of text containing COVID-19 symptoms were annotated by medical doctors who are native speakers of Latin American Spanish. For Task 4, 2428 (38%) of the Reddit posts were labeled as self-reporting a confirmed or probable clinical social anxiety disorder diagnosis and 3962 (62%) as not. For Task 5, 2219 (7.5%) of the tweets were labeled as reporting an ADE and 27 230 (92.5%) as not. Among these 2219 tweets, 3021 spans of text containing an ADE were annotated and labeled with a corresponding MedDRA ID. Among the 1224 ADEs in the test set, 272 (22.2%) were not reported in the training or validation sets.
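
For reference, the pairwise agreement values of the Cohen's Kappa type reported in Table 1 can be computed with scikit-learn; the sketch below uses hypothetical labels for a dual-annotated subset, not the actual annotations.

```python
# Illustrative only: Cohen's kappa for a dual-annotated subset of posts.
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels from two annotators for the same 10 posts
# (1 = self-reports a diagnosis, 0 = does not).
annotator_a = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
annotator_b = [1, 0, 1, 1, 1, 0, 0, 0, 0, 0]

print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```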

Table 1.

Inter-annotator agreement (IAA) and distribution of social media posts in the training, validation, and test sets for the 5 #SMM4H 2023 shared tasks.

TaskPlatformLanguageTrainingValidationTestTotalIAA
1TwitterEnglish7600400200010 0000.79a
2TwitterEnglish3009753160253640.82b
3TwitterSpanish60211979215010 1500.88b
4RedditEnglish4500640125063900.80b
5TwitterEnglish17 38591511 14929 4490.68c
TaskPlatformLanguageTrainingValidationTestTotalIAA
1TwitterEnglish7600400200010 0000.79a
2TwitterEnglish3009753160253640.82b
3TwitterSpanish60211979215010 1500.88b
4RedditEnglish4500640125063900.80b
5TwitterEnglish17 38591511 14929 4490.68c
a

Fleiss’ Kappa.

b

Cohen’s Kappa.

c

F1-score.


Classification

For Task 1, the benchmark system3 and 5 of the 7 teams (Table 2 in Results) that submitted a system description (Shayona,18 UQ,19 IICU-DSRG, KUL,20 and TMN21) used a classifier based on COVID-Twitter-BERT—a transformer model pre-trained on tweets related to COVID-19.22 Two of these 5 teams used additional BERT-based models for the feature representation (IICU-DSRG) or ensemble learning (TMN21); IICU-DSRG used BioMed-RoBERTa-Base,23 and TMN21 used RoBERTa-Large23 and Twitter-RoBERTa-Base.24 One of the 2 teams that did not use COVID-Twitter-BERT22 (Explorers25) used an ensemble of RoBERTa-Base,23 CPM-RoBERTa, and BERTweet26 models, following 5-fold cross-validation for each individual model. They also used the tweets in the Task 1 and Task 4 training and validation sets for continued domain-adaptive pre-training of these models.27 The other team (BFCI28) used a Passive Aggressive classifier with bigrams as features in a bag-of-words representation. Three of the teams (UQ,19 KUL,20 and Explorers25) used techniques for addressing the class imbalance, including data augmentation (UQ19 and KUL20), over-/under-sampling (UQ19), and class weights (UQ19 and Explorers25).
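
None of these systems is distributed with this overview; as a rough illustration of the common recipe (fine-tuning COVID-Twitter-BERT for binary classification, with class weights as one option for handling the label imbalance), a minimal Hugging Face sketch might look like the following. The checkpoint name, class-weight ratio, and hyperparameters are assumptions for illustration, not any team's configuration.

```python
import torch
from torch import nn
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT = "digitalepidemiologylab/covid-twitter-bert-v2"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

class WeightedTrainer(Trainer):
    """Trainer that up-weights the minority (self-report) class in the loss."""
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Roughly the negative-to-positive class ratio in the Task 1 training data.
        weights = torch.tensor([1.0, 4.8]).to(outputs.logits.device)
        loss = nn.CrossEntropyLoss(weight=weights)(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(output_dir="task1_model", num_train_epochs=5,
                         per_device_train_batch_size=8, learning_rate=1e-5)
# trainer = WeightedTrainer(model=model, args=args,
#                           train_dataset=train_ds, eval_dataset=val_ds)  # tokenized datasets
# trainer.train()
```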

Table 2.

Performance of benchmark systems and 18 teams’ systems on the test sets for the 5 #SMM4H 2023 shared tasks, with the best performance for each task in bold.

Team | Test set score(s)
ABCD | Task 5a: 0.188; Task 5b: 0.089
BFCI | Task 1: 0.637; Task 2: 0.714
CEN | 0.718
DS4DH | Task 5a: 0.426; Task 5b: 0.292
Explorers | Task 1: 0.872; Task 3: 0.94; Task 4: 0.695
HULAT-UC3M | Task 2: 0.669
I2R-MI | Task 2: 0.752; Task 5a: 0.322; Task 5b: 0.195
IICU-DSRG | Task 1: 0.898
ITT | 0.728
KUL | Task 1: 0.877
Mantis | Task 4: 0.838
MUCS | 0.741; 0.185
Shayona | Task 1: 0.943; Task 4: 0.810
ThaparUni | Task 4: 0.842
TMN | Task 1: 0.845
UQ | Task 1: 0.943; Task 2: 0.778; Task 4: 0.871
Unknown-1 (a) | 0.695; 0.841
Unknown-2 (a) | 0.89
Task 1 Benchmark | 0.938
Task 3 Benchmark (BETO) | 0.90
Task 3 Benchmark (COVID-Twitter-BERT) | 0.92
Task 5 Benchmark | Task 5a: 0.452; Task 5b: 0.363

Evaluation metrics: Task 1, F1-score for the class of tweets that self-reported a COVID-19 diagnosis; Task 2, micro-averaged F1-score for all 3 classes of tweets; Task 3, strict F1-score for identifying the character offsets of COVID-19 symptoms; Task 4, F1-score for the class of Reddit posts that self-reported a social anxiety disorder diagnosis; Task 5a, micro-averaged F1-score for all MedDRA IDs; Task 5b, micro-averaged F1-score for MedDRA IDs that were not seen during training.

(a) CodaLab submission could not be cross-referenced with a registered team name.


For Task 2, 2 of the 4 teams that submitted a system description (UQ19 and I2R-MI29) used BERT-based classifiers: BERTweet-Large26 (UQ19) and RoBERTa-Base23 (I2R-MI29). These teams also used techniques for addressing the class imbalance, including data augmentation (UQ19), over-sampling (UQ19), under-sampling (UQ19 and I2R-MI29), and class weights (UQ19). Furthermore, UQ19 used a large language model (LLM), GPT-3.5,30 to verify the predictions of the BERTweet-Large26 classifier, automatically changing the prediction to the majority (ie, neutral sentiment) class if the LLM did not find evidence in the tweet to support the classification of positive or negative sentiment. The 2 teams that did not use a BERT-based classifier (BFCI28 and HULAT-UC3M) used a Multilayer Perceptron classifier with 5-fold cross-validation (BFCI28) and a Random Forest classifier with feature engineering (HULAT-UC3M).
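
UQ's LLM verification step is described only at a high level in their paper; the general pattern might be sketched as follows, with query_llm standing in for a hypothetical chat-completion helper and a prompt that is illustrative rather than UQ's actual prompt.

```python
def verify_sentiment(tweet: str, predicted: str, query_llm) -> str:
    """Fall back to the majority (neutral) class when an LLM finds no evidence
    supporting a non-neutral prediction; a sketch of the general idea only."""
    if predicted == "neutral":
        return predicted
    prompt = (
        f"Tweet: {tweet}\n"
        f"A classifier labeled this tweet as expressing {predicted} sentiment "
        "toward a therapy. Does the tweet contain evidence supporting that "
        "label? Answer yes or no."
    )
    answer = query_llm(prompt)  # e.g., a GPT-3.5 chat-completion call
    return predicted if answer.strip().lower().startswith("yes") else "neutral"
```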

For Task 4, all 5 of the teams that submitted a system description (UQ,19 ThaparUni,31 Mantis,32 Shayona,18 and Explorers25) used BERT-based models, with 2 of the teams (UQ19 and Explorers25) using BERTweet26 models. As in Task 2, UQ19 used techniques for addressing the class imbalance and an LLM to correct the predictions of the BERTweet26 classifier. As in Task 1, Explorers25 used BERTweet26 in an ensemble with RoBERTa-Base23 and CPM-RoBERTa models, domain-adaptive pre-training,27 and techniques for addressing the class imbalance. Also as in Task 1, Shayona18 used a COVID-Twitter-BERT22 model with gradient boosting.33 Along with Explorers,25 the other 2 teams (ThaparUni31 and Mantis32) used an ensemble of pre-trained transformer models; while ThaparUni31 used RoBERTa,23 ERNIE 2.0,34 and XLNet35 models, which were pre-trained on general-domain corpora, Mantis32 used MentalRoBERTa36 and PsychBERT37 models, which were pre-trained on corpora related to mental health, including Reddit posts.
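
Several Task 4 teams combined multiple fine-tuned transformers. One common way to combine such models, assumed here rather than taken from any specific system description, is to average the per-class probabilities across models and predict the class with the highest mean probability.

```python
import numpy as np

def ensemble_probabilities(prob_matrices):
    """Average per-class probabilities from several fine-tuned models.

    prob_matrices: list of arrays with shape (n_posts, n_classes), one per model.
    Returns the index of the highest mean probability for each post.
    """
    mean_probs = np.mean(np.stack(prob_matrices, axis=0), axis=0)
    return mean_probs.argmax(axis=1)

# Illustrative values for 3 models, 2 posts, and 2 classes (not real predictions).
preds = ensemble_probabilities([
    np.array([[0.2, 0.8], [0.6, 0.4]]),
    np.array([[0.3, 0.7], [0.7, 0.3]]),
    np.array([[0.4, 0.6], [0.4, 0.6]]),
])
print(preds)  # [1 0]
```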

Extraction

For Task 3, the one team that submitted a system description (Explorers25) used the W2NER framework38 for named entity recognition and an ensemble of 3 Spanish BERT-based models: Spanish BERT (BETO),39 BETO+NER, and a version of BETO that was fine-tuned using the Spanish portion of the XNLI corpus.40 As in Task 1 and Task 4, they also used the tweets in the training and validation sets for continued domain-adaptive pre-training27 of these models. Two benchmark systems used BETO39 and COVID-Twitter-BERT22 models. For Task 5, the benchmark system5 and the 2 teams that submitted a system description (DS4DH41 and I2R-MI29) also used pre-trained transformer models, in a pipeline approach in which ADEs are first extracted and then mapped to MedDRA terms: BERT42 (benchmark system5), BERTweet26 (DS4DH41), and T543 (I2R-MI29).
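
For sequence labeling tasks such as Task 3, annotated character-offset spans are typically converted into token-level BIO labels before fine-tuning. The generic preprocessing sketch below assumes a Hugging Face fast tokenizer and hypothetical symptom spans; it is not taken from any team's system.

```python
from transformers import AutoTokenizer

def spans_to_bio(text, spans, tokenizer):
    """Convert character-offset span annotations into token-level BIO labels.
    spans: list of (start, end, label) tuples with end-exclusive offsets."""
    encoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    labels = []
    for start, end in encoding["offset_mapping"]:
        tag = "O"
        for span_start, span_end, label in spans:
            if start >= span_start and end <= span_end:
                tag = ("B-" if start == span_start else "I-") + label
                break
        labels.append(tag)
    return encoding.tokens(), labels

# Illustrative example with hypothetical spans (offsets refer to the raw text).
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
tokens, labels = spans_to_bio("tengo gripe y dolor de cabeza",
                              [(6, 11, "SINTOMA"), (14, 29, "SINTOMA")], tok)
print(list(zip(tokens, labels)))
```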

Normalization

For Task 5, the 2 teams that submitted a system description (DS4DH41 and I2R-MI29) took similar approaches to ADE normalization, using similarity metrics to compare the vector representations of extracted ADEs to MedDRA terms, based on pre-trained sentence transformer models. DS4DH41 used multiple sentence transformers to represent each extracted ADE—S-PubMedBERT,44 All-MPNet-Base-v2,45 All-DistilRoBERTa-v1,45 All-MiniLM-L6-v2,45 and a custom sentence transformer that integrates knowledge from the Unified Medical Language System (UMLS) Metathesaurus—and aggregated their similarity scores via reciprocal-rank fusion,46 while I2R-MI29 used only a single SBERT45 model for the similarity ranking. In contrast, the benchmark system5 took a learning-based approach, using a BERT42 model for multi-class classification.
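
The retrieve-by-similarity idea behind both submissions can be sketched with the sentence-transformers library; the encoders, the toy MedDRA lexicon, and the RRF constant below are placeholders rather than DS4DH's or I2R-MI's actual configuration.

```python
from sentence_transformers import SentenceTransformer, util

# A toy lexicon of MedDRA preferred terms keyed by ID (illustrative subset only).
MEDDRA = {"10019211": "headache", "10028813": "nausea", "10022437": "insomnia"}

# Two general-purpose sentence encoders stand in for the task-specific ones.
MODELS = [SentenceTransformer("all-MiniLM-L6-v2"),
          SentenceTransformer("all-mpnet-base-v2")]

def normalize_ade(ade_text, k=60):
    """Rank MedDRA terms by cosine similarity under each encoder and fuse the
    rankings with reciprocal-rank fusion (RRF); a sketch of the general idea."""
    ids = list(MEDDRA)
    fused = {meddra_id: 0.0 for meddra_id in ids}
    for model in MODELS:
        ade_vec = model.encode(ade_text, convert_to_tensor=True)
        term_vecs = model.encode([MEDDRA[i] for i in ids], convert_to_tensor=True)
        sims = util.cos_sim(ade_vec, term_vecs)[0].tolist()
        ranking = sorted(range(len(ids)), key=lambda idx: -sims[idx])
        for rank, idx in enumerate(ranking, start=1):
            fused[ids[idx]] += 1.0 / (k + rank)  # standard RRF contribution
    return max(fused, key=fused.get)

print(normalize_ade("couldn't sleep at all last night"))  # expected: the insomnia ID
```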

Results and discussion

Table 2 presents the performance of 18 teams’ systems on the test sets for the 5 tasks. For Task 1, the evaluation metric was the F1-score for the class of tweets that self-reported a COVID-19 diagnosis. Shayona18 and UQ19 achieved nearly identical F1-scores (0.943) using a COVID-Twitter-BERT22 model, with Shayona18 marginally outperforming UQ19; however, neither gradient boosting33 (Shayona18) nor techniques for addressing the class imbalance (UQ19) substantially improved performance over a COVID-Twitter-BERT benchmark classifier3 (0.938), which was trained using the Flair library with 5 epochs, a batch size of 8, and a learning rate of 1e-5. While the recall of Shayona’s18 (0.938) and UQ’s19 (0.950) systems outperformed that of the benchmark system (0.914), the benchmark system3 achieved the highest precision (0.962) among all of the systems. These 3 systems outperformed IICU-DSRG’s and TMN’s21 systems that used COVID-Twitter-BERT22 with additional BERT-based models. While KUL20 did not use additional models, the lower performance of their system (0.877) may be a result of the variation in hyperparameters used for fine-tuning COVID-Twitter-BERT22: 10 epochs, a batch size of 32, and a learning rate of 5e-5. All of the systems that used BERT-based classifiers substantially outperformed BFCI’s28 Passive Aggressive classifier (0.637).
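
For reference, fine-tuning COVID-Twitter-BERT in Flair with the benchmark's reported hyperparameters might look roughly like the sketch below; the corpus layout, label type name, and exact Flair API calls are assumptions rather than the authors' released code.

```python
# Rough Flair sketch: 5 epochs, batch size 8, learning rate 1e-5, as reported
# for the Task 1 benchmark. Data layout (FastText-style files) is assumed.
from flair.datasets import ClassificationCorpus
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

corpus = ClassificationCorpus("data/task1", label_type="diagnosis")
label_dict = corpus.make_label_dictionary(label_type="diagnosis")
embeddings = TransformerDocumentEmbeddings(
    "digitalepidemiologylab/covid-twitter-bert-v2", fine_tune=True)
classifier = TextClassifier(embeddings, label_dictionary=label_dict,
                            label_type="diagnosis")

trainer = ModelTrainer(classifier, corpus)
trainer.fine_tune("models/task1", learning_rate=1e-5, mini_batch_size=8,
                  max_epochs=5)
```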

For Task 2, the evaluation metric was the micro-averaged F1-score for all 3 classes of tweets: positive sentiment, negative sentiment, and neutral sentiment. UQ19 achieved not only the highest micro-averaged F1-score (0.778), but also the highest F1-score for each of the positive (0.611), negative (0.415), and neutral (0.859) classes. An ablation study by UQ19 showed that the use of a LLM30 to verify the predictions of the BERTweet-Large26 classifier improved the micro-averaged F1-score by more than 3 points, contributing to outperforming I2R-MI’s29 RoBERTa-Base23 classifier (0.752). UQ’s19 and I2R-MI’s29 BERT-based classifiers outperformed BFCI’s28 Multilayer Perceptron (0.714) and HULAT-UC3M’s Random Forest (0.669) classifiers.

For Task 3, the evaluation metric was the strict F1-score for identifying the character offsets of COVID-19 symptoms. Explorers’25 ensemble of 3 Spanish BERT-based models and continued domain-adaptive pre-training27 of these models achieved not only a higher F1-score (0.94) than that of BETO39 (0.90) and COVID-Twitter-BERT22 (0.92) benchmark systems, but also a higher precision (0.94) and recall (0.93). Despite the high performance of Explorers’25 system, their error analysis revealed that colloquial/regional spelling variants of symptoms remain a challenge; for example, their system was able to identify gripe, but not gripa and gripes.

For Task 4, the evaluation metric was the F1-score for the class of Reddit posts that self-reported a social anxiety disorder diagnosis. Using the same methods as in Task 2, UQ19 also achieved the highest F1-score (0.871) for Task 4. As with UQ,19 Explorers25 also used BERTweet26 and techniques for addressing the class imbalance, though they used it in an ensemble with 2 additional BERT-based models and achieved a substantially lower F1-score (0.695). ThaparUni’s31 ensemble of 3 transformer models that were pre-trained on general-domain corpora (0.842)—RoBERTa,23 ERNIE 2.0,34 and XLNet35—marginally outperformed Mantis’32 ensemble of 2 transformer models that were pre-trained on corpora related to mental health, including Reddit posts (0.838)—MentalRoBERTa36 and PsychBERT.37 Both of these ensemble-based systems outperformed Shayona’s18 COVID-Twitter-BERT22 classifier (0.810), suggesting that this model may not generalize to non-COVID-19 diagnoses.

To enable novel approaches for Task 5, in contrast to previous iterations of this task that used a pipeline-based evaluation,12–17 the evaluation did not include the output of classification or extraction. The primary evaluation metric was the micro-averaged F1-score for all MedDRA IDs in the test set. A secondary evaluation metric was based on a zero-shot learning setup—that is, for MedDRA IDs in the test set that were not seen during training. While DS4DH41 achieved the highest overall precision (0.449) and I2R-MI29 achieved the highest recall for the unseen ADEs (0.406), the benchmark system5 achieved the highest overall recall (0.508) and F1-score (0.452) and the highest precision (0.335) and F1-score (0.363) for the unseen ADEs. In contrast to DS4DH’s41 and I2R-MI’s29 2-component pipelines, the benchmark system’s pipeline includes an initial binary classification component to detect tweets that mention an ADE, pre-filtering the tweets for downstream extraction and normalization.5
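
A simplified stand-in for this evaluation, computing micro-averaged precision, recall, and F1 over (tweet ID, MedDRA ID) pairs, is sketched below with illustrative values; it is not the official evaluation script.

```python
def micro_prf(gold_pairs, pred_pairs):
    """Micro-averaged precision, recall, and F1 over (tweet ID, MedDRA ID) pairs."""
    gold, pred = set(gold_pairs), set(pred_pairs)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative values, not actual task data.
gold = [("t1", "10022437"), ("t2", "10019211"), ("t3", "10028813")]
pred = [("t1", "10022437"), ("t2", "10028813")]
print(micro_prf(gold, pred))  # (0.5, 0.333..., 0.4)
```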

In general, the top-performing team for each of the 5 tasks used a deep neural network architecture based on pre-trained transformer models. In total, 10 of the 12 teams that submitted a system description used pre-trained transformer models. The 2 teams that used feature-engineered algorithms were outperformed by all of the teams that used pre-trained transformer models. In particular, the top-performing team for each of the 3 classification tasks used a model that was pre-trained on a social media corpus. While half (5 of 10) of the teams that participated in the classification tasks used an ensemble-based system, the top-performing system for each of these 3 tasks was based on a single model. In addition, the techniques that 4 of these 10 teams used for addressing the class imbalance did not appear to substantially improve performance. In contrast, for the second iteration of the #SMM4H shared tasks, which was hosted at the AMIA 2017 Annual Symposium, deep neural networks were used in only approximately half of the systems and were outperformed by Support Vector Machine (SVM) and Logistic Regression classifiers on highly imbalanced data.17 Although none of the teams used SVMs for the #SMM4H 2023 shared tasks, our recent work47–49 has demonstrated that BERT-based classifiers outperform SVM classifiers for health-related tasks using social media data.

The high level of performance that can be achieved with recent advances in deep learning and pre-trained transformer models enables the large-scale use of social media as a complementary source of real-world data for research applications. For example, the benchmark classifier (F1-score = 0.938) that was developed for Task 13 was deployed on the timelines of more than 67 000 users who had reported their pregnancy on Twitter, for a retrospective cohort study that used self-reports in longitudinal social media data to assess the association between COVID-19 infection during pregnancy and preterm birth.50 This cohort study also points to the fact that, despite recent changes to the accessibility of data through the Twitter and Reddit APIs, data that were previously collected continue to have utility. For collecting new data, however, these changes may motivate researchers to explore additional platforms that have been proposed in the #SMM4H Workshop, such as WebMD,12 Ask a Patient,51 YouTube,52 Tumblr,53 Mumsnet,54 MedHelp,55 Facebook,55,56 Diabetes Info,57 and Beyond Blue.58

Conclusion

This article presented an overview of the #SMM4H 2023 shared tasks, which represented various social media platforms (Twitter and Reddit), languages (English and Spanish), methods (binary classification, multi-class classification, extraction, and normalization), and topics (COVID-19, therapies, social anxiety disorder, and ADEs). To facilitate future work, the datasets will remain available by request, and the CodaLab sites6–10 will remain active to automatically evaluate new systems against the blind test sets, promoting the ongoing systematic comparison of performance.

Acknowledgments

The authors thank those who contributed to annotating the data, the program committee of the #SMM4H 2023 Workshop, and additional peer reviewers of the system description papers.

Author contributions

A.Z.K.: data collection, annotation, benchmark system, system description review, and wrote the manuscript. J.M.B.: shared task conceptualization, data collection, annotation, benchmark system, CodaLab site, system description review, and wrote the manuscript. Y.G.: data collection, annotation, CodaLab site, system description review, and wrote the manuscript. A.L.S.: data collection, annotation, CodaLab site, system description review, and wrote the manuscript. D.X.: data collection, annotation, benchmark system, CodaLab site, system description review, and wrote the manuscript. J.I.F.A.: data collection, CodaLab site, and edited the manuscript. R.R.-E.: shared task conceptualization and edited the manuscript. A.S.: shared task conceptualization and edited the manuscript. G.G.-H.: shared task conceptualization and guidance, and edited the manuscript.

Funding

A.Z.K., I.F.A., D.X., and G.G.-H. were supported in part by the National Library of Medicine (R01LM011176). Y.G. and A.S. were supported in part by the National Institute on Drug Abuse (R01DA057599). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. J.M.B. was supported in part by a Google Award for Inclusion Research (AIR).

Conflicts of interest

None declared.

Data availability

According to the Twitter Terms of Service, the content (eg, text) of Tweet Objects cannot be made publicly available; however, a limited number of Tweet Objects are permitted to be shared directly. Requests for data can be sent to A.Z.K. ([email protected]) or G.G.-H. ([email protected]).

References

1. Auxier B, Anderson M. Social media use in 2021. Pew Research Center. 7 April 2021. Accessed October 20, 2023. https://www.pewresearch.org/internet/2021/04/07/social-media-use-in-2021/
2. Dixon SJ. Number of global social network users 2017-2027. Statista. 29 August 2023. Accessed October 20, 2023. https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/
3. Klein AZ, Kunatharaju S, O’Connor K, et al. Automatically identifying self-reports of COVID-19 diagnosis on Twitter: an annotated data set, deep neural network classifiers, and a large-scale cohort. J Med Internet Res. 2023;25:e46484.
4. Guo Y, Das S, Lakamana S, et al. An aspect-level sentiment analysis dataset for therapies on Twitter. Data Brief. 2023;50:109618.
5. Magge A, Tutubalina E, Miftahutdinov Z, et al. DeepADEMiner: a deep learning pharmacovigilance pipeline for extraction and normalization of adverse drug event mentions on Twitter. J Am Med Inform Assoc. 2021;28(10):2184-2192.
6. CodaLab. SMM4H 2023 Task 1 – Binary classification of English tweets self-reporting a COVID-19 diagnosis. Accessed October 20, 2023. https://codalab.lisn.upsaclay.fr/competitions/12763
7. CodaLab. SMM4H 2023 Task 2: Classification of sentiment associated with therapies (aspect-oriented). Accessed October 20, 2023. https://codalab.lisn.upsaclay.fr/competitions/12421
8. CodaLab. SMM4H 2023 Task 3 – Extraction of COVID-19 symptoms in Latin American Spanish tweets. Accessed October 20, 2023. https://codalab.lisn.upsaclay.fr/competitions/12901
9. CodaLab. SMM4H23 Task 4 – Binary classification of posts self-reporting a social anxiety disorder diagnosis. Accessed October 20, 2023. https://codalab.lisn.upsaclay.fr/competitions/13160
10. CodaLab. SMM4H 2023 Task 5 – Normalization of adverse drug events in English tweets. Accessed October 20, 2023. https://codalab.lisn.upsaclay.fr/competitions/12941
11. Sarker A, Lakamana S, Guo Y, et al. #ChronicPain: automated building of a chronic pain cohort from Twitter using machine learning. Health Data Sci. 2023;3:0078.
12. Weissenbacher D, Banda J, Davydova V, et al. Overview of the seventh Social Media Mining for Health Applications (#SMM4H) shared tasks at COLING 2022. In: Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task. Association for Computational Linguistics; 2022:221-241.
13. Magge A, Klein A, Miranda-Escalada A, et al. Overview of the sixth Social Media Mining for Health Applications (#SMM4H) shared tasks at NAACL 2021. In: Proceedings of the Sixth Social Media Mining for Health Applications Workshop & Shared Task. Association for Computational Linguistics; 2021:21-32.
14. Klein A, Alimova I, Flores I, et al. Overview of the fifth Social Media Mining for Health Applications (#SMM4H) shared tasks at COLING 2020. In: Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task. Association for Computational Linguistics; 2020:27-36.
15. Weissenbacher D, Sarker A, Magge A, et al. Overview of the fourth Social Media Mining for Health (SMM4H) shared tasks at ACL 2019. In: Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task. Association for Computational Linguistics; 2019:21-30.
16. Weissenbacher D, Sarker A, Paul MJ, et al. Overview of the third Social Media Mining for Health (SMM4H) shared tasks at EMNLP 2018. In: Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task. Association for Computational Linguistics; 2018:13-16.
17. Sarker A, Belousov M, Friedrichs J, et al. Data and systems for medication-related text classification and concept normalization from Twitter: insights from the Social Media Mining for Health (SMM4H)-2017 shared task. J Am Med Inform Assoc. 2018;25(10):1274-1283.
18. Chavda R, Makwana D, Patel V, et al. Shayona@SMM4H23: COVID-19 self diagnosis classification using BERT and LightGBM models. arXiv. 2401.02158. January 4, 2024, preprint.
19. Jiang Y, Qiu R, Zhang Y, et al. UQ at #SMM4H 2023: ALEX for public health analysis with social media. arXiv. 2309.04213. September 12, 2023, preprint.
20. Francis S, Moens MF. Text augmentations with R-drop for classification of tweets self reporting Covid-19. arXiv. 2311.03420. November 6, 2023, preprint.
21. Glazkova A. TMN at #SMM4H 2023: comparing text preprocessing techniques for detecting tweets self-reporting a COVID-19 diagnosis. arXiv. 2311.00732. November 1, 2023, preprint.
22. Müller M, Salathé M, Kummervold PE. COVID-Twitter-BERT: a natural language processing model to analyse COVID-19 content on Twitter. Front Artif Intell. 2023;6:1023281.
23. Liu Y, Ott M, Goyal N, et al. RoBERTa: a robustly optimized BERT pretraining approach. arXiv. 1907.11692. July 26, 2019, preprint: not peer reviewed.
24. Barbieri F, Camacho-Collados J, Anke LE, et al. TweetEval: unified benchmark and comparative evaluation for tweet classification. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics; 2020:1644-1650.
25. Yue X, Wang X, He Y, et al. Explorers at #SMM4H 2023: enhancing BERT for health applications through knowledge and model fusion. arXiv. 2312.10652. December 17, 2023, preprint.
26. Nguyen DQ, Vu T, Nguyen AT. BERTweet: a pre-trained language model for English tweets. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics; 2020:9-14.
27. Gururangan S, Marasović A, Swayamdipta S, et al. Don’t stop pretraining: adapt language models to domains and tasks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2020:8342-8360.
28. Nayel H, Ashraf N, Aldawsari M. BFCI at #SMM4H 2023: integration of machine learning and TF-IDF for Covid-19 tweets analysis. medRxiv. 2023.11.18.23297862. November 20, 2023, preprint.
29. Kanagasabai R, Veeramani A. A generic NLI approach for classification of sentiment associated with therapies. arXiv. 2312.03737. November 28, 2023, preprint.
30. Brown TB, Mann B, Ryder N, et al. Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Association for Computing Machinery; 2020:1877-1901.
31. Singh S, Bedi J. ThaparUni at #SMM4H 2023: synergistic ensemble of RoBERTa, XLNet, and ERNIE 2.0 for enhanced textual analysis. medRxiv. 2023.11.10.23298362. November 13, 2023, preprint.
32. Zanwar S, Wiechmann D, Qiao Y, et al. MANTIS at #SMM4H 2023: leveraging hybrid and ensemble models for detection of social anxiety disorder on Reddit. medRxiv. 2023.12.05.23299439. December 5, 2023, preprint.
33. Ke G, Meng Q, Finley T, et al. LightGBM: a highly efficient gradient boosting decision tree. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Association for Computing Machinery; 2017:3149-3157.
34. Sun Y, Wang S, Li Y, et al. ERNIE 2.0: a continual pre-training framework for language understanding. In: Proceedings of the AAAI Conference on Artificial Intelligence. Association for the Advancement of Artificial Intelligence; 2020:8968-8975.
35. Yang Z, Dai Z, Yang Y, et al. XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. Association for Computing Machinery; 2019:5753-5763.
36. Ji S, Zhang T, Ansari L, et al. MentalBERT: publicly available pre-trained language models for mental healthcare. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference. European Language Resources Association; 2022:7184-7190.
37. Vajre V, Naylor M, Kamath U, et al. PsychBERT: a mental health language model for social media mental health behavioral analysis. In: Proceedings of the 2021 IEEE International Conference on Bioinformatics and Biomedicine. Institute of Electrical and Electronics Engineers; 2021:1077-1082.
38. Li J, Fei H, Liu J, et al. Unified named entity recognition as word-word relation classification. In: Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence. Association for the Advancement of Artificial Intelligence; 2022:10965-10973.
39. Cañete J, Chaperon G, Fuentes R, et al. Spanish pre-trained BERT model and evaluation data. In: Proceedings of the Practical Machine Learning for Developing Countries Workshop. 2020.
40. Conneau A, Rinott R, Lample G, et al. XNLI: evaluating cross-lingual sentence representations. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2018:2475-2485.
41. Yazdani A, Rouhizadeh H, Alvarez DV, et al. DS4DH at #SMM4H 2023: zero-shot adverse drug events normalization using sentence transformers and reciprocal-rank fusion. arXiv. 2308.12877. September 27, 2023, preprint.
42. Devlin J, Chang MW, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics; 2019:4171-4186.
43. Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21(1):5485-5551.
44. Deka P, Jurek-Loughrey A, Deepak P. Improved methods to aid unsupervised evidence-based fact checking for online health news. J Data Intell. 2022;3(4):474-504.
45. Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics; 2019:3982-3992.
46. Cormack GV, Clarke CLA, Buettcher S. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery; 2009:758-759.
47. Klein AZ, Gutiérrez Gómez JA, Levine LD, et al. Using longitudinal Twitter data for digital epidemiology of childhood health outcomes: an annotated data set and deep neural network classifiers. J Med Internet Res. In press.
48. Klein AZ, Magge A, O’Connor K, et al. Automatically identifying Twitter users for interventions to support dementia family caregivers: annotated data set and benchmark classification models. JMIR Aging. 2022;5(3):e39547.
49. Klein AZ, Magge A, Gonzalez-Hernandez G. ReportAGE: automatically extracting the exact age of Twitter users based on self-reports in tweets. PLoS One. 2022;17(1):e0262087.
50. Klein AZ, Kunatharaju S, Golder S, et al. Association between COVID-19 during pregnancy and preterm birth by trimester of infection: a retrospective cohort study using longitudinal social media data. medRxiv. 2023.11.17.23298696. November 21, 2023, preprint: not peer reviewed.
51. Zolnoori M, Patrick TB, Fung KW, et al. Development of an adverse drug reaction corpus from consumer health posts for psychiatric medications. In: Proceedings of the 2nd Social Media Mining for Health Research and Applications Workshop. CEUR Workshop Proceedings; 2017:19-26.
52. Sarker H, Dhuliawala M, Fay N, et al. vExplorer: a search method to find relevant YouTube videos for health researchers. In: Proceedings of the 2nd Social Media Mining for Health Research and Applications Workshop. CEUR Workshop Proceedings; 2017:32-39.
53. Pless R, Begtrup R, Alkulaib L, et al. Recognizing images of eating disorders in social media. In: Proceedings of the 2nd Social Media Mining for Health Research and Applications Workshop. CEUR Workshop Proceedings; 2017:42.
54. Skeppstedt M, Stede M, Kerren A. Stance-taking in topics extracted from vaccine-related tweets and discussion forum posts. In: Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task. Association for Computational Linguistics; 2018:5-8.
55. Dirkson A, Verberne S, Kraaij W. Conversation-aware filtering of online patient forum messages. In: Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task. Association for Computational Linguistics; 2020:11-18.
56. Chan JZM, Kunneman F, Morante R, et al. Leveraging social media as a source for clinical guidelines: a demarcation of experiential knowledge. In: Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task. Association for Computational Linguistics; 2022:203-208.
57. Romberg J, Dyczmons J, Borgmann SO, et al. Annotating patient information needs in online diabetes forums. In: Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task. Association for Computational Linguistics; 2020:19-26.
58. Moßburger L, Wende F, Brinkmann K, et al. Exploring online depression forums via text mining: a comparison of Reddit and a curated online forum. In: Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task. Association for Computational Linguistics; 2020:70-81.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/pages/standard-publication-reuse-rights)