Ari Z Klein, Juan M Banda, Yuting Guo, Ana Lucia Schmidt, Dongfang Xu, Ivan Flores Amaro, Raul Rodriguez-Esteban, Abeed Sarker, Graciela Gonzalez-Hernandez, Overview of the 8th Social Media Mining for Health Applications (#SMM4H) shared tasks at the AMIA 2023 Annual Symposium, Journal of the American Medical Informatics Association, Volume 31, Issue 4, April 2024, Pages 991–996, https://doi.org/10.1093/jamia/ocae010
Abstract
The aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to address the natural language processing and machine learning challenges inherent to utilizing social media data for health informatics. In this paper, we present the annotated corpora, a technical summary of participants’ systems, and the performance results.
The eighth iteration of the #SMM4H shared tasks was hosted at the AMIA 2023 Annual Symposium and consisted of 5 tasks that represented various social media platforms (Twitter and Reddit), languages (English and Spanish), methods (binary classification, multi-class classification, extraction, and normalization), and topics (COVID-19, therapies, social anxiety disorder, and adverse drug events).
In total, 29 teams registered, representing 17 countries. In general, the top-performing systems used deep neural network architectures based on pre-trained transformer models. In particular, the top-performing systems for the classification tasks were based on single models that were pre-trained on social media corpora.
To facilitate future work, the datasets—a total of 61 353 posts—will remain available by request, and the CodaLab sites will remain active for a post-evaluation phase.
Background
With more than 70% of adults in the United States1 and nearly 60% of people worldwide2 using social media, the aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to address the natural language processing and machine learning challenges inherent to utilizing the vast amount of data on social media for health informatics. The eighth iteration of the #SMM4H shared tasks was hosted at the American Medical Informatics Association (AMIA) 2023 Annual Symposium and consisted of 5 tasks. Task 1 was a binary classification task to distinguish English-language tweets that self-report a COVID-19 diagnosis from those that do not.3 Task 2 was a multi-class classification task to categorize users’ sentiment toward therapies in English-language tweets as positive, negative, or neutral.4 Task 3 was a sequence labeling task to extract COVID-19 symptoms in tweets written in Latin American Spanish. Task 4 was a binary classification task to distinguish English-language Reddit posts that self-report a social anxiety disorder diagnosis from those that do not. Task 5 was a normalization task to map adverse drug events (ADEs) in English-language tweets to their standard concept ID in the MedDRA vocabulary.5
Teams could register for a single task or multiple tasks. In total, 29 teams registered, representing 17 countries. Teams were provided with gold standard annotated training and validation sets to develop their systems and, subsequently, a blind test set for the final evaluation. After receiving the test set, teams were given 5 days to submit the predictions of their systems to CodaLab6–10—a platform that facilitates data science competitions—for automatic evaluation, promoting the systematic comparison of performance. Among the 29 teams that registered, 17 teams submitted at least one set of predictions: 8 teams for Task 1, 6 teams for Task 2, 2 teams for Task 3, 8 teams for Task 4, and 3 teams for Task 5. Teams were then invited to submit a short manuscript describing their system, and 12 of the 17 teams did. Each of these 12 system descriptions was peer-reviewed by at least 2 reviewers. In this article, we present the annotated corpora, a technical summary of the systems, and the performance results, providing insights into state-of-the-art methods for mining social media data for health informatics.
Methods
Data collection
We collected a total of 61 353 social media posts for the 5 tasks. For Task 1, the dataset included 10 000 English-language tweets that mentioned a personal reference to the user and keywords related to both COVID-19 and a positive test, diagnosis, or hospitalization.3 For Task 2, the dataset included 5364 English-language tweets that mentioned a total of 32 therapies, including medication-based, behavioral, and physical therapies.4 These tweets were posted by a cohort of users who self-reported chronic pain on Twitter,11 making it likely that the sentiments associated with the therapies were expressed by patients who were actually experiencing them. For Task 3, the dataset included 8861 tweets that were written in Latin American Spanish and reported COVID-19 symptoms of the user or someone known to the user, building on a previous iteration of this task that involved the multi-class classification of Spanish-language tweets mentioning COVID-19 symptoms.12 For Task 4, the dataset included 5140 English-language Reddit posts in the r/socialanxiety subreddit, written by users aged 13-25 years. For Task 5, the dataset included 29 449 English-language tweets that mentioned medications, many of which were drawn from previous iterations of this task.12–17
Annotation
For all 5 tasks, at least a subset of the social media posts was annotated by multiple annotators. Table 1 presents inter-annotator agreement (IAA) and the distribution of the posts in the training, validation, and test sets. For Task 1, 1728 (17.3%) of the tweets were labeled as self-reporting a COVID-19 diagnosis—a positive test, clinical diagnosis, or hospitalization—and 8272 (82.7%) as not.3 For Task 2, 998 (18.6%) of the tweets were labeled as positive sentiment toward a therapy, 619 (11.5%) as negative sentiment, and 3747 (69.9%) as neutral sentiment.4 For Task 3, among the 8861 tweets, 10 145 spans of text containing COVID-19 symptoms were annotated by medical doctors who are native speakers of Latin American Spanish. For Task 4, 2428 (38%) of the Reddit posts were labeled as self-reporting a confirmed or probable clinical social anxiety disorder diagnosis and 3962 (62%) as not. For Task 5, 2219 (7.5%) of the tweets were labeled as reporting an ADE and 27 230 (92.5%) as not. Among these 2219 tweets, 3021 spans of text containing an ADE were annotated and labeled with a corresponding MedDRA ID. Among the 1224 ADEs in the test set, 272 (22.2%) were not reported in the training or validation sets.
Table 1. Inter-annotator agreement (IAA) and distribution of social media posts in the training, validation, and test sets for the 5 #SMM4H 2023 shared tasks.

| Task | Platform | Language | Training | Validation | Test | Total | IAA |
|---|---|---|---|---|---|---|---|
| 1 | Twitter | English | 7600 | 400 | 2000 | 10 000 | 0.79^a |
| 2 | Twitter | English | 3009 | 753 | 1602 | 5364 | 0.82^b |
| 3 | Twitter | Spanish | 6021 | 1979 | 2150 | 10 150 | 0.88^b |
| 4 | Reddit | English | 4500 | 640 | 1250 | 6390 | 0.80^b |
| 5 | Twitter | English | 17 385 | 915 | 11 149 | 29 449 | 0.68^c |

^a Fleiss' Kappa. ^b Cohen's Kappa. ^c F1-score.
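For reference, pairwise agreement of the kind reported in Table 1 can be computed with scikit-learn; the sketch below is a minimal illustration with placeholder labels (Fleiss' Kappa, used for Task 1, would instead require a multi-rater implementation such as the one in statsmodels).

```python
# Minimal sketch of computing pairwise inter-annotator agreement with
# scikit-learn; the label lists are illustrative placeholders, not task data.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["ADE", "noADE", "noADE", "ADE", "noADE"]
annotator_2 = ["ADE", "noADE", "ADE", "ADE", "noADE"]

# Cohen's kappa corrects raw agreement for agreement expected by chance.
print(f"Cohen's kappa: {cohen_kappa_score(annotator_1, annotator_2):.2f}")
```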
Classification
For Task 1, the benchmark system3 and 5 of the 7 teams (Table 2 in Results) that submitted a system description (Shayona,18 UQ,19 IICU-DSRG, KUL,20 and TMN21) used a classifier based on COVID-Twitter-BERT—a transformer model pre-trained on tweets related to COVID-19.22 Two of these 5 teams used additional BERT-based models for the feature representation (IICU-DSRG) or ensemble learning (TMN21); IICU-DSRG used BioMed-RoBERTa-Base,23 and TMN21 used RoBERTa-Large23 and Twitter-RoBERTa-Base.24 One of the 2 teams that did not use COVID-Twitter-BERT22 (Explorers25) used an ensemble of RoBERTa-Base,23 CPM-RoBERTa, and BERTweet26 models, following 5-fold cross-validation for each individual model. They also used the tweets in the Task 1 and Task 4 training and validation sets for continued domain-adaptive pre-training of these models.27 The other team (BFCI28) used a Passive Aggressive classifier with bigrams as features in a bag-of-words representation. Three of the teams (UQ,19 KUL,20 and Explorers25) used techniques for addressing the class imbalance, including data augmentation (UQ19 and KUL20), over-/under-sampling (UQ19), and class weights (UQ19 and Explorers25).
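Fine-tuning a pre-trained transformer of this kind typically follows a standard sequence-classification recipe. The sketch below illustrates it for COVID-Twitter-BERT with the Hugging Face transformers library, using the hyperparameters reported for the benchmark system in Results (5 epochs, batch size 8, learning rate 1e-5); the toy training examples are placeholders, and the benchmark itself was trained with the Flair library.

```python
# Sketch: fine-tuning COVID-Twitter-BERT for binary sequence classification
# with Hugging Face transformers. The two training examples are placeholders;
# the benchmark itself was trained with the Flair library.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "digitalepidemiologylab/covid-twitter-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

train_texts = ["just got my positive covid test back",   # placeholder data
               "covid case counts are rising again"]
train_labels = [1, 0]  # 1 = self-reports a diagnosis, 0 = does not

class TweetDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Hyperparameters reported for the benchmark system: 5 epochs, batch size 8,
# learning rate 1e-5.
args = TrainingArguments(output_dir="ctbert-task1", num_train_epochs=5,
                         per_device_train_batch_size=8, learning_rate=1e-5)
Trainer(model=model, args=args,
        train_dataset=TweetDataset(train_texts, train_labels)).train()
```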
Table 2. Performance of benchmark systems and 18 teams' systems on the test sets for the 5 #SMM4H 2023 shared tasks, with the best performance for each task in bold.

| Team | Task 1^a | Task 2^b | Task 3^c | Task 4^d | Task 5a^e | Task 5b^f |
|---|---|---|---|---|---|---|
| ABCD | — | — | — | — | 0.188 | 0.089 |
| BFCI | 0.637 | 0.714 | — | — | — | — |
| CEN | — | — | — | 0.718 | — | — |
| DS4DH | — | — | — | — | 0.426 | 0.292 |
| Explorers | 0.872 | — | **0.94** | 0.695 | — | — |
| HULAT-UC3M | — | 0.669 | — | — | — | — |
| I2R-MI | — | 0.752 | — | — | 0.322 | 0.195 |
| IICU-DSRG | 0.898 | — | — | — | — | — |
| ITT | — | — | — | 0.728 | — | — |
| KUL | 0.877 | — | — | — | — | — |
| Mantis | — | — | — | 0.838 | — | — |
| MUCS | 0.741 | 0.185 | — | — | — | — |
| Shayona | **0.943** | — | — | 0.810 | — | — |
| ThaparUni | — | — | — | 0.842 | — | — |
| TMN | 0.845 | — | — | — | — | — |
| UQ | **0.943** | **0.778** | — | **0.871** | — | — |
| Unknown-1^g | — | 0.695 | — | 0.841 | — | — |
| Unknown-2^g | — | — | 0.89 | — | — | — |
| Task 1 Benchmark | 0.938 | — | — | — | — | — |
| Task 3 Benchmark (BETO) | — | — | 0.90 | — | — | — |
| Task 3 Benchmark (COVID-Twitter-BERT) | — | — | 0.92 | — | — | — |
| Task 5 Benchmark | — | — | — | — | **0.452** | **0.363** |

^a F1-score for the class of tweets that self-reported a COVID-19 diagnosis.
^b Micro-averaged F1-score for all 3 classes of tweets.
^c Strict F1-score for identifying the character offsets of COVID-19 symptoms.
^d F1-score for the class of Reddit posts that self-reported a social anxiety disorder diagnosis.
^e Micro-averaged F1-score for all MedDRA IDs.
^f Micro-averaged F1-score for MedDRA IDs that were not seen during training.
^g CodaLab submission could not be cross-referenced with a registered team name.
For Task 2, 2 of the 4 teams that submitted a system description (UQ19 and I2R-MI29) used BERT-based classifiers: BERTweet-Large26 (UQ19) and RoBERTa-Base23 (I2R-MI29). These teams also used techniques for addressing the class imbalance, including data augmentation (UQ19), over-sampling (UQ19), under-sampling (UQ19 and I2R-MI29), and class weights (UQ19). Furthermore, UQ19 used a large language model (LLM), GPT-3.5,30 to verify the predictions of the BERTweet-Large26 classifier, automatically changing the prediction to the majority (ie, neutral sentiment) class if the LLM did not find evidence in the tweet to support a classification of positive or negative sentiment. The 2 teams that did not use a BERT-based classifier used a Multilayer Perceptron classifier with 5-fold cross-validation (BFCI28) and a Random Forest classifier with feature engineering (HULAT-UC3M), respectively.
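UQ's verification step can be viewed as a post-hoc filter over the classifier's non-neutral predictions. The sketch below is an assumed reconstruction of how such a filter might look, using the openai Python client with a hypothetical prompt; it is not UQ's actual implementation.

```python
# Sketch of LLM-based verification of classifier predictions (an assumed
# reconstruction, not UQ's actual prompt or code), using the openai client.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def verify_sentiment(tweet: str, predicted: str) -> str:
    """Keep a positive/negative prediction only if the LLM finds evidence."""
    if predicted == "neutral":
        return predicted
    prompt = (f"Does the following tweet express {predicted} sentiment "
              f"toward a therapy? Answer yes or no.\nTweet: {tweet}")
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    # Revert to the majority (neutral) class if the LLM finds no evidence.
    return predicted if reply.strip().lower().startswith("yes") else "neutral"
```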
For Task 4, all 5 of the teams that submitted a system description (UQ,19 ThaparUni,31 Mantis,32 Shayona,18 and Explorers25) used BERT-based models, with 2 of the teams (UQ19 and Explorers25) using BERTweet26 models. As in Task 2, UQ19 used techniques for addressing the class imbalance and an LLM to correct the predictions of the BERTweet26 classifier. As in Task 1, Explorers25 used BERTweet26 in an ensemble with RoBERTa-Base23 and CPM-RoBERTa models, domain-adaptive pre-training,27 and techniques for addressing the class imbalance. Also as in Task 1, Shayona18 used a COVID-Twitter-BERT22 model with gradient boosting.33 Along with Explorers,25 the other 2 teams (ThaparUni31 and Mantis32) used an ensemble of pre-trained transformer models; while ThaparUni31 used RoBERTa,23 ERNIE 2.0,34 and XLNet35 models, which were pre-trained on general-domain corpora, Mantis32 used MentalRoBERTa36 and PsychBERT37 models, which were pre-trained on corpora related to mental health, including Reddit posts.
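One common way to combine such models is soft voting over averaged class probabilities; the sketch below is a minimal illustration under that assumption (the checkpoint paths are placeholders, and the teams' actual combination schemes may differ).

```python
# Sketch: soft-voting ensemble over several fine-tuned transformer
# classifiers. The checkpoint paths are placeholders, and soft voting is
# one common combination scheme, not necessarily what a given team used.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINTS = ["./bertweet-ft", "./roberta-ft", "./psychbert-ft"]

def ensemble_predict(text: str) -> int:
    probs = []
    for ckpt in CHECKPOINTS:
        tok = AutoTokenizer.from_pretrained(ckpt)
        model = AutoModelForSequenceClassification.from_pretrained(ckpt)
        model.eval()
        with torch.no_grad():
            logits = model(**tok(text, return_tensors="pt")).logits
        probs.append(torch.softmax(logits, dim=-1))
    # Average class probabilities across models and take the argmax.
    return int(torch.stack(probs).mean(dim=0).argmax())
```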
Extraction
For Task 3, the one team that submitted a system description (Explorers25) used the W2NER framework38 for named entity recognition and an ensemble of 3 Spanish BERT-based models: Spanish BERT (BETO),39 BETO+NER, and a version of BETO that was fine-tuned on the Spanish portion of the XNLI corpus.40 As in Task 1 and Task 4, they also used the tweets in the training and validation sets for continued domain-adaptive pre-training27 of these models. Two benchmark systems used BETO39 and COVID-Twitter-BERT22 models. For Task 5, the benchmark system5 and the 2 teams that submitted a system description (DS4DH41 and I2R-MI29) also used pre-trained transformer models, in a pipeline approach in which extracted ADEs serve as the input for mapping the tweets to MedDRA terms: BERT42 (benchmark system5), BERTweet26 (DS4DH41), and T543 (I2R-MI29).
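While W2NER reformulates NER as word-pair relation classification, span extraction of this kind is more commonly implemented as BIO-style token classification; the following minimal inference sketch, with a hypothetical fine-tuned checkpoint, shows how the character offsets of extracted spans are obtained for the strict evaluation.

```python
# Sketch: extracting symptom spans with a fine-tuned token-classification
# model via the transformers pipeline. The checkpoint path is a placeholder.
from transformers import pipeline

ner = pipeline("token-classification", model="./beto-symptoms-ft",
               aggregation_strategy="simple")  # merge B-/I- pieces into spans

for span in ner("me duele la cabeza y tengo fiebre"):
    # Each span carries its text and character offsets, which is what the
    # strict evaluation for Task 3 compares against the gold annotations.
    print(span["word"], span["start"], span["end"])
```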
Normalization
For Task 5, the 2 teams that submitted a system description (DS4DH41 and I2R-MI29) took similar approaches to ADE normalization, using similarity metrics to compare the vector representations of extracted ADEs to MedDRA terms, based on pre-trained sentence transformer models. DS4DH41 used multiple sentence transformers to represent each extracted ADE—S-PubMedBERT,44 All-MPNet-Base-v2,45 All-DistilRoBERTa-v1,45 All-MiniLM-L6-v2,45 and a custom sentence transformer that integrates knowledge from the Unified Medical Language System (UMLS) Metathesaurus—and aggregated their similarity scores via reciprocal-rank fusion,46 while I2R-MI29 used only a single SBERT45 model for the similarity ranking. In contrast, the benchmark system5 took a learning-based approach, using a BERT42 model for multi-class classification.
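A minimal sketch of this similarity-based normalization follows, assuming a small placeholder list of MedDRA terms, two general-purpose sentence transformers, and the reciprocal-rank fusion constant k=60 that is conventional in the literature; the teams' actual model choices and term inventories differ.

```python
# Sketch: normalizing an extracted ADE by cosine similarity over sentence-
# transformer embeddings, fusing the rankings of two models with reciprocal-
# rank fusion (RRF). The term list and model choices are placeholders.
from sentence_transformers import SentenceTransformer, util

MEDDRA_TERMS = ["nausea", "somnolence", "headache"]  # placeholder inventory
MODELS = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]

def normalize(ade: str, k: int = 60) -> str:
    scores = {term: 0.0 for term in MEDDRA_TERMS}
    for name in MODELS:
        model = SentenceTransformer(name)
        sims = util.cos_sim(model.encode(ade, convert_to_tensor=True),
                            model.encode(MEDDRA_TERMS,
                                         convert_to_tensor=True))[0].tolist()
        # Rank terms by similarity and accumulate RRF scores (k=60 is the
        # constant conventionally used with RRF).
        ranking = sorted(range(len(MEDDRA_TERMS)), key=lambda i: -sims[i])
        for rank, idx in enumerate(ranking, start=1):
            scores[MEDDRA_TERMS[idx]] += 1.0 / (k + rank)
    return max(scores, key=scores.get)

print(normalize("felt really sleepy all day"))
```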
Results and discussion
Table 2 presents the performance of 18 teams’ systems on the test sets for the 5 tasks. For Task 1, the evaluation metric was the F1-score for the class of tweets that self-reported a COVID-19 diagnosis. Shayona18 and UQ19 achieved nearly identical F1-scores (0.943) using a COVID-Twitter-BERT22 model, with Shayona18 marginally outperforming UQ19; however, neither gradient boosting33 (Shayona18) nor techniques for addressing the class imbalance (UQ19) substantially improved performance over a COVID-Twitter-BERT benchmark classifier3 (0.938), which was trained using the Flair library with 5 epochs, a batch size of 8, and a learning rate of 1e-5. While the recall of Shayona’s18 (0.938) and UQ’s19 (0.950) systems outperformed that of the benchmark system (0.914), the benchmark system3 achieved the highest precision (0.962) among all of the systems. These 3 systems outperformed IICU-DSRG’s and TMN’s21 systems that used COVID-Twitter-BERT22 with additional BERT-based models. While KUL20 did not use additional models, the lower performance of their system (0.877) may be a result of the variation in hyperparameters used for fine-tuning COVID-Twitter-BERT22: 10 epochs, a batch size of 32, and a learning rate of 5e-5. All of the systems that used BERT-based classifiers substantially outperformed BFCI’s28 Passive Aggressive classifier (0.637).
For Task 2, the evaluation metric was the micro-averaged F1-score for all 3 classes of tweets: positive sentiment, negative sentiment, and neutral sentiment. UQ19 achieved not only the highest micro-averaged F1-score (0.778), but also the highest F1-score for each of the positive (0.611), negative (0.415), and neutral (0.859) classes. An ablation study by UQ19 showed that the use of an LLM30 to verify the predictions of the BERTweet-Large26 classifier improved the micro-averaged F1-score by more than 3 points, contributing to outperforming I2R-MI's29 RoBERTa-Base23 classifier (0.752). UQ's19 and I2R-MI's29 BERT-based classifiers outperformed BFCI's28 Multilayer Perceptron (0.714) and HULAT-UC3M's Random Forest (0.669) classifiers.
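For multi-class single-label data such as this, the micro-averaged F1-score aggregates true and false positives over all 3 classes (and therefore equals accuracy); a minimal illustration with placeholder labels:

```python
# Micro-averaged F1 over the three sentiment classes with scikit-learn;
# the label lists are illustrative placeholders.
from sklearn.metrics import f1_score

gold = ["positive", "neutral", "negative", "neutral"]
pred = ["positive", "neutral", "neutral", "neutral"]
print(f1_score(gold, pred, average="micro"))  # 0.75
```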
For Task 3, the evaluation metric was the strict F1-score for identifying the character offsets of COVID-19 symptoms. Explorers’25 ensemble of 3 Spanish BERT-based models and continued domain-adaptive pre-training27 of these models achieved not only a higher F1-score (0.94) than that of BETO39 (0.90) and COVID-Twitter-BERT22 (0.92) benchmark systems, but also a higher precision (0.94) and recall (0.93). Despite the high performance of Explorers’25 system, their error analysis revealed that colloquial/regional spelling variants of symptoms remain a challenge; for example, their system was able to identify gripe, but not gripa and gripes.
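Under strict matching, a predicted span counts as correct only if its character offsets exactly match a gold span; the metric can be sketched as follows, assuming spans are represented as (start, end) tuples.

```python
# Sketch: strict F1 over character offsets. Gold and predicted spans are
# assumed to be sets of (start, end) tuples; only exact matches count.
def strict_f1(gold: set, pred: set) -> float:
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One exact match and one off-by-one boundary: precision = recall = 0.5.
print(strict_f1({(3, 9), (15, 21)}, {(3, 9), (16, 21)}))  # 0.5
```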
For Task 4, the evaluation metric was the F1-score for the class of Reddit posts that self-reported a social anxiety disorder diagnosis. Using the same methods as in Task 2, UQ19 also achieved the highest F1-score (0.871) for Task 4. Like UQ,19 Explorers25 also used BERTweet26 and techniques for addressing the class imbalance, though they used it in an ensemble with 2 additional BERT-based models and achieved a substantially lower F1-score (0.695). ThaparUni's31 ensemble of 3 transformer models pre-trained on general-domain corpora (0.842)—RoBERTa,23 ERNIE 2.0,34 and XLNet35—marginally outperformed Mantis'32 ensemble of 2 transformer models pre-trained on corpora related to mental health, including Reddit posts (0.838)—MentalRoBERTa36 and PsychBERT.37 Both of these ensemble-based systems outperformed Shayona's18 COVID-Twitter-BERT22 classifier (0.810), suggesting that this model may not generalize to non-COVID-19 diagnoses.
For Task 5, in contrast to previous iterations of this task that used a pipeline-based evaluation,12–17 the evaluation did not include the output of classification or extraction, enabling novel approaches. The primary evaluation metric was the micro-averaged F1-score for all MedDRA IDs in the test set. A secondary evaluation metric was based on a zero-shot learning setup—that is, for MedDRA IDs in the test set that were not seen during training. While DS4DH41 achieved the highest overall precision (0.449) and I2R-MI29 achieved the highest recall for the unseen ADEs (0.406), the benchmark system5 achieved the highest overall recall (0.508) and F1-score (0.452) and the highest precision (0.335) and F1-score (0.363) for the unseen ADEs. In contrast to DS4DH's41 and I2R-MI's29 2-component pipelines, the benchmark system's pipeline included an initial binary classification component to detect tweets that mention an ADE, pre-filtering the tweets for downstream extraction and normalization.5
In general, the top-performing team for each of the 5 tasks used a deep neural network architecture based on pre-trained transformer models. In total, 10 of the 12 teams that submitted a system description used pre-trained transformer models. The 2 teams that used feature-engineered algorithms were outperformed by all of the teams that used pre-trained transformer models. In particular, the top-performing team for each of the 3 classification tasks used a model that was pre-trained on a social media corpus. While 5 of the 10 teams that participated in the classification tasks used an ensemble-based system, the top-performing system for each of these 3 tasks was based on a single model. In addition, the techniques that 4 of these 10 teams used for addressing the class imbalance did not appear to substantially improve performance. In contrast, for the second iteration of the #SMM4H shared tasks, which was hosted at the AMIA 2017 Annual Symposium, deep neural networks were used in only approximately half of the systems and were outperformed by Support Vector Machine (SVM) and Logistic Regression classifiers on highly imbalanced data.17 Although none of the teams used SVMs for the #SMM4H 2023 shared tasks, our recent work47–49 has demonstrated that BERT-based classifiers outperform SVM classifiers for health-related tasks using social media data.
The high level of performance that can be achieved with recent advances in deep learning and pre-trained transformer models enables the large-scale use of social media as a complementary source of real-world data for research applications. For example, the benchmark classifier (F1-score = 0.938) that was developed for Task 13 was deployed on the timelines of more than 67 000 users who had reported their pregnancy on Twitter, for a retrospective cohort study that used self-reports in longitudinal social media data to assess the association between COVID-19 infection during pregnancy and preterm birth.50 This cohort study also points to the fact that, despite recent changes to the accessibility of data through the Twitter and Reddit APIs, data that were previously collected continue to have utility. For collecting new data, however, these changes may motivate researchers to explore additional platforms that have been proposed in the #SMM4H Workshop, such as WebMD,12 Ask a Patient,51 YouTube,52 Tumblr,53 Mumsnet,54 MedHelp,55 Facebook,55,56 Diabetes Info,57 and Beyond Blue.58
Conclusion
This article presented an overview of the #SMM4H 2023 shared tasks, which represented various social media platforms (Twitter and Reddit), languages (English and Spanish), methods (binary classification, multi-class classification, extraction, and normalization), and topics (COVID-19, therapies, social anxiety disorder, and ADEs). To facilitate future work, the datasets will remain available by request, and the CodaLab sites6–10 will remain active to automatically evaluate new systems against the blind test sets, promoting the ongoing systematic comparison of performance.
Acknowledgments
The authors thank those who contributed to annotating the data, the program committee of the #SMM4H 2023 Workshop, and additional peer reviewers of the system description papers.
Author contributions
A.Z.K.: data collection, annotation, benchmark system, system description review, and wrote the manuscript. J.M.B.: shared task conceptualization, data collection, annotation, benchmark system, CodaLab site, system description review, and wrote the manuscript. Y.G.: data collection, annotation, CodaLab site, system description review, and wrote the manuscript. A.L.S.: data collection, annotation, CodaLab site, system description review, and wrote the manuscript. D.X.: data collection, annotation, benchmark system, CodaLab site, system description review, and wrote the manuscript. J.I.F.A.: data collection, CodaLab site, and edited the manuscript. R.R.-E.: shared task conceptualization and edited the manuscript. A.S.: shared task conceptualization and edited the manuscript. G.G.-H.: shared task conceptualization and guidance, and edited the manuscript.
Funding
A.Z.K., I.F.A., D.X., and G.G.-H. were supported in part by the National Library of Medicine (R01LM011176). Y.G. and A.S. were supported in part by the National Institute on Drug Abuse (R01DA057599). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. J.M.B. was supported in part by a Google Award for Inclusion Research (AIR).
Conflicts of interest
None declared.
Data availability
According to the Twitter Terms of Service, the content (eg, text) of Tweet Objects cannot be made publicly available; however, a limited number of Tweet Objects are permitted to be shared directly. Requests for data can be sent to A.Z.K. ([email protected]) or G.G.-H. ([email protected]).
References
CodaLab. SMM4H 2023 Task 1 – Binary classification of English tweets self-reporting a COVID-19 diagnosis. Accessed October 20, 2023.
CodaLab. SMM4H 2023 Task 2: Classification of sentiment associated with therapies (aspect-oriented). Accessed October 20, 2023. https://codalab.lisn.upsaclay.fr/competitions/12421
CodaLab. SMM4H 2023 Task 3 – Extraction of COVID-19 symptoms in Latin American Spanish tweets. Accessed October 20, 2023.
CodaLab. SMM4H23 Task 4 – Binary classification of posts self-reporting a social anxiety disorder diagnosis. Accessed October 20, 2023. https://codalab.lisn.upsaclay.fr/competitions/13160
CodaLab. SMM4H 2023 Task 5 – Normalization of adverse drug events in English tweets. Accessed October 20, 2023.