Ari Z Klein, Juan M Banda, Yuting Guo, Ana Lucia Schmidt, Dongfang Xu, Ivan Flores Amaro, Raul Rodriguez-Esteban, Abeed Sarker, Graciela Gonzalez-Hernandez, Overview of the 8th Social Media Mining for Health Applications (#SMM4H) shared tasks at the AMIA 2023 Annual Symposium, Journal of the American Medical Informatics Association, Volume 31, Issue 4, April 2024, Pages 991–996, https://doi.org/10.1093/jamia/ocae010
Abstract
The aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to address the natural language processing and machine learning challenges inherent to utilizing social media data for health informatics. In this paper, we present the annotated corpora, a technical summary of participants’ systems, and the performance results.
The eighth iteration of the #SMM4H shared tasks was hosted at the AMIA 2023 Annual Symposium and consisted of 5 tasks that represented various social media platforms (Twitter and Reddit), languages (English and Spanish), methods (binary classification, multi-class classification, extraction, and normalization), and topics (COVID-19, therapies, social anxiety disorder, and adverse drug events).
In total, 29 teams registered, representing 17 countries. In general, the top-performing systems used deep neural network architectures based on pre-trained transformer models. In particular, the top-performing systems for the classification tasks were based on single models that were pre-trained on social media corpora.
To facilitate future work, the datasets—a total of 61 353 posts—will remain available by request, and the CodaLab sites will remain active for a post-evaluation phase.
Background
With more than 70% of adults in the United States1 and nearly 60% of people worldwide2 using social media, the aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to address the natural language processing and machine learning challenges inherent to utilizing the vast amount of data on social media for health informatics. The eighth iteration of the #SMM4H shared tasks was hosted at the American Medical Informatics Association (AMIA) 2023 Annual Symposium and consisted of 5 tasks. Task 1 was a binary classification task to distinguish English-language tweets that self-report a COVID-19 diagnosis from those that do not.3 Task 2 was a multi-class classification task to categorize users’ sentiment toward therapies in English-language tweets as positive, negative, or neutral.4 Task 3 was a sequence labeling task to extract COVID-19 symptoms in tweets written in Latin American Spanish. Task 4 was a binary classification task to distinguish English-language Reddit posts that self-report a social anxiety disorder diagnosis from those that do not. Task 5 was a normalization task to map adverse drug events (ADEs) in English-language tweets to their standard concept ID in the MedDRA vocabulary.5
Teams could register for a single task or multiple tasks. In total, 29 teams registered, representing 17 countries. Teams were provided with gold standard annotated training and validation sets to develop their systems and, subsequently, a blind test set for the final evaluation. After receiving the test set, teams were given 5 days to submit the predictions of their systems to CodaLab6–10—a platform that facilitates data science competitions—for automatic evaluation, promoting the systematic comparison of performance. Among the 29 teams that registered, 17 teams submitted at least one set of predictions: 8 teams for Task 1, 6 teams for Task 2, 2 teams for Task 3, 8 teams for Task 4, and 3 teams for Task 5. Teams were then invited to submit a short manuscript describing their system, and 12 of the 17 teams did. Each of these 12 system descriptions was peer-reviewed by at least 2 reviewers. In this article, we present the annotated corpora, a technical summary of the systems, and the performance results, providing insights into state-of-the-art methods for mining social media data for health informatics.
Methods
Data collection
We collected a total of 61 353 social media posts for the 5 tasks. For Task 1, the dataset included 10 000 English-language tweets that mentioned a personal reference to the user and keywords related to both COVID-19 and a positive test, diagnosis, or hospitalization.3 For Task 2, the dataset included 5364 English-language tweets that mentioned a total of 32 therapies, including medication-based, behavioral, and physical therapies.4 These tweets were posted by a cohort of users who self-reported chronic pain on Twitter,11 making it likely that the sentiments associated with the therapies were expressed by patients who were actually experiencing them. For Task 3, the dataset included 8861 tweets that were written in Latin American Spanish and reported COVID-19 symptoms of the user or someone known to the user, building on a previous iteration of this task that involved the multi-class classification of Spanish-language tweets mentioning COVID-19 symptoms.12 For Task 4, the dataset included 5140 English-language Reddit posts in the r/socialanxiety subreddit, written by users aged 13-25 years. For Task 5, the dataset included 29 449 English-language tweets that mentioned medications, many of which were drawn from previous iterations of this task.12–17
Annotation
For all 5 tasks, at least a subset of the social media posts was annotated by multiple annotators. Table 1 presents inter-annotator agreement (IAA) and the distribution of the posts in the training, validation, and test sets. For Task 1, 1728 (17.3%) of the tweets were labeled as self-reporting a COVID-19 diagnosis—a positive test, clinical diagnosis, or hospitalization—and 8272 (82.7%) as not.3 For Task 2, 998 (18.6%) of the tweets were labeled as positive sentiment toward a therapy, 619 (11.5%) as negative sentiment, and 3747 (69.9%) as neutral sentiment.4 For Task 3, among the 8861 tweets, 10 145 spans of text containing COVID-19 symptoms were annotated by medical doctors who are native speakers of Latin American Spanish. For Task 4, 2428 (38%) of the Reddit posts were labeled as self-reporting a confirmed or probable clinical social anxiety disorder diagnosis and 3962 (62%) as not. For Task 5, 2219 (7.5%) of the tweets were labeled as reporting an ADE and 27 230 (92.5%) as not. Among these 2219 tweets, 3021 spans of text containing an ADE were annotated and labeled with a corresponding MedDRA ID. Among the 1224 ADEs in the test set, 272 (22.2%) were not reported in the training or validation sets.
Table 1. Inter-annotator agreement (IAA) and distribution of social media posts in the training, validation, and test sets for the 5 #SMM4H 2023 shared tasks.

| Task | Platform | Language | Training | Validation | Test | Total | IAA |
|---|---|---|---|---|---|---|---|
| 1 | Twitter | English | 7600 | 400 | 2000 | 10 000 | 0.79^a |
| 2 | Twitter | English | 3009 | 753 | 1602 | 5364 | 0.82^b |
| 3 | Twitter | Spanish | 6021 | 1979 | 2150 | 10 150 | 0.88^b |
| 4 | Reddit | English | 4500 | 640 | 1250 | 6390 | 0.80^b |
| 5 | Twitter | English | 17 385 | 915 | 11 149 | 29 449 | 0.68^c |

^a Fleiss' Kappa. ^b Cohen's Kappa. ^c F1-score.
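For reference, pairwise agreement of the kind reported in Table 1 can be computed with scikit-learn; the sketch below is a minimal illustration with placeholder labels (Fleiss' Kappa, used for Task 1, would instead require a multi-rater implementation such as the one in statsmodels).

```python
# Minimal sketch of computing pairwise inter-annotator agreement with
# scikit-learn; the label lists are illustrative placeholders, not task data.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["ADE", "noADE", "noADE", "ADE", "noADE"]
annotator_2 = ["ADE", "noADE", "ADE", "ADE", "noADE"]

# Cohen's kappa corrects raw agreement for agreement expected by chance.
print(f"Cohen's kappa: {cohen_kappa_score(annotator_1, annotator_2):.2f}")
```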
Classification
For Task 1, the benchmark system3 and 5 of the 7 teams (Table 2 in Results) that submitted a system description (Shayona,18 UQ,19 IICU-DSRG, KUL,20 and TMN21) used a classifier based on COVID-Twitter-BERT—a transformer model pre-trained on tweets related to COVID-19.22 Two of these 5 teams used additional BERT-based models for the feature representation (IICU-DSRG) or ensemble learning (TMN21); IICU-DSRG used BioMed-RoBERTa-Base,23 and TMN21 used RoBERTa-Large23 and Twitter-RoBERTa-Base.24 One of the 2 teams that did not use COVID-Twitter-BERT22 (Explorers25) used an ensemble of RoBERTa-Base,23 CPM-RoBERTa, and BERTweet26 models, following 5-fold cross-validation for each individual model. They also used the tweets in the Task 1 and Task 4 training and validation sets for continued domain-adaptive pre-training of these models.27 The other team (BFCI28) used a Passive Aggressive classifier with bigrams as features in a bag-of-words representation. Three of the teams (UQ,19 KUL,20 and Explorers25) used techniques for addressing the class imbalance, including data augmentation (UQ19 and KUL20), over-/under-sampling (UQ19), and class weights (UQ19 and Explorers25).
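Fine-tuning a pre-trained transformer of this kind typically follows a standard sequence-classification recipe. The sketch below illustrates it for COVID-Twitter-BERT with the Hugging Face transformers library, using the hyperparameters reported for the benchmark system in Results (5 epochs, batch size 8, learning rate 1e-5); the toy training examples are placeholders, and the benchmark itself was trained with the Flair library.

```python
# Sketch: fine-tuning COVID-Twitter-BERT for binary sequence classification
# with Hugging Face transformers. The two training examples are placeholders;
# the benchmark itself was trained with the Flair library.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "digitalepidemiologylab/covid-twitter-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

train_texts = ["just got my positive covid test back",   # placeholder data
               "covid case counts are rising again"]
train_labels = [1, 0]  # 1 = self-reports a diagnosis, 0 = does not

class TweetDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Hyperparameters reported for the benchmark system: 5 epochs, batch size 8,
# learning rate 1e-5.
args = TrainingArguments(output_dir="ctbert-task1", num_train_epochs=5,
                         per_device_train_batch_size=8, learning_rate=1e-5)
Trainer(model=model, args=args,
        train_dataset=TweetDataset(train_texts, train_labels)).train()
```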
Table 2. Performance of benchmark systems and 18 teams' systems on the test sets for the 5 #SMM4H 2023 shared tasks, with the best performance for each task in bold.

| Team | Task 1^a | Task 2^b | Task 3^c | Task 4^d | Task 5a^e | Task 5b^f |
|---|---|---|---|---|---|---|
| ABCD | — | — | — | — | 0.188 | 0.089 |
| BFCI | 0.637 | 0.714 | — | — | — | — |
| CEN | — | — | — | 0.718 | — | — |
| DS4DH | — | — | — | — | 0.426 | 0.292 |
| Explorers | 0.872 | — | **0.94** | 0.695 | — | — |
| HULAT-UC3M | — | 0.669 | — | — | — | — |
| I2R-MI | — | 0.752 | — | — | 0.322 | 0.195 |
| IICU-DSRG | 0.898 | — | — | — | — | — |
| ITT | — | — | — | 0.728 | — | — |
| KUL | 0.877 | — | — | — | — | — |
| Mantis | — | — | — | 0.838 | — | — |
| MUCS | 0.741 | 0.185 | — | — | — | — |
| Shayona | **0.943** | — | — | 0.810 | — | — |
| ThaparUni | — | — | — | 0.842 | — | — |
| TMN | 0.845 | — | — | — | — | — |
| UQ | **0.943** | **0.778** | — | **0.871** | — | — |
| Unknown-1^g | — | 0.695 | — | 0.841 | — | — |
| Unknown-2^g | — | — | 0.89 | — | — | — |
| Task 1 Benchmark | 0.938 | — | — | — | — | — |
| Task 3 Benchmark (BETO) | — | — | 0.90 | — | — | — |
| Task 3 Benchmark (COVID-Twitter-BERT) | — | — | 0.92 | — | — | — |
| Task 5 Benchmark | — | — | — | — | **0.452** | **0.363** |

^a F1-score for the class of tweets that self-reported a COVID-19 diagnosis.
^b Micro-averaged F1-score for all 3 classes of tweets.
^c Strict F1-score for identifying the character offsets of COVID-19 symptoms.
^d F1-score for the class of Reddit posts that self-reported a social anxiety disorder diagnosis.
^e Micro-averaged F1-score for all MedDRA IDs.
^f Micro-averaged F1-score for MedDRA IDs that were not seen during training.
^g CodaLab submission could not be cross-referenced with a registered team name.
For Task 2, 2 of the 4 teams that submitted a system description (UQ19 and I2R-MI29) used BERT-based classifiers: BERTweet-Large26 (UQ19) and RoBERTa-Base23 (I2R-MI29). These teams also used techniques for addressing the class imbalance, including data augmentation (UQ19), over-sampling (UQ19), under-sampling (UQ19 and I2R-MI29), and class weights (UQ19). Furthermore, UQ19 used a large language model (LLM), GPT-3.5,30 to verify the predictions of the BERTweet-Large26 classifier, automatically changing the prediction to the majority (ie, neutral sentiment) class if the LLM did not find evidence in the tweet to support a classification of positive or negative sentiment. The 2 teams that did not use a BERT-based classifier used a Multilayer Perceptron classifier with 5-fold cross-validation (BFCI28) and a Random Forest classifier with feature engineering (HULAT-UC3M), respectively.
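UQ's verification step can be viewed as a post-hoc filter over the classifier's non-neutral predictions. The sketch below is an assumed reconstruction of how such a filter might look, using the openai Python client with a hypothetical prompt; it is not UQ's actual implementation.

```python
# Sketch of LLM-based verification of classifier predictions (an assumed
# reconstruction, not UQ's actual prompt or code), using the openai client.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def verify_sentiment(tweet: str, predicted: str) -> str:
    """Keep a positive/negative prediction only if the LLM finds evidence."""
    if predicted == "neutral":
        return predicted
    prompt = (f"Does the following tweet express {predicted} sentiment "
              f"toward a therapy? Answer yes or no.\nTweet: {tweet}")
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    # Revert to the majority (neutral) class if the LLM finds no evidence.
    return predicted if reply.strip().lower().startswith("yes") else "neutral"
```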
For Task 4, all 5 of the teams that submitted a system description (UQ,19 ThaparUni,31 Mantis,32 Shayona,18 and Explorers25) used BERT-based models, with 2 of the teams (UQ19 and Explorers25) using BERTweet26 models. As in Task 2, UQ19 used techniques for addressing the class imbalance and an LLM to correct the predictions of the BERTweet26 classifier. As in Task 1, Explorers25 used BERTweet26 in an ensemble with RoBERTa-Base23 and CPM-RoBERTa models, domain-adaptive pre-training,27 and techniques for addressing the class imbalance. Also as in Task 1, Shayona18 used a COVID-Twitter-BERT22 model with gradient boosting.33 Along with Explorers,25 the other 2 teams (ThaparUni31 and Mantis32) used an ensemble of pre-trained transformer models; while ThaparUni31 used RoBERTa,23 ERNIE 2.0,34 and XLNet35 models, which were pre-trained on general-domain corpora, Mantis32 used MentalRoBERTa36 and PsychBERT37 models, which were pre-trained on corpora related to mental health, including Reddit posts.
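One common way to combine such models is soft voting over averaged class probabilities; the sketch below is a minimal illustration under that assumption (the checkpoint paths are placeholders, and the teams' actual combination schemes may differ).

```python
# Sketch: soft-voting ensemble over several fine-tuned transformer
# classifiers. The checkpoint paths are placeholders, and soft voting is
# one common combination scheme, not necessarily what a given team used.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINTS = ["./bertweet-ft", "./roberta-ft", "./psychbert-ft"]

def ensemble_predict(text: str) -> int:
    probs = []
    for ckpt in CHECKPOINTS:
        tok = AutoTokenizer.from_pretrained(ckpt)
        model = AutoModelForSequenceClassification.from_pretrained(ckpt)
        model.eval()
        with torch.no_grad():
            logits = model(**tok(text, return_tensors="pt")).logits
        probs.append(torch.softmax(logits, dim=-1))
    # Average class probabilities across models and take the argmax.
    return int(torch.stack(probs).mean(dim=0).argmax())
```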
Extraction
For Task 3, the one team that submitted a system description (Explorers25) used the W2NER framework38 for named entity recognition and an ensemble of 3 Spanish BERT-based models: Spanish BERT (BETO),39 BETO+NER, and a version of BETO that was fine-tuned on the Spanish portion of the XNLI corpus.40 As in Task 1 and Task 4, they also used the tweets in the training and validation sets for continued domain-adaptive pre-training27 of these models. Two benchmark systems used BETO39 and COVID-Twitter-BERT22 models. For Task 5, the benchmark system5 and the 2 teams that submitted a system description (DS4DH41 and I2R-MI29) also used pre-trained transformer models, in a pipeline approach in which extracted ADEs serve as the input for mapping the tweets to MedDRA terms: BERT42 (benchmark system5), BERTweet26 (DS4DH41), and T543 (I2R-MI29).
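While W2NER reformulates NER as word-pair relation classification, span extraction of this kind is more commonly implemented as BIO-style token classification; the following minimal inference sketch, with a hypothetical fine-tuned checkpoint, shows how the character offsets of extracted spans are obtained for the strict evaluation.

```python
# Sketch: extracting symptom spans with a fine-tuned token-classification
# model via the transformers pipeline. The checkpoint path is a placeholder.
from transformers import pipeline

ner = pipeline("token-classification", model="./beto-symptoms-ft",
               aggregation_strategy="simple")  # merge B-/I- pieces into spans

for span in ner("me duele la cabeza y tengo fiebre"):
    # Each span carries its text and character offsets, which is what the
    # strict evaluation for Task 3 compares against the gold annotations.
    print(span["word"], span["start"], span["end"])
```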
Normalization
For Task 5, the 2 teams that submitted a system description (DS4DH41 and I2R-MI29) took similar approaches to ADE normalization, using similarity metrics to compare the vector representations of extracted ADEs to MedDRA terms, based on pre-trained sentence transformer models. DS4DH41 used multiple sentence transformers to represent each extracted ADE—S-PubMedBERT,44 All-MPNet-Base-v2,45 All-DistilRoBERTa-v1,45 All-MiniLM-L6-v2,45 and a custom sentence transformer that integrates knowledge from the Unified Medical Language System (UMLS) Metathesaurus—and aggregated their similarity scores via reciprocal-rank fusion,46 while I2R-MI29 used only a single SBERT45 model for the similarity ranking. In contrast, the benchmark system5 took a learning-based approach, using a BERT42 model for multi-class classification.
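A minimal sketch of this similarity-based normalization follows, assuming a small placeholder list of MedDRA terms, two general-purpose sentence transformers, and the reciprocal-rank fusion constant k=60 that is conventional in the literature; the teams' actual model choices and term inventories differ.

```python
# Sketch: normalizing an extracted ADE by cosine similarity over sentence-
# transformer embeddings, fusing the rankings of two models with reciprocal-
# rank fusion (RRF). The term list and model choices are placeholders.
from sentence_transformers import SentenceTransformer, util

MEDDRA_TERMS = ["nausea", "somnolence", "headache"]  # placeholder inventory
MODELS = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]

def normalize(ade: str, k: int = 60) -> str:
    scores = {term: 0.0 for term in MEDDRA_TERMS}
    for name in MODELS:
        model = SentenceTransformer(name)
        sims = util.cos_sim(model.encode(ade, convert_to_tensor=True),
                            model.encode(MEDDRA_TERMS,
                                         convert_to_tensor=True))[0].tolist()
        # Rank terms by similarity and accumulate RRF scores (k=60 is the
        # constant conventionally used with RRF).
        ranking = sorted(range(len(MEDDRA_TERMS)), key=lambda i: -sims[i])
        for rank, idx in enumerate(ranking, start=1):
            scores[MEDDRA_TERMS[idx]] += 1.0 / (k + rank)
    return max(scores, key=scores.get)

print(normalize("felt really sleepy all day"))
```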
Results and discussion
Table 2 presents the performance of 18 teams’ systems on the test sets for the 5 tasks. For Task 1, the evaluation metric was the F1-score for the class of tweets that self-reported a COVID-19 diagnosis. Shayona18 and UQ19 achieved nearly identical F1-scores (0.943) using a COVID-Twitter-BERT22 model, with Shayona18 marginally outperforming UQ19; however, neither gradient boosting33 (Shayona18) nor techniques for addressing the class imbalance (UQ19) substantially improved performance over a COVID-Twitter-BERT benchmark classifier3 (0.938), which was trained using the Flair library with 5 epochs, a batch size of 8, and a learning rate of 1e-5. While the recall of Shayona’s18 (0.938) and UQ’s19 (0.950) systems outperformed that of the benchmark system (0.914), the benchmark system3 achieved the highest precision (0.962) among all of the systems. These 3 systems outperformed IICU-DSRG’s and TMN’s21 systems that used COVID-Twitter-BERT22 with additional BERT-based models. While KUL20 did not use additional models, the lower performance of their system (0.877) may be a result of the variation in hyperparameters used for fine-tuning COVID-Twitter-BERT22: 10 epochs, a batch size of 32, and a learning rate of 5e-5. All of the systems that used BERT-based classifiers substantially outperformed BFCI’s28 Passive Aggressive classifier (0.637).
For Task 2, the evaluation metric was the micro-averaged F1-score for all 3 classes of tweets: positive sentiment, negative sentiment, and neutral sentiment. UQ19 achieved not only the highest micro-averaged F1-score (0.778), but also the highest F1-score for each of the positive (0.611), negative (0.415), and neutral (0.859) classes. An ablation study by UQ19 showed that the use of an LLM30 to verify the predictions of the BERTweet-Large26 classifier improved the micro-averaged F1-score by more than 3 points, contributing to outperforming I2R-MI's29 RoBERTa-Base23 classifier (0.752). UQ's19 and I2R-MI's29 BERT-based classifiers outperformed BFCI's28 Multilayer Perceptron (0.714) and HULAT-UC3M's Random Forest (0.669) classifiers.
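For multi-class single-label data such as this, the micro-averaged F1-score aggregates true and false positives over all 3 classes (and therefore equals accuracy); a minimal illustration with placeholder labels:

```python
# Micro-averaged F1 over the three sentiment classes with scikit-learn;
# the label lists are illustrative placeholders.
from sklearn.metrics import f1_score

gold = ["positive", "neutral", "negative", "neutral"]
pred = ["positive", "neutral", "neutral", "neutral"]
print(f1_score(gold, pred, average="micro"))  # 0.75
```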
For Task 3, the evaluation metric was the strict F1-score for identifying the character offsets of COVID-19 symptoms. Explorers’25 ensemble of 3 Spanish BERT-based models and continued domain-adaptive pre-training27 of these models achieved not only a higher F1-score (0.94) than that of BETO39 (0.90) and COVID-Twitter-BERT22 (0.92) benchmark systems, but also a higher precision (0.94) and recall (0.93). Despite the high performance of Explorers’25 system, their error analysis revealed that colloquial/regional spelling variants of symptoms remain a challenge; for example, their system was able to identify gripe, but not gripa and gripes.
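Under strict matching, a predicted span counts as correct only if its character offsets exactly match a gold span; the metric can be sketched as follows, assuming spans are represented as (start, end) tuples.

```python
# Sketch: strict F1 over character offsets. Gold and predicted spans are
# assumed to be sets of (start, end) tuples; only exact matches count.
def strict_f1(gold: set, pred: set) -> float:
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One exact match and one off-by-one boundary: precision = recall = 0.5.
print(strict_f1({(3, 9), (15, 21)}, {(3, 9), (16, 21)}))  # 0.5
```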
For Task 4, the evaluation metric was the F1-score for the class of Reddit posts that self-reported a social anxiety disorder diagnosis. Using the same methods as in Task 2, UQ19 also achieved the highest F1-score (0.871) for Task 4. Like UQ,19 Explorers25 also used BERTweet26 and techniques for addressing the class imbalance, though they used it in an ensemble with 2 additional BERT-based models and achieved a substantially lower F1-score (0.695). ThaparUni's31 ensemble of 3 transformer models pre-trained on general-domain corpora (0.842)—RoBERTa,23 ERNIE 2.0,34 and XLNet35—marginally outperformed Mantis'32 ensemble of 2 transformer models pre-trained on corpora related to mental health, including Reddit posts (0.838)—MentalRoBERTa36 and PsychBERT.37 Both of these ensemble-based systems outperformed Shayona's18 COVID-Twitter-BERT22 classifier (0.810), suggesting that this model may not generalize to non-COVID-19 diagnoses.
For Task 5, in contrast to previous iterations of this task that used a pipeline-based evaluation,12–17 the evaluation did not include the output of classification or extraction, enabling novel approaches. The primary evaluation metric was the micro-averaged F1-score for all MedDRA IDs in the test set. A secondary evaluation metric was based on a zero-shot learning setup—that is, for MedDRA IDs in the test set that were not seen during training. While DS4DH41 achieved the highest overall precision (0.449) and I2R-MI29 achieved the highest recall for the unseen ADEs (0.406), the benchmark system5 achieved the highest overall recall (0.508) and F1-score (0.452) and the highest precision (0.335) and F1-score (0.363) for the unseen ADEs. In contrast to DS4DH's41 and I2R-MI's29 2-component pipelines, the benchmark system's pipeline included an initial binary classification component to detect tweets that mention an ADE, pre-filtering the tweets for downstream extraction and normalization.5
In general, the top-performing team for each of the 5 tasks used a deep neural network architecture based on pre-trained transformer models. In total, 10 of the 12 teams that submitted a system description used pre-trained transformer models. The 2 teams that used feature-engineered algorithms were outperformed by all of the teams that used pre-trained transformer models. In particular, the top-performing team for each of the 3 classification tasks used a model that was pre-trained on a social media corpus. While 5 of the 10 teams that participated in the classification tasks used an ensemble-based system, the top-performing system for each of these 3 tasks was based on a single model. In addition, the techniques that 4 of these 10 teams used for addressing the class imbalance did not appear to substantially improve performance. In contrast, for the second iteration of the #SMM4H shared tasks, which was hosted at the AMIA 2017 Annual Symposium, deep neural networks were used in only approximately half of the systems and were outperformed by Support Vector Machine (SVM) and Logistic Regression classifiers on highly imbalanced data.17 Although none of the teams used SVMs for the #SMM4H 2023 shared tasks, our recent work47–49 has demonstrated that BERT-based classifiers outperform SVM classifiers for health-related tasks using social media data.
The high level of performance that can be achieved with recent advances in deep learning and pre-trained transformer models enables the large-scale use of social media as a complementary source of real-world data for research applications. For example, the benchmark classifier (F1-score = 0.938) that was developed for Task 13 was deployed on the timelines of more than 67 000 users who had reported their pregnancy on Twitter, for a retrospective cohort study that used self-reports in longitudinal social media data to assess the association between COVID-19 infection during pregnancy and preterm birth.50 This cohort study also points to the fact that, despite recent changes to the accessibility of data through the Twitter and Reddit APIs, data that were previously collected continue to have utility. For collecting new data, however, these changes may motivate researchers to explore additional platforms that have been proposed in the #SMM4H Workshop, such as WebMD,12 Ask a Patient,51 YouTube,52 Tumblr,53 Mumsnet,54 MedHelp,55 Facebook,55,56 Diabetes Info,57 and Beyond Blue.58
Conclusion
This article presented an overview of the #SMM4H 2023 shared tasks, which represented various social media platforms (Twitter and Reddit), languages (English and Spanish), methods (binary classification, multi-class classification, extraction, and normalization), and topics (COVID-19, therapies, social anxiety disorder, and ADEs). To facilitate future work, the datasets will remain available by request, and the CodaLab sites6–10 will remain active to automatically evaluate new systems against the blind test sets, promoting the ongoing systematic comparison of performance.
Acknowledgments
The authors thank those who contributed to annotating the data, the program committee of the #SMM4H 2023 Workshop, and additional peer reviewers of the system description papers.
Author contributions
A.Z.K.: data collection, annotation, benchmark system, system description review, and wrote the manuscript. J.M.B.: shared task conceptualization, data collection, annotation, benchmark system, CodaLab site, system description review, and wrote the manuscript. Y.G.: data collection, annotation, CodaLab site, system description review, and wrote the manuscript. A.L.S.: data collection, annotation, CodaLab site, system description review, and wrote the manuscript. D.X.: data collection, annotation, benchmark system, CodaLab site, system description review, and wrote the manuscript. J.I.F.A.: data collection, CodaLab site, and edited the manuscript. R.R.-E.: shared task conceptualization and edited the manuscript. A.S.: shared task conceptualization and edited the manuscript. G.G.-H.: shared task conceptualization and guidance, and edited the manuscript.
Funding
A.Z.K., I.F.A., D.X., and G.G.-H. were supported in part by the National Library of Medicine (R01LM011176). Y.G. and A.S. were supported in part by the National Institute on Drug Abuse (R01DA057599). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. J.M.B. was supported in part by a Google Award for Inclusion Research (AIR).
Conflicts of interest
None declared.
Data availability
According to the Twitter Terms of Service, the content (eg, text) of Tweet Objects cannot be made publicly available; however, a limited number of Tweet Objects are permitted to be shared directly. Requests for data can be sent to A.Z.K. ([email protected]) or G.G.-H. ([email protected]).
References
CodaLab. SMM4H 2023 Task 1 – Binary classification of English tweets self-reporting a COVID-19 diagnosis. Accessed October 20, 2023.
CodaLab. SMM4H 2023 Task 2: Classification of sentiment associated with therapies (aspect-oriented). Accessed October 20, 2023. https://codalab.lisn.upsaclay.fr/competitions/12421
CodaLab. SMM4H 2023 Task 3 – Extraction of COVID-19 symptoms in Latin American Spanish tweets. Accessed October 20, 2023.
CodaLab. SMM4H23 Task 4 – Binary classification of posts self-reporting a social anxiety disorder diagnosis. Accessed October 20, 2023. https://codalab.lisn.upsaclay.fr/competitions/13160
CodaLab. SMM4H 2023 Task 5 – Normalization of adverse drug events in English tweets. Accessed October 20, 2023.