Feng Yiqiang, Chen Ziao, Zhang Yuxin, Huang Wenyuan, Zhang Xuanming, He Siyu
School of Marxism, Sichuan Agricultural University, Chengdu, China.
College of Law, Sichuan Agricultural University, Yaan, China.
Front Public Health. 2025 Aug 12;13:1608241. doi: 10.3389/fpubh.2025.1608241. eCollection 2025.
Adolescent health has become a critical public health concern in the digital era, as social media platforms emerge as vital sources of real-time behavioral data for informing sustainable and equitable public health strategies. However, conventional topic modeling methods often struggle with the semantic sparsity and noise inherent in short-form texts. This study proposes BERTopic_Teen, an enhanced topic modeling framework optimized for adolescent health-related tweets. The model incorporates three key innovations: a Popularity Deviation Regularizer (PDR) that suppresses high-frequency generic terms and amplifies domain-specific vocabulary; a Dynamic Document Embedding Optimizer (DDEO) that adaptively selects the UMAP dimensionality based on silhouette scores; and a Probabilistic Reassignment Matrix (PRM) that reassigns outlier documents to relevant topic clusters. On a dataset of 64,441 tweets (61,039 successfully classified), BERTopic_Teen outperforms LDA, NMF, Top2Vec, and the original BERTopic on all key evaluation metrics. It achieves a 16.1% improvement in topic coherence (NPMI = 0.2184), higher topic diversity (TD = 0.9935), and lower perplexity (1.7214), indicating superior semantic clarity, topic distinctiveness, and modeling stability. These findings suggest that BERTopic_Teen offers a robust solution for extracting meaningful topics from social media data and advancing public health surveillance.
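To make two of the ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that outlier reassignment works by moving each unassigned document to its most probable topic when that probability clears a cutoff, and that topic diversity is the commonly used fraction of unique words among the top words of all topics. The abstract does not specify either definition, and the function names, the `min_prob` threshold, and the `-1` outlier label are hypothetical choices made for illustration.

```python
from typing import List, Sequence


def reassign_outliers(
    labels: Sequence[int],
    probs: Sequence[Sequence[float]],
    min_prob: float = 0.3,   # hypothetical cutoff, not taken from the paper
    outlier: int = -1,       # assumed label for unassigned documents
) -> List[int]:
    """Move each outlier document to its most probable topic.

    probs[i][k] is the (assumed) probability that document i belongs
    to topic k. Documents whose best probability is below min_prob
    stay marked as outliers.
    """
    new_labels = []
    for label, row in zip(labels, probs):
        if label == outlier and len(row) > 0:
            # Index of the topic with the highest probability for this document.
            best = max(range(len(row)), key=lambda k: row[k])
            if row[best] >= min_prob:
                label = best
        new_labels.append(label)
    return new_labels


def topic_diversity(topics: Sequence[Sequence[str]], top_k: int = 10) -> float:
    """Fraction of unique words among the top_k words of every topic.

    A value of 1.0 means no word is shared between topics.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("no topic words supplied")
    return len(set(top_words)) / len(top_words)


# Toy data, invented purely for illustration.
labels = [0, -1, 1, -1]
probs = [[0.9, 0.1], [0.2, 0.7], [0.1, 0.8], [0.15, 0.2]]
topics = [["vape", "nicotine", "school"], ["sleep", "screen", "school"]]

new_labels = reassign_outliers(labels, probs)
diversity = topic_diversity(topics, top_k=3)
```

In this toy run, the second document is reassigned to topic 1 because its best probability (0.7) clears the cutoff, while the fourth stays an outlier. The two topics share the word "school", so five of the six top words are unique. A real pipeline would take the probability matrix from the clustering step rather than from hand-written values.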