Saini Sonia, Agarwal Ruchi, Singh S P, Gupta Punit, Vidhyarthi Ankit, Verma Rohit
Associate Consultant, Tata Consultancy Services, Noida, India.
Professor, Computer Applications Department, JIMS Engineering Management Technical Campus, Greater Noida, India.
PLoS One. 2025 Jun 5;20(6):e0323449. doi: 10.1371/journal.pone.0323449. eCollection 2025.
Social media has given rise to an ever-connected world. Health data that was once confined to hospital or clinical records is now shared as text over social media, and information about pandemic outbreaks, clinical visit results, and general health updates is being analyzed. Data is shared more frequently and in more formats, including images, text, documents, and videos. With fast streaming systems and few constraints on storage, this shared rich media is both voluminous and informative. Shared health data poses a distinctive analysis challenge, whether it takes the form of discussions of ailments, hospital visits, general well-being updates, or drug research updates posted through the official Twitter handles of pharmaceutical companies and healthcare organizations. The text naming an ailment ranges from formal medical terminology to common names for the same condition, yet the intent, identifying the disease or ailment term, is the same. This paper focuses on extracting and analyzing health-related data exchanged on social media and introduces an Augmented Ensemble Model (AEM), which identifies frequently shared health topics and discussions on social networks in order to predict emerging health trends. The analytical model works with chronological datasets to classify text into health-related topics. This hybrid model uses text data augmentation to address class imbalance among health terms and employs a clustering technique for location-based aggregation. An algorithm for a Word Vector Embedding model of health terms is formulated, and this word vector model is then used in text data augmentation to reduce the class imbalance. We evaluate the accuracy of the classifiers by constructing a machine learning pipeline.
For the Augmented Ensemble Model, text classification accuracy is evaluated after augmentation using a voting ensemble technique, and higher accuracy is observed. Emerging health trends are analyzed through temporal classification and location-wise aggregation of the health terms. The model demonstrates that a text-augmented ensemble machine learning approach for health topics is more efficient than conventional machine learning classification techniques.
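The two ideas named in the abstract, augmentation to reduce class imbalance and voting across classifiers, can be illustrated with a minimal sketch. This is not the authors' implementation: the `SIMILAR_TERMS` table is a hypothetical stand-in for nearest neighbours from a health-term word vector model, the three keyword classifiers are toy placeholders for trained models, and all function names and sample texts are invented for illustration.

```python
import random
from collections import Counter

# Hypothetical neighbour table (assumption): in practice these pairs would
# come from a word-vector model trained on health text.
SIMILAR_TERMS = {
    "flu": ["influenza"],
    "fever": ["pyrexia"],
    "migraine": ["headache"],
}

def augment(text, rng):
    """Swap each known health term for a randomly chosen similar term."""
    tokens = text.lower().split()
    return " ".join(
        rng.choice(SIMILAR_TERMS[t]) if t in SIMILAR_TERMS else t for t in tokens
    )

def balance(samples, rng):
    """Add augmented copies of minority-class texts until every class
    has as many samples as the largest class."""
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)
    target = max(len(texts) for texts in by_label.values())
    balanced = list(samples)
    for label, texts in by_label.items():
        for i in range(target - len(texts)):
            balanced.append((augment(texts[i % len(texts)], rng), label))
    return balanced

def hard_vote(classifiers, text):
    """Return the label predicted by the most classifiers."""
    votes = [clf(text) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Toy, invented samples: three "infection" texts and one "neurological" text.
samples = [
    ("bad flu and fever this week", "infection"),
    ("flu season is here", "infection"),
    ("fever after the clinic visit", "infection"),
    ("migraine again today", "neurological"),
]
balanced = balance(samples, random.Random(0))
label_counts = Counter(label for _, label in balanced)

# Toy keyword classifiers standing in for trained models (assumption).
def clf_a(text):
    return "neurological" if ("headache" in text or "migraine" in text) else "infection"

def clf_b(text):
    return "infection" if ("flu" in text or "fever" in text) else "neurological"

def clf_c(text):
    return "infection"  # always predicts the majority class

prediction = hard_vote([clf_a, clf_b, clf_c], "terrible headache since morning")
```

In this sketch the minority class is topped up with term-swapped copies before any classifier sees the data, and the final label is whichever one most classifiers agree on, so a single biased classifier such as `clf_c` can be outvoted.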