
Research on New Methods of Topic Mining and Topic Prediction for Medical Preprints on Emerging Infectious Diseases

Authors

Liang Zongjing, Kuang Yun, Liang Gongcheng, Li Zhijie, Jiang Mingfeng

Affiliations

School of Economics and Management, Guangxi Normal University, Guilin, CHN.

Library, Guilin Normal University, Guilin, CHN.

Publication information

Cureus. 2025 Jun 11;17(6):e85773. doi: 10.7759/cureus.85773. eCollection 2025 Jun.

Abstract

Background and purpose: To cope with the ongoing risk of sudden infectious disease outbreaks and to monitor research trends in real time, this paper proposes a prediction framework that combines public attention indicators with topic analysis of medical preprints. Because traditional topic prediction methods lag behind events, Google Trends data are introduced to improve the timeliness of prediction.

Methods: A web crawler was used to collect 18,060 COVID-19-related preprint abstracts from the medRxiv platform. Latent Dirichlet Allocation (LDA), an unsupervised probabilistic model, was used to extract the latent topic structure of the text. To analyze the dynamic relationship between research topic intensity and public attention, the Autoregressive Distributed Lag (ARDL) model, which can handle I(0) and I(1) time series simultaneously, was applied. Text preprocessing included word segmentation, stop word removal, lemmatization, and synonym standardization. Time series data were aggregated by week and log-transformed; the Augmented Dickey-Fuller (ADF) unit root test was used to assess stationarity, and non-stationary variables were differenced. The models were implemented in Python and EViews 10, respectively.

Results: LDA modeling identified seven major research topics. ARDL analysis showed a significant dynamic relationship between public search trends and topic intensity, and the model had good predictive performance.

Conclusion: This study combined LDA with an ARDL model to construct a real-time prediction method for tracking the evolution of medical preprint topics. The method has theoretical and practical significance for public health informatics and offers feasible predictive support for monitoring and preventing future infectious diseases.
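The time-series preparation named in the Methods (weekly aggregation, log transform, differencing of non-stationary variables) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the sample dates, and the use of log(1 + x) to keep zero-count weeks defined are all assumptions made for illustration. The ADF test itself is not implemented here; in practice a library routine would decide whether a series needs differencing.

```python
import math
from collections import defaultdict
from datetime import date


def weekly_counts(dates):
    """Count items per ISO (year, week), in chronological order."""
    counts = defaultdict(int)
    for d in dates:
        iso = d.isocalendar()          # (ISO year, ISO week, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))


def log_transform(values):
    """Log-transform a series. log(1 + x) is an assumption made here so
    that weeks with zero items stay defined; the source only says the
    data were 'logarithmized'."""
    return [math.log1p(v) for v in values]


def first_difference(values):
    """First difference, the usual treatment for an I(1) series."""
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical posting dates standing in for preprints on one topic.
posted = [
    date(2020, 3, 2), date(2020, 3, 4),                      # ISO week 10
    date(2020, 3, 10), date(2020, 3, 11), date(2020, 3, 12),  # ISO week 11
    date(2020, 3, 17),                                       # ISO week 12
]

weekly = weekly_counts(posted)
series = log_transform(list(weekly.values()))
diffed = first_difference(series)
```

The same three steps would be applied to the weekly Google Trends series before both are passed to an ARDL estimation.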

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/276d/12248262/0f32437d1ff8/cureus-0017-00000085773-i01.jpg
