Zhang Yuji, Shen Feichen, Mojarad Majid Rastegar, Li Dingcheng, Liu Sijia, Tao Cui, Yu Yue, Liu Hongfang
Division of Biostatistics and Bioinformatics, University of Maryland Marlene and Stewart Greenebaum Comprehensive Cancer Center, Baltimore, Maryland, United States of America.
Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, Maryland, United States of America.
PLoS One. 2018 Jan 26;13(1):e0191568. doi: 10.1371/journal.pone.0191568. eCollection 2018.
Recent scientific advances have accumulated a tremendous amount of biomedical knowledge, providing novel insights into the relationships between molecular and cellular processes and diseases. Literature mining is one of the commonly used methods to retrieve and extract information from scientific publications for understanding these associations. However, because the data volume is large and the associations are complex and noisy, interpreting such association data for semantic knowledge discovery is challenging. In this study, we describe an integrative computational framework that aims to expedite the discovery of latent disease mechanisms by dissecting 146,245 disease-gene associations from more than 25 million PubMed-indexed articles. We take advantage of both Latent Dirichlet Allocation (LDA) modeling and network-based analysis for their capabilities of detecting latent associations and reducing noise in large-volume data, respectively. Our results demonstrate that (1) the LDA-based modeling is able to group similar diseases into disease topics; (2) the disease-specific association networks follow the scale-free network property; (3) certain subnetwork patterns were enriched in the disease-specific association networks; and (4) genes were enriched in topic-specific biological processes. Our approach offers promising opportunities for latent disease-gene knowledge discovery in biomedical research.
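As a rough illustration of result (2), the sketch below shows one common way to check whether an association network looks scale-free: count node degrees and fit a line to the degree distribution on log-log axes. The abstract does not say how the authors tested this property, so the function names, the toy edge list, and the least-squares fit are all assumptions made for illustration. The log-log fit is only a crude diagnostic, not a rigorous test.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Map each degree k to the number of nodes with that degree.

    `edges` is an iterable of (disease, gene) pairs; duplicates are ignored.
    """
    degree = Counter()
    for disease, gene in set(edges):
        degree[disease] += 1
        degree[gene] += 1
    return Counter(degree.values())


def powerlaw_exponent(dist):
    """Estimate gamma in P(k) ~ k^(-gamma) by least squares on log-log axes."""
    if len(dist) < 2:
        raise ValueError("need at least two distinct degrees to fit a slope")
    xs = [math.log(k) for k in dist]
    ys = [math.log(count) for count in dist.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # negate the slope so a decaying distribution gives gamma > 0


# Hypothetical toy associations (not data from the study).
toy_edges = [
    ("D1", "G1"), ("D1", "G2"), ("D1", "G3"), ("D1", "G4"),
    ("D2", "G1"), ("D2", "G5"),
    ("D3", "G1"),
    ("D4", "G6"),
]

dist = degree_distribution(toy_edges)
gamma = powerlaw_exponent(dist)
print(sorted(dist.items()), round(gamma, 2))
```

In a network with many low-degree nodes and a few hubs, the fitted exponent is positive. A real analysis would need far more nodes than this toy example and a more careful fitting procedure.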