Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, USA.
School of Pharmacy, Sungkyunkwan University, Suwon, Republic of Korea.
BMC Bioinformatics. 2020 Oct 2;21(1):432. doi: 10.1186/s12859-020-03756-3.
In systems biology, it is of great interest to identify previously unreported associations between genes. Recently, the biomedical literature has come to be regarded as a valuable resource for this purpose. While classical clustering algorithms have been widely used to investigate associations among genes, they are not tuned for literature mining data and rely on strong assumptions, such as homogeneity and independence among observations. These assumptions are often violated in this type of data because of redundancies in functional descriptions and biological functions shared among genes. Latent block models can be an alternative in this case, but they also often perform suboptimally, especially when signals are weak. In addition, they do not allow the use of valuable prior biological knowledge, such as that available in existing databases.
To address these limitations, we propose PALMER, a constrained latent block model that identifies indirect relationships among genes based on biomedical literature mining data. By automatically associating relevant Gene Ontology terms, PALMER facilitates biological interpretation of novel findings without laborious downstream analyses. PALMER also allows researchers to use prior biological knowledge about known gene-pathway relationships to guide the identification of gene-gene associations. We evaluated PALMER with simulation studies and with applications to pathway-modulating genes relevant to cancer signaling pathways, using biological pathway annotations available in the KEGG database as prior knowledge.
We showed that PALMER outperforms traditional latent block models and reliably identifies novel gene-gene associations by utilizing prior biological knowledge, especially when signals are weak in the biomedical literature mining dataset. We believe that PALMER and its accompanying user-friendly software will be powerful tools for improving existing pathway annotations and identifying novel pathway-modulating genes.