Škrlj Blaž, Kokalj Enja, Lavrač Nada
Jožef Stefan International Postgraduate School, Ljubljana, Slovenia.
Jožef Stefan Institute, Ljubljana, Slovenia.
Front Res Metr Anal. 2021 Apr 13;6:644614. doi: 10.3389/frma.2021.644614. eCollection 2021.
PubMed is the largest resource of curated biomedical knowledge to date, containing more than 25 million documents. The volume of new literature prevents any single expert from keeping track of all potentially relevant papers, which results in knowledge gaps. In this article, we present CHEMMESHNET, a newly developed PubMed-based network comprising more than 10,000,000 associations, constructed from expert-curated MeSH annotations of chemicals across all currently available PubMed articles. By learning latent representations of concepts in this network, we show in a proof-of-concept study that purely literature-based representations are sufficient to reconstruct a large part of the currently known network of physical, empirically determined protein-protein interactions. Specifically, simple linear embeddings of node pairs, when coupled with a neural network-based classifier, reliably reconstruct the existing collection of empirically confirmed protein-protein interactions. Furthermore, we demonstrate how pairs of learned representations can be used to prioritize potentially interesting novel interactions based on their common chemical context. Highly ranked interactions are qualitatively inspected for potential complex formation at the structural level and represent potentially interesting new knowledge. Finally, we show that two protein-protein interactions prioritized by structure-based approaches are also scored as probable by the trained machine learning model.
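The pipeline outlined in the abstract (embed nodes, combine the embeddings of a node pair, classify the pair, rank candidates) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy protein-chemical association matrix, the function names, a truncated SVD as the embedding method, the element-wise product as the pair representation, and a plain logistic regression standing in for the neural network-based classifier.

```python
import numpy as np

def embed_rows(assoc, dim):
    """Embed each row (protein) with a truncated SVD (assumed embedding method)."""
    u, s, _ = np.linalg.svd(assoc, full_matrices=False)
    return u[:, :dim] * s[:dim]

def pair_features(emb, pairs):
    """Represent a pair (i, j) by the element-wise product of its two embeddings."""
    pairs = np.asarray(pairs)
    return emb[pairs[:, 0]] * emb[pairs[:, 1]]  # independent of pair order

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(x, y, lr=0.1, epochs=300):
    """Logistic regression by gradient descent (stand-in for a neural classifier)."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        grad = sigmoid(x @ w + b) - y      # gradient of the log loss w.r.t. logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def score_pairs(emb, pairs, w, b):
    """Predicted interaction probability for each candidate pair."""
    return sigmoid(pair_features(emb, pairs) @ w + b)

# Hypothetical toy data: 6 proteins x 8 chemicals, 1.0 marks a literature association.
rng = np.random.default_rng(0)
assoc = (rng.random((6, 8)) < 0.4).astype(float)
emb = embed_rows(assoc, dim=3)

# Hypothetical labels: 1.0 = known interaction, 0.0 = assumed non-interaction.
known_pairs = [(0, 1), (2, 3), (0, 4), (1, 5)]
labels = np.array([1.0, 1.0, 0.0, 0.0])
w, b = train_logistic(pair_features(emb, known_pairs), labels)

# Rank unseen candidate pairs by predicted probability, highest first.
candidates = [(0, 2), (3, 5), (4, 5)]
scores = score_pairs(emb, candidates, w, b)
ranking = [candidates[i] for i in np.argsort(-scores)]
```

In this sketch, proteins that share chemical annotations receive similar embeddings, so the pair features carry the "common chemical context" that the classifier scores. A realistic setup would need held-out evaluation and a principled choice of negative pairs.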