Graph embedding-based link prediction for literature-based discovery in Alzheimer's Disease.

Affiliations

School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia.

School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia; School of Computing Technologies, RMIT University, Melbourne, Victoria, Australia.

Publication information

J Biomed Inform. 2023 Sep;145:104464. doi: 10.1016/j.jbi.2023.104464. Epub 2023 Aug 2.

Abstract

OBJECTIVE

We explore the framing of literature-based discovery (LBD) as link prediction and graph embedding learning, with Alzheimer's Disease (AD) as our focus disease context. We specifically examine a key link prediction setting, the prediction window length, within a time-sliced evaluation methodology.

METHODS

We propose a four-stage approach to explore literature-based discovery for Alzheimer's Disease, creating and analyzing a knowledge graph tailored to the AD context, and predicting and evaluating new knowledge based on time-sliced link prediction. The first stage is to collect an AD-specific corpus. The second stage involves constructing an AD knowledge graph with identified AD-specific concepts and relations from the corpus. In the third stage, 20 pairs of training and testing datasets are constructed with the time-slicing methodology. Finally, we infer new knowledge with graph embedding-based link prediction methods. We compare different link prediction methods in this context. The impact of limiting prediction evaluation of LBD models in the context of short-term and longer-term knowledge evolution for Alzheimer's Disease is assessed.
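The time-slicing step in the third stage can be illustrated with a minimal sketch. The abstract does not give the exact protocol, so everything here is an assumption: each edge carries the year it was first reported, the training graph holds all edges up to a cutoff year, and the test set holds edges that first appear within the prediction window between nodes already present in the training graph. The names `time_slice`, `make_pairs`, and `first_seen`, the node labels, and the years are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def time_slice(
    first_seen: Dict[Edge, int], cutoff: int, window: int
) -> Tuple[Set[Edge], Set[Edge]]:
    """Split edges into a training graph and a set of future links (assumed protocol)."""
    # Training graph: every edge first reported up to and including the cutoff year.
    train = {edge for edge, year in first_seen.items() if year <= cutoff}
    train_nodes = {node for edge in train for node in edge}
    # Test set: edges first reported inside the prediction window whose two
    # endpoints were both already present in the training graph.
    test = {
        edge
        for edge, year in first_seen.items()
        if cutoff < year <= cutoff + window
        and edge[0] in train_nodes
        and edge[1] in train_nodes
    }
    return train, test


def make_pairs(
    first_seen: Dict[Edge, int], cutoffs: Iterable[int], window: int
) -> List[Tuple[Set[Edge], Set[Edge]]]:
    """Build one (train, test) pair per cutoff year."""
    return [time_slice(first_seen, cutoff, window) for cutoff in cutoffs]


# Toy edge list with made-up node labels and years (not data from the paper).
first_seen = {
    ("g1", "p1"): 1995,
    ("g2", "p2"): 1998,
    ("g1", "g2"): 1999,
    ("g2", "p1"): 2001,
    ("g3", "p1"): 2001,  # g3 is unseen at the cutoff, so this edge is not testable
    ("g1", "p2"): 2003,
}

short_train, short_test = time_slice(first_seen, cutoff=2000, window=1)
long_train, long_test = time_slice(first_seen, cutoff=2000, window=5)
print(sorted(short_test))
print(sorted(long_test))
```

Under these assumptions, widening the window only adds test edges, which is one reason the choice of prediction window length can change evaluation results.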

RESULTS

We constructed an AD corpus of more than 16k papers published between 1977 and 2021, and automatically annotated it with concepts and relations covering 11 AD-specific semantic entity types. The knowledge graph of Alzheimer's Disease derived from this resource consisted of ~11k nodes and ~394k edges, among which 34% were genotype-phenotype relationships, 57% were genotype-genotype relationships, and 9% were phenotype-phenotype relationships. A Structural Deep Network Embedding (SDNE) model consistently showed the best performance in terms of returning the most confident set of link predictions as time progresses over 20 years. A large improvement in model performance was observed when changing the link prediction evaluation setting to consider a more distant future, reflecting the time required for knowledge accumulation.
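The effect of the evaluation window on measured performance can be sketched with a simple top-k check. The abstract does not name the metric used, so precision at k is an assumption here, and the ranked list stands in for the output of any link prediction model (it is not SDNE output). The function name `precision_at_k` and all node labels are hypothetical.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def precision_at_k(ranked: List[Edge], positives: Set[Edge], k: int) -> float:
    """Fraction of the k most confident predictions that are true future links."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for edge in top if edge in positives) / len(top)


# Candidate links ordered from most to least confident (toy values).
ranked_predictions = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]

# Links that appear within a short window, and within a longer window.
# The longer window contains the shorter one because knowledge accumulates.
short_window_links = {("a", "b")}
long_window_links = {("a", "b"), ("b", "d"), ("c", "d")}

p_short = precision_at_k(ranked_predictions, short_window_links, k=4)
p_long = precision_at_k(ranked_predictions, long_window_links, k=4)
print(p_short, p_long)
```

The same ranked list scores higher against the longer window in this toy case, because predictions counted as wrong in the short term are confirmed later.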

CONCLUSION

Neural network graph-embedding link prediction methods show promise for the literature-based discovery context, although the prediction setting is extremely challenging, with graph densities of less than 1%. Varying the prediction window length in the time-sliced evaluation methodology leads to markedly different results and interpretations of LBD studies. Our approach can be generalized to enable knowledge discovery for other diseases.
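As a rough check on the sparsity claim, density can be estimated from the approximate node and edge counts given in the results. This is a sketch that assumes an undirected simple graph and that the reported edges are distinct node pairs; the densities mentioned above may refer to the individual time-sliced graphs rather than the full graph. The function name `undirected_density` is hypothetical.

```python
def undirected_density(num_nodes: int, num_edges: int) -> float:
    """Density of an undirected simple graph: edges divided by possible node pairs."""
    possible_pairs = num_nodes * (num_nodes - 1) / 2
    return num_edges / possible_pairs if possible_pairs else 0.0


# Approximate counts from the results (~11k nodes, ~394k edges).
density = undirected_density(11_000, 394_000)
print(f"{density:.4%}")
```

With these rounded counts the estimate falls well below 1%, so the vast majority of candidate node pairs are non-links, which is what makes the prediction task hard.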

AVAILABILITY

Code, AD ontology, and data are available at https://github.com/READ-BioMed/readbiomed-lbd.
