Avoiding background knowledge: literature based discovery from important information.

Author Information

Information School, University of Sheffield, S1 4DP, Sheffield, UK.

Publication Information

BMC Bioinformatics. 2023 Mar 14;23(Suppl 9):570. doi: 10.1186/s12859-022-04892-8.

Abstract

BACKGROUND

Automatic literature based discovery attempts to uncover new knowledge by connecting existing facts: information extracted from existing publications in the form of A → B and B → C relations can be simply connected to deduce A → C. However, using this approach, the quantity of proposed connections is often too vast to be useful. It can be reduced by using subject → predicate → object triples as the A → B relations, but too many proposed connections remain for manual verification.
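To make the A → B / B → C connection step concrete, the following is a minimal sketch of this closure over subject-predicate-object triples. The triples, function name, and output are purely illustrative and are not taken from the paper.

```python
from collections import defaultdict

# Toy subject-predicate-object triples standing in for relations
# extracted from publications (illustrative data only).
triples = [
    ("fish oil", "reduces", "blood viscosity"),
    ("blood viscosity", "associated_with", "Raynaud's disease"),
    ("magnesium", "affects", "migraine"),
]

def candidate_pairs(triples):
    """Connect A->B and B->C triples to propose hidden A->C knowledge pairs."""
    by_subject = defaultdict(list)
    for subj, _pred, obj in triples:
        by_subject[subj].append(obj)

    candidates = set()
    for a, _pred, b in triples:
        for c in by_subject.get(b, []):
            if c != a:  # skip trivial A->A proposals
                candidates.add((a, c))
    return candidates

print(candidate_pairs(triples))
# {('fish oil', "Raynaud's disease")}
```

On a real corpus this closure explodes combinatorially, which is exactly the problem the triple-filtering approach described in the Results is meant to address.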

RESULTS

Based on the hypothesis that only a small number of the subject-predicate-object triples extracted from a publication represent the paper's novel contribution(s), we explore using BERT embeddings to identify these triples before literature based discovery is performed using only these important triples. While the method exploits the availability of full texts of publications in the CORD-19 dataset to build a training set (making use of the fact that a novel contribution is likely to be mentioned in both the abstract and the body of a paper), the resulting tool can be applied to papers for which only abstracts are available. Candidate hidden knowledge pairs generated from unfiltered triples and those built from important triples only are compared using a variety of time-slicing gold standards.
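As a rough illustration of the idea described above (not the paper's actual pipeline), the sketch below labels a body triple as important when the sentence it was extracted from is semantically close to some sentence in the abstract. The library, model name, and threshold are stand-ins for the BERT embeddings used in the paper, not its exact setup.

```python
# Hedged sketch: weak "importance" labels for triples extracted from the
# body of a full-text paper, based on similarity to the abstract.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in model

def label_triples(body_triples, abstract_sentences, threshold=0.7):
    """body_triples: list of (subject, predicate, object, source_sentence).

    A triple is labelled important if its source sentence is close in
    embedding space to at least one abstract sentence.
    """
    abs_emb = model.encode(abstract_sentences, convert_to_tensor=True)
    labelled = []
    for subj, pred, obj, sent in body_triples:
        sent_emb = model.encode([sent], convert_to_tensor=True)
        score = util.cos_sim(sent_emb, abs_emb).max().item()
        labelled.append(((subj, pred, obj), score >= threshold))
    return labelled
```

Weak labels of this kind could then supervise a classifier that judges importance from a triple's embedding alone, which is what would allow the filter to be applied to papers where only the abstract is available.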

CONCLUSIONS

The quantity of proposed knowledge pairs is reduced by a factor of [Formula: see text], and we show that when the gold standard is designed to avoid rewarding background knowledge, the precision obtained increases by up to a factor of 10. We argue that the gold standard needs to be carefully considered, and we release, alongside this work, as-yet-undiscovered candidate knowledge pairs based on important triples.
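A hedged sketch of the kind of time-sliced evaluation the conclusion alludes to: candidate pairs that already co-occur before the cutoff date are treated as background knowledge and excluded from both the predictions and the gold standard before precision is computed. All names are illustrative, not taken from the paper.

```python
# Precision under a time-sliced gold standard that does not reward
# background knowledge (pairs already known before the cutoff).
def precision(candidates, pre_cutoff_cooccurrences, post_cutoff_cooccurrences):
    background = set(pre_cutoff_cooccurrences)
    gold = set(post_cutoff_cooccurrences) - background   # genuinely new pairs
    novel_candidates = set(candidates) - background
    if not novel_candidates:
        return 0.0
    return len(novel_candidates & gold) / len(novel_candidates)
```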

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e40/10015730/72075486d1c7/12859_2022_4892_Fig1_HTML.jpg
