Extracting knowledge networks from plant scientific literature: potato tuber flesh color as an exemplary trait.

Author information

Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700AJ, The Netherlands.

IBM Netherlands, Amsterdam, The Netherlands.

Publication information

BMC Plant Biol. 2021 Apr 24;21(1):198. doi: 10.1186/s12870-021-02943-5.

Abstract

BACKGROUND

Scientific literature carries a wealth of information crucial for research, but only a fraction of it is available as structured information in databases and can therefore be analyzed with traditional data analysis tools. Natural language processing (NLP) is often used successfully to support humans by distilling relevant information from large corpora of free text and structuring it in a way that lends itself to further computational analysis. For this pilot, we developed a pipeline that applies NLP to biological literature to produce knowledge networks. We focused on the flesh color of potato, a well-studied trait with known associations, and investigated whether these knowledge networks can help us formulate new hypotheses about the underlying biological processes.

RESULTS

We trained an NLP model on a manually annotated corpus of 34 full-text potato articles to recognize relevant biological entities (genes, proteins, metabolites and traits) and the relationships between them in text. This model detected biological entities with a precision of 97.65% and a recall of 88.91% on the training set. We conducted a time series analysis on 4023 PubMed abstracts of plant genetics-based articles focusing on 4 major Solanaceous crops (tomato, potato, eggplant and capsicum) and found that the networks contained both previously known leads and leads that were unknown at the time to subsequently discovered biological phenomena relating to flesh color. A novel time-based analysis of these networks indicates a connection between our trait and a candidate gene (zeaxanthin epoxidase) two years before that connection was stated explicitly in the literature.
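The idea behind such a time-based analysis can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes that each extracted relation is reduced to a publication year and a pair of entity names, builds a cumulative network per year, and reports the first year in which two entities are connected by any path and the first year in which they are linked directly. The records, entity names and function names below are hypothetical and invented for illustration only.

```python
from collections import defaultdict, deque

# Hypothetical toy input: (publication year, pair of related entities).
# These records are invented for illustration; they are not data from the study.
RECORDS = [
    (2001, ("trait_X", "metabolite_A")),
    (2002, ("metabolite_A", "gene_G")),
    (2004, ("trait_X", "gene_G")),
]


def build_network(records, up_to_year):
    """Build an undirected adjacency map from all relations published up to a year."""
    graph = defaultdict(set)
    for year, (a, b) in records:
        if year <= up_to_year:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def path_length(graph, source, target):
    """Breadth-first search: edges on the shortest path, or None if unconnected."""
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def first_years(records, source, target):
    """Return (first year with any path, first year with a direct link)."""
    first_indirect = None
    first_direct = None
    for year in sorted({y for y, _ in records}):
        dist = path_length(build_network(records, year), source, target)
        if dist is not None and first_indirect is None:
            first_indirect = year
        if dist == 1 and first_direct is None:
            first_direct = year
    return first_indirect, first_direct


if __name__ == "__main__":
    print(first_years(RECORDS, "trait_X", "gene_G"))
```

In this toy data the two entities become connected through an intermediate node before any record links them directly, which is the kind of lead a cumulative, year-by-year network is meant to surface. A real analysis would also need entity normalization and some weighting of relations, both of which are omitted here.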

CONCLUSIONS

Our time-based analysis indicates that network-assisted analysis of the literature shows promise for knowledge discovery, data integration and hypothesis generation in scientific research.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9186/8070292/5d2be2ac89ce/12870_2021_2943_Fig1_HTML.jpg
