A realistic assessment of methods for extracting gene/protein interactions from free text.

Authors

Kabiljo Renata, Clegg Andrew B, Shepherd Adrian J

Affiliation

School of Crystallography and Institute of Structural and Molecular Biology, Birkbeck College, University of London, Malet Street, London WC1E 7HX, UK.

Publication details

BMC Bioinformatics. 2009 Jul 28;10:233. doi: 10.1186/1471-2105-10-233.

DOI: 10.1186/1471-2105-10-233

PMID: 19635172

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2723093/

Abstract

BACKGROUND

The automated extraction of gene and/or protein interactions from the literature is one of the most important targets of biomedical text mining research. In this paper we present a realistic evaluation of gene/protein interaction mining relevant to potential non-specialist users. Hence we have specifically avoided methods that are complex to install or require reimplementation, and we coupled our chosen extraction methods with a state-of-the-art biomedical named entity tagger.

RESULTS

Our results show that performance across different evaluation corpora is extremely variable; that the use of tagged (as opposed to gold standard) gene and protein names has a significant impact on performance, with a drop in F-score of over 20 percentage points being commonplace; and that a simple keyword-based benchmark algorithm, when coupled with a named entity tagger, outperforms two of the tools most widely used to extract gene/protein interactions.
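
To make the two quantities in these results concrete, the following is a minimal sketch of a keyword-based co-occurrence baseline and of the F-score used to compare it against a gold standard. It is not the benchmark algorithm from the paper, whose details are not given in this abstract: the keyword list, the function names `extract_pairs` and `f_score`, and the example sentence are all illustrative assumptions, and the numbers it produces are toy values rather than reported results.

```python
import re

# Hypothetical keyword list; the paper's actual keywords are not stated in the abstract.
INTERACTION_KEYWORDS = {"interacts", "binds", "activates", "inhibits", "phosphorylates"}


def extract_pairs(sentence, entities):
    """Predict every unordered pair of entities found in a sentence that
    also contains at least one interaction keyword (assumed baseline)."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    if not tokens & INTERACTION_KEYWORDS:
        return set()
    # Keep only the entity names that actually occur in the sentence.
    present = sorted({e for e in entities if e in sentence})
    return {(a, b) for i, a in enumerate(present) for b in present[i + 1:]}


def f_score(predicted, gold):
    """Balanced F-score (harmonic mean of precision and recall) over pair sets."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = "RAD51 binds BRCA2 and is inhibited by TP53."
    gold_pairs = {("BRCA2", "RAD51"), ("RAD51", "TP53")}

    # Gold-standard names: all three entities are known to the extractor.
    with_gold_names = extract_pairs(sentence, ["RAD51", "BRCA2", "TP53"])
    # Tagged names: assume the entity tagger missed TP53.
    with_tagged_names = extract_pairs(sentence, ["RAD51", "BRCA2"])

    print(f"{f_score(with_gold_names, gold_pairs):.2f}")    # 0.80
    print(f"{f_score(with_tagged_names, gold_pairs):.2f}")  # 0.67
```

In this toy case a single missed entity name removes one true interaction from the predictions, which is the mechanism by which replacing gold-standard names with tagger output can lower the F-score.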

CONCLUSION

In terms of availability, ease of use and performance, the potential non-specialist user community interested in automatically extracting gene and/or protein interactions from free text is poorly served by current tools and systems. The public release of extraction tools that are easy to install and use, and that achieve state-of-the-art levels of performance, should be treated as a high priority by the biomedical text mining community.
