Efficient extraction of protein-protein interactions from full-text articles.

Affiliation

Department of Computer Science, Arizona State University, Tempe, AZ 85281-8809, USA.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2010 Jul-Sep;7(3):481-94. doi: 10.1109/TCBB.2010.51.

Abstract

Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to alleviate the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to raise awareness of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein named-entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions to handle a single full-text article in between 10 seconds and 2 minutes. We propose strategies to transfer document-level annotations to the sentence level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an f-score of 22 percent for finding protein interactions, and 43 percent for mapping proteins to UniProt IDs; disregarding species, f-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text. All data and pattern sets, as well as Java classes that extend third-party software, are available as supplementary information (see Appendix).
