ResidueFinder: extracting individual residue mentions from protein literature.

Affiliations

Department of Molecular and Integrative Physiology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61801, USA.

Department of Biochemistry, Program in Biophysics and Computational Biology, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61801, USA.

Publication information

J Biomed Semantics. 2021 Jul 21;12(1):14. doi: 10.1186/s13326-021-00243-3.

DOI:10.1186/s13326-021-00243-3

PMID:34289903

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8293528/

Abstract

BACKGROUND

The revolution in molecular biology has shown how protein function and structure are based on specific sequences of amino acids. Thus, an important feature in many papers is the mention of the significance of individual amino acids in the context of the entire sequence of the protein. MutationFinder is a widely used program for finding mentions of specific mutations in texts. We report on augmenting the positive attributes of MutationFinder with a more inclusive regular expression list to create ResidueFinder, which finds mentions of native amino acids as well as mutations. We also consider parameter options for both ResidueFinder and MutationFinder to explore trade-offs between precision, recall, and computational efficiency. We test our methods and software in full text as well as abstracts.
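To make the idea of matching both native residues and mutations concrete, here is a minimal Python sketch. It is not ResidueFinder's actual regular expression list: the pattern, the names `RESIDUE_RE` and `find_residue_mentions`, and the restriction to three-letter codes are all assumptions made for illustration.

```python
import re

# Standard three-letter to one-letter amino acid codes (20 residues).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
THREE = "|".join(THREE_TO_ONE)

# Hypothetical pattern (an assumption, not taken from the paper):
# three-letter code, optional hyphen, sequence position, and an optional
# second three-letter code, which marks the mention as a point mutation.
RESIDUE_RE = re.compile(
    rf"\b(?P<wt>{THREE})-?(?P<pos>[1-9]\d*)(?P<mut>{THREE})?\b"
)


def find_residue_mentions(text):
    """Return (wild_type, position, mutant_or_None) tuples found in text."""
    mentions = []
    for m in RESIDUE_RE.finditer(text):
        mut = m.group("mut")
        mentions.append((
            THREE_TO_ONE[m.group("wt")],
            int(m.group("pos")),
            THREE_TO_ONE[mut] if mut else None,
        ))
    return mentions
```

A mention with no mutant residue (for example "Ser195") is reported with `None` in the third slot, so the same pass over the text separates native-residue mentions from mutations. A real pattern list would need many more formats, such as one-letter codes and spelled-out residue names.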

RESULTS

We find that residues are mentioned in a much wider variety of formats in the full text of papers than in abstracts alone. Failure to take these formats into account results in many false negatives. Since MutationFinder, like several other programs, was primarily tested on abstracts, we found it necessary to build an expanded regular expression list to achieve acceptable recall in full-text searches. We also discovered a number of artifacts arising from PDF-to-text conversion, which we addressed by adding elements to the regular expression library. Taking these factors into account resulted in high recall on randomly selected primary research articles. We also developed a streamlined regular expression (called "cut") which enables a several hundredfold speedup in both MutationFinder and ResidueFinder with only a modest compromise of recall. All regular expressions were tested using expanded F-measure statistics, i.e., we compute Fβ for various values of β, where larger values of β weight recall more heavily and smaller values of β weight precision more heavily.
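The Fβ statistic mentioned above has the standard definition Fβ = (1 + β²) · P · R / (β² · P + R), where P is precision and R is recall. The following Python sketch computes it from raw counts. The function name `f_beta` and the counts in the usage note are illustrative assumptions, not values reported in the paper.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Compute the F-beta score from true positive, false positive,
    and false negative counts.

    beta > 1 weights recall more heavily; beta < 1 weights precision
    more heavily; beta == 1 gives the usual F1 score.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom
```

For a hypothetical run with 80 true positives, 20 false positives, and 10 false negatives, recall (about 0.89) exceeds precision (0.80), so `f_beta(80, 20, 10, beta=2)` is higher than `f_beta(80, 20, 10, beta=0.5)`. Comparing scores across several β values in this way shows whether a pattern list errs toward missed mentions or toward spurious ones.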

CONCLUSIONS

ResidueFinder is a simple, effective, and efficient program for finding individual residue mentions in primary literature, starting from text files. It is implemented in Python and available on SourceForge.net. The most computationally efficient versions of ResidueFinder could enable creation and maintenance of a database of residue mentions encompassing all articles in PubMed.
