ResidueFinder: extracting individual residue mentions from protein literature.

Affiliations

Department of Molecular and Integrative Physiology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61801, USA.

Department of Biochemistry, Program in Biophysics and Computational Biology, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61801, USA.

Publication information

J Biomed Semantics. 2021 Jul 21;12(1):14. doi: 10.1186/s13326-021-00243-3.

DOI: 10.1186/s13326-021-00243-3
PMID: 34289903
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8293528/
Abstract

BACKGROUND

The revolution in molecular biology has shown how protein function and structure are based on specific sequences of amino acids. Thus, an important feature in many papers is the mention of the significance of individual amino acids in the context of the entire sequence of the protein. MutationFinder is a widely used program for finding mentions of specific mutations in texts. We report on augmenting the positive attributes of MutationFinder with a more inclusive regular expression list to create ResidueFinder, which finds mentions of native amino acids as well as mutations. We also consider parameter options for both ResidueFinder and MutationFinder to explore trade-offs between precision, recall, and computational efficiency. We test our methods and software in full text as well as abstracts.
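To make the distinction between a native residue mention and a mutation mention concrete, here is a minimal Python sketch. It is not ResidueFinder's or MutationFinder's actual regular expression list, which the abstract does not reproduce. The pattern `MENTION`, the function `find_mentions`, and the example sentence are hypothetical, and the pattern covers only one simple format (a three-letter code, an optional hyphen or space, a position, and an optional second three-letter code).

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
AA3 = "|".join(THREE_TO_ONE)

# Hypothetical pattern (an assumption for illustration only):
# wild-type residue, optional "-" or " ", sequence position,
# and an optional mutant residue directly after the position.
MENTION = re.compile(
    rf"\b(?P<wt>{AA3})[- ]?(?P<pos>[1-9]\d*)(?P<mut>{AA3})?\b"
)


def find_mentions(text):
    """Return (kind, wild_type, position, mutant_or_None) tuples."""
    results = []
    for m in MENTION.finditer(text):
        wt = THREE_TO_ONE[m.group("wt")]
        pos = int(m.group("pos"))
        mut = m.group("mut")
        if mut:
            # A second residue code marks a substitution, e.g. Arg82Gln.
            results.append(("mutation", wt, pos, THREE_TO_ONE[mut]))
        else:
            # Only residue plus position: a native residue mention.
            results.append(("residue", wt, pos, None))
    return results


sentence = ("Mutation of Asp-85 abolished proton transfer, "
            "whereas the Arg82Gln mutant retained activity.")
mentions = find_mentions(sentence)
```

A mutation-only tool would report just the second match here; a residue-oriented tool also needs the first, which is the gap the abstract describes.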

RESULTS

We find that residues are mentioned in a much wider variety of formats in the full text of papers than in abstracts alone. Failing to account for these formats produces many false negatives. Since MutationFinder, like several other programs, was primarily tested on abstracts, we found it necessary to build an expanded regular expression list to achieve acceptable recall in full-text searches. We also discovered a number of artifacts arising from PDF-to-text conversion, and added elements to the regular expression library to handle them. Taking these factors into account resulted in high recall on randomly selected primary research articles. We also developed a streamlined regular expression (called "cut") that enables a several hundredfold speedup in both MutationFinder and ResidueFinder with only a modest compromise of recall. All regular expressions were tested using expanded F-measure statistics, i.e., we compute F for various values of β, where the larger the value of β the more recall is weighted, and the smaller the value of β the more precision is weighted.
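The weighting role of β can be illustrated with the conventional F-measure definition, F_β = (1 + β²)·P·R / (β²·P + R), where P is precision and R is recall. The sketch below assumes this standard formula is the one meant by "expanded F-measure statistics". The function name `f_beta` and the counts are hypothetical and are not results from the paper.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Conventional F-beta score from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was found correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts in which recall (80/90) exceeds precision (80/100).
tp, fp, fn = 80, 20, 10
scores = {beta: f_beta(tp, fp, fn, beta) for beta in (0.5, 1.0, 2.0)}
```

Because recall is higher than precision in this made-up example, the score rises as β increases, which matches the statement that larger β weights recall more heavily.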

CONCLUSIONS

ResidueFinder is a simple, effective, and efficient program for finding individual residue mentions in primary literature. It takes text files as input, is implemented in Python, and is available on SourceForge.net. The most computationally efficient versions of ResidueFinder could enable creation and maintenance of a database of residue mentions encompassing all articles in PubMed.
