Department of Molecular and Integrative Physiology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA.
Department of Biochemistry, Program in Biophysics and Computational Biology, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA.
J Biomed Semantics. 2021 Jul 21;12(1):14. doi: 10.1186/s13326-021-00243-3.
The revolution in molecular biology has shown how protein function and structure are based on specific sequences of amino acids. Thus, many papers discuss the significance of individual amino acids in the context of the entire sequence of the protein. MutationFinder is a widely used program for finding mentions of specific mutations in texts. We report on augmenting the positive attributes of MutationFinder with a more inclusive regular expression list to create ResidueFinder, which finds mentions of native amino acids as well as mutations. We also consider parameter options for both ResidueFinder and MutationFinder to explore trade-offs between precision, recall, and computational efficiency. We test our methods and software on full text as well as abstracts.
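To make the regular-expression approach concrete, the following is a minimal Python sketch of how mutation and native-residue mentions might be matched. The patterns and the names `MUTATION_RE`, `RESIDUE_RE`, and `find_mentions` are hypothetical illustrations, not the expressions distributed with MutationFinder or ResidueFinder, and they cover only two simple mention formats.

```python
import re

# Hypothetical illustration only: these are NOT the regular expressions
# shipped with MutationFinder or ResidueFinder.
THREE_LETTER = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)
ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"

# Point mutation written as wild type, position, mutant: "A123T" or "Ala123Thr".
MUTATION_RE = re.compile(
    rf"\b(?:(?P<wt1>[{ONE_LETTER}])(?P<pos1>[1-9]\d*)(?P<mt1>[{ONE_LETTER}])"
    rf"|(?P<wt3>{THREE_LETTER})(?P<pos3>[1-9]\d*)(?P<mt3>{THREE_LETTER}))\b"
)

# Native residue written as three-letter code plus position: "Ser195" or "His-57".
RESIDUE_RE = re.compile(rf"\b(?P<res>{THREE_LETTER})-?(?P<pos>[1-9]\d*)\b")


def find_mentions(text):
    """Return (kind, residue, position, mutant_or_None) tuples found in text."""
    mentions = []
    for m in MUTATION_RE.finditer(text):
        wild_type = m.group("wt1") or m.group("wt3")
        position = m.group("pos1") or m.group("pos3")
        mutant = m.group("mt1") or m.group("mt3")
        mentions.append(("mutation", wild_type, int(position), mutant))
    for m in RESIDUE_RE.finditer(text):
        mentions.append(("residue", m.group("res"), int(m.group("pos")), None))
    return mentions


if __name__ == "__main__":
    sample = "The A123T mutant lost activity, whereas Ser195 and His-57 are catalytic."
    for mention in find_mentions(sample):
        print(mention)
```

A one-letter pattern such as this will also match unrelated tokens that happen to share the shape letter, digits, letter, which is one reason such patterns trade precision against recall.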
We find that residues are mentioned in a much wider variety of formats in the full text of papers than in abstracts alone. Failure to take these formats into account results in many false negatives. Since MutationFinder, like several other programs, was primarily tested on abstracts, we found it necessary to build an expanded regular expression list to achieve acceptable recall in full-text searches. We also discovered a number of artifacts arising from PDF-to-text conversion and added elements to the regular expression library to address them. Taking these factors into account resulted in high recall on randomly selected primary research articles. We also developed a streamlined regular expression (called "cut") that enables a several-hundredfold speedup in both MutationFinder and ResidueFinder with only a modest compromise in recall. All regular expressions were tested using expanded F-measure statistics, i.e., we compute F for various values of β, where the larger the value of β the more recall is weighted, and the smaller the value of β the more precision is weighted.
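The weighting by β can be illustrated with a short Python sketch. It assumes the standard F-measure definition, F_β = (1 + β²)·P·R / (β²·P + R), where P is precision and R is recall; the abstract does not state the exact formula used, and the function name `f_beta` and the example values are hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing was found correctly
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    p, r = 0.9, 0.6
    for beta in (0.5, 1.0, 2.0):
        print(f"beta={beta}: F={f_beta(p, r, beta):.3f}")
```

With precision higher than recall, as in this made-up example, the score falls as β grows, because the weaker recall counts for more.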
ResidueFinder is a simple, effective, and efficient program for finding individual residue mentions in the primary literature. It takes text files as input, is implemented in Python, and is available on SourceForge.net. The most computationally efficient versions of ResidueFinder could enable creation and maintenance of a database of residue mentions encompassing all articles in PubMed.