Yip Yum Lina, Lachenal Nathalie, Pillet Violaine, Veuthey Anne-Lise
Swiss-Prot Group, Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1, rue Michel-Servet, 1211, Geneva 4, Switzerland.
J Bioinform Comput Biol. 2007 Dec;5(6):1215-31. doi: 10.1142/s021972000700320x.
The UniProt/Swiss-Prot Knowledgebase records about 30,500 variants in 5,664 proteins (Release 52.2). Most of these variants are manually curated single amino acid polymorphisms (SAPs) with references to the literature. To keep the list of published documents related to SAPs up to date, an automatic information retrieval method was developed to recover texts mentioning SAPs. The method uses regular expressions (patterns) and rules to detect and validate mutations. When evaluated on a corpus of 9,820 PubMed references, the retrieval precision was 89.5% over all variants. Support for nonstandard mutation nomenclature and sequence positional correction was also found to be necessary to retrieve a significant number of relevant articles. The method was then applied to the 5,664 proteins with variants: a PubMed query combining gene or protein names with a list of mutation-related keywords first retrieved candidate articles, and the SAP detection procedure then recovered the relevant documents. The method was found to be efficient in retrieving new references on known polymorphisms. New references on known SAPs will be made accessible to the public via the Swiss-Prot variant pages.
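The detection and validation steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two patterns, the function names `find_mentions` and `resolve_position`, and the offset-based positional correction are assumptions chosen for illustration. The abstract states only that regular expressions, validation rules, nonstandard nomenclature, and sequence positional correction were used.

```python
import re

# Three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE = "ACDEFGHIKLMNPQRSTVWY"
AA3 = "|".join(THREE_TO_ONE)

# Hypothetical patterns; a real system would need many more variants.
PATTERNS = [
    # One-letter form, e.g. R117H
    re.compile(rf"\b(?P<wt>[{ONE}])(?P<pos>\d+)(?P<mut>[{ONE}])\b"),
    # Three-letter forms, e.g. Arg117His, Arg-117-His, Arg117->His
    re.compile(
        rf"\b(?P<wt>{AA3})-?(?P<pos>\d+)(?:-|\s?(?:->|→)\s?)?(?P<mut>{AA3})\b"
    ),
]


def find_mentions(text):
    """Return (wild_type, position, mutant) tuples in one-letter code."""
    mentions = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            wt = THREE_TO_ONE.get(m.group("wt"), m.group("wt"))
            mut = THREE_TO_ONE.get(m.group("mut"), m.group("mut"))
            if wt != mut:  # simple rule: a substitution must change the residue
                mentions.append((wt, int(m.group("pos")), mut))
    return mentions


def resolve_position(wt, pos, sequence, offsets=(0, 1, -1)):
    """Validate a mention against the protein sequence.

    Tries the reported position first, then small numbering offsets
    (an assumed stand-in for positional correction). Returns the
    1-based position where the wild-type residue matches, or None.
    """
    for off in offsets:
        idx = pos + off - 1
        if 0 <= idx < len(sequence) and sequence[idx] == wt:
            return pos + off
    return None


# Toy usage with a made-up sequence and sentence.
sequence = "MKTAYIAKQR"
text = "The p.Thr3Ala and Y5C variants were found; K7E was also reported."
for wt, pos, mut in find_mentions(text):
    print(wt, pos, mut, "->", resolve_position(wt, pos, sequence))
```

Checking the wild-type residue against the sequence is what filters out strings that only look like mutations, while the offset search recovers mentions numbered differently from the database sequence.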