Price Morgan N, Arkin Adam P
Environmental Genomics and Systems Biology, Lawrence Berkeley National Lab, Berkeley, California, USA.
mSystems. 2017 Aug 15;2(4). doi: 10.1128/mSystems.00039-17. eCollection 2017 Jul-Aug.
Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST's database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/.
With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins' functions.
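The identifier-search and snippet steps described in the abstract can be illustrated with a minimal sketch. This is not PaperBLAST's actual code: the function names `build_search_url` and `extract_snippets`, the quoted-identifier query form, the 60-character window, and the example locus tag are all assumptions made for illustration. The first function only builds a query URL for the public Europe PMC REST search endpoint (no request is sent), and the second pulls short text windows around whole-word mentions of an identifier in article text that has already been retrieved.

```python
import re
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint; how PaperBLAST itself queries
# EuropePMC is not stated in the abstract, so this usage is an assumption.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(identifier: str, page_size: int = 25) -> str:
    """Build a Europe PMC search URL for one gene or protein identifier.

    The identifier is quoted so it is searched as an exact phrase.
    """
    params = {"query": f'"{identifier}"', "format": "json", "pageSize": page_size}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def extract_snippets(text: str, identifier: str, width: int = 60) -> list[str]:
    """Return short text windows around each whole-word mention of identifier.

    The lookarounds reject matches embedded in a longer token, so searching
    for "b0001" does not match "b00012" (hypothetical example identifiers).
    """
    pattern = re.compile(rf"(?<![\w-]){re.escape(identifier)}(?![\w-])")
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        snippets.append(text[start:end].strip())
    return snippets


if __name__ == "__main__":
    article = "Deletion of b0001 abolished growth, while b00012 had no effect."
    print(build_search_url("b0001"))
    for snippet in extract_snippets(article, "b0001", width=20):
        print(snippet)
```

In a fuller pipeline, each identifier found this way would be linked to its protein sequence, and a query protein would be compared against those sequences with BLAST; that step is omitted here because it depends on a local BLAST installation and database.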