fLPS: Fast discovery of compositional biases for the protein universe.

Author information

Harrison Paul M

Affiliation

Department of Biology, McGill University, Montreal, QC, Canada.

Publication information

BMC Bioinformatics. 2017 Nov 13;18(1):476. doi: 10.1186/s12859-017-1906-3.

Abstract

BACKGROUND

Proteins often contain regions that are compositionally biased (CB), i.e., they are made from a small subset of amino-acid residue types. These CB regions can be functionally important, e.g., the prion-forming and prion-like regions that are rich in asparagine and glutamine residues.

RESULTS

Here I report a new program fLPS that can rapidly annotate CB regions. It discovers both single-residue and multiple-residue biases. It works through a process of probability minimization. First, contigs are constructed for each amino-acid type out of sequence windows with a low degree of bias; second, these contigs are searched exhaustively for low-probability subsequences (LPSs); third, such LPSs are iteratively assessed for merger into possible multiple-residue biases. At each of these stages, efficiency measures are taken to avoid or delay probability calculations unless/until they are necessary. On a current desktop workstation, the fLPS algorithm can annotate the biased regions of the yeast proteome (>5700 sequences) in <1 s, and of the whole current TrEMBL database (>65 million sequences) in as little as ~1 h, which is >2 times faster than the commonly used program SEG, using default parameters. fLPS discovers both shorter CB regions (of the sort that are often termed 'low-complexity sequence'), and milder biases that may only be detectable over long tracts of sequence.
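
As an illustration only, the single-residue search described above can be sketched in Python. This is not the fLPS source code. It assumes a binomial model for the probability of a subsequence, and it skips the window-based contig construction, the efficiency measures, and the merging of LPSs into multiple-residue biases. The function names, the example sequence, and the background frequency of 0.04 for glutamine are hypothetical choices made for the example.

```python
from math import comb


def binomial_tail(n, k, f):
    """P(X >= k) for X ~ Binomial(n, f): the chance of seeing at least k
    copies of a residue in n positions, given background frequency f.
    (Assumed probability model for this sketch.)"""
    return sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k, n + 1))


def lowest_probability_subsequence(seq, residue, f):
    """Exhaustively score every subsequence that starts and ends with
    `residue` and return (probability, start, end) for the least probable
    one. Indices are 0-based and `end` is inclusive. Returns None if the
    residue does not occur in `seq`."""
    positions = [i for i, aa in enumerate(seq) if aa == residue]
    best = None
    for a, start in enumerate(positions):
        for b in range(a, len(positions)):
            end = positions[b]
            n = end - start + 1  # subsequence length
            k = b - a + 1        # copies of the residue inside it
            p = binomial_tail(n, k, f)
            if best is None or p < best[0]:
                best = (p, start, end)
    return best


# Hypothetical toy sequence with a glutamine-rich tract.
seq = "MKTAYQQQQNQQQQLLKDE"
p, start, end = lowest_probability_subsequence(seq, "Q", 0.04)
print(f"Q-rich LPS: {seq[start:end + 1]} at {start}-{end}, P = {p:.2e}")
```

In this toy case the minimum falls on the whole tract from position 5 to position 13, spanning the single asparagine, rather than on either run of four glutamines. The brute-force loop is quadratic in the number of occurrences of the residue and recomputes each probability from scratch, which is exactly the kind of cost that the efficiency measures mentioned above are meant to avoid.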

CONCLUSIONS

fLPS can readily handle very large protein data sets, such as might come from metagenomics projects. It is useful in searching for proteins with similar CB regions, and for making functional inferences about CB regions for a protein of interest. The fLPS package is available from: http://biology.mcgill.ca/faculty/harrison/flps.html , or https://github.com/pmharrison/flps , or is a supplement to this article.
