

Comparison of sequence masking algorithms and the detection of biased protein sequence regions.

Authors

Kreil David P, Ouzounis Christos A

Affiliation

Department of Genetics/Inference Group (Cavendish Laboratory), University of Cambridge, Cambridge, UK.

Publication information

Bioinformatics. 2003 Sep 1;19(13):1672-81. doi: 10.1093/bioinformatics/btg212.

Abstract

MOTIVATION

Separation of protein sequence regions according to their local information complexity and subsequent masking of low complexity regions has greatly enhanced the reliability of function prediction by sequence similarity. Comparisons with alternative methods that focus on compositional sequence bias rather than information complexity measures have shown that removal of compositional bias yields at least as sensitive and much more specific results. Besides the application of sequence masking algorithms to sequence similarity searches, the study of the masked regions themselves is of great interest. Traditionally, however, these have been neglected despite evidence of their functional relevance.
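To make the idea of masking by local information complexity concrete, here is a minimal sketch that scores each sliding window by its Shannon entropy and masks windows that fall at or below a cutoff. It is an illustration only, not the algorithm evaluated in the paper: the function names, the window length of 12, the entropy threshold, and the mask character `X` are all assumptions chosen for the example.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of one window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12,
                        threshold: float = 2.2, mask_char: str = "X") -> str:
    """Replace every residue that lies in a low-entropy window with mask_char.

    window and threshold are illustrative values (assumptions), not
    parameters taken from the paper.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical sequence with a poly-Q run flanked by diverse residues.
    example = "ACDEFGHIKL" + "Q" * 12 + "MNPRSTVWYA"
    print(mask_low_complexity(example, window=12, threshold=0.3))
```

A measure of compositional bias would instead compare the window composition against a background distribution, which is the distinction the abstract draws; the sketch above covers only the complexity side.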

RESULTS

Here we demonstrate that compositional bias seems to be a more effective measure for the detection of biologically meaningful signals. Typical results on proteins are compared to results for sequences that have been randomized in various ways, conserving composition and local correlations for individual proteins or the entire set. It is remarkable that low-complexity regions have the same form of distribution in proteins as in randomized sequences, and that the signal from randomized sequences with conserved local correlations and amino acid composition almost matches the signal from proteins. This is not the case for sequence bias, which hence seems to be a genuinely biological phenomenon in contrast to patches of low complexity.
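The randomized controls mentioned above can be pictured with a small sketch. The code below shows two composition-conserving shuffles: one within each protein, and one across a pooled set while keeping each sequence length. It is a simplified assumption about what such controls might look like, not the authors' procedure, and it does not conserve local correlations (that would need, for example, a shuffle that also preserves dipeptide counts). All function names are hypothetical.

```python
import random
from collections import Counter
from typing import List


def shuffle_within(seq: str, rng: random.Random) -> str:
    """Permute one sequence; its own amino acid composition is conserved."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def shuffle_across(seqs: List[str], rng: random.Random) -> List[str]:
    """Permute residues over the whole set; only the pooled composition
    and the individual sequence lengths are conserved."""
    pool = [residue for s in seqs for residue in s]
    rng.shuffle(pool)
    shuffled, pos = [], 0
    for s in seqs:
        shuffled.append("".join(pool[pos:pos + len(s)]))
        pos += len(s)
    return shuffled


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    proteins = ["MKTAYIAKQR", "QQQQQQGGSS"]  # hypothetical toy sequences
    print([shuffle_within(p, rng) for p in proteins])
    print(shuffle_across(proteins, rng))
```

Running a masking method on the real set and on each shuffled set, then comparing how much sequence is flagged, is the kind of comparison the abstract describes.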

