Department of Computer Networks and Systems, Silesian University of Technology, Akademicka 2A, 44-100, Gliwice, Poland.
Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5A, 02-106, Warsaw, Poland.
Brief Bioinform. 2022 Sep 20;23(5). doi: 10.1093/bib/bbac299.
Low complexity regions are fragments of protein sequences composed of only a few types of amino acids. These regions frequently occur in proteins and can play an important role in their functions. However, scientists mainly focus on regions characterized by a high diversity of amino acid composition. Similarity between regions of protein sequences frequently reflects functional similarity between them. In this article, we discuss the strengths and weaknesses of the similarity analysis of low complexity regions using BLAST, HHblits and CD-HIT. These methods are considered to be the gold standard in protein similarity analysis and were designed for the comparison of high complexity regions. However, we lack specialized methods that could be used to compare low complexity regions. Therefore, we investigated the existing methods in order to understand how they can be applied to compare such regions. Our results are supported by an exploratory study and by a discussion of the amino acid composition and biological roles of selected examples. We show that existing methods need improvements to efficiently search for similar low complexity regions. We suggest features that have to be re-designed specifically for comparing low complexity regions: scoring matrix, multiple sequence alignment, e-value, local alignment and clustering based on a set of representative sequences. The results of this analysis can be used either to improve existing methods or to create new methods for the similarity analysis of low complexity regions.
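The definition of a low complexity region given in the abstract (a fragment composed of only a few types of amino acids) can be made concrete with a small sketch. The code below is an illustration only, not the method used in the article or by BLAST, HHblits or CD-HIT: it assumes that Shannon entropy of the residue composition in a sliding window is an acceptable proxy for complexity. The function names, the window length of 12 and the entropy threshold of 1.5 bits are hypothetical choices made for this example.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the residue composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq: str, window: int = 12, max_entropy: float = 1.5):
    """Return merged (start, end) half-open spans whose windows have low entropy.

    window and max_entropy are illustrative assumptions, not published values.
    """
    spans = []
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) <= max_entropy:
            if spans and i <= spans[-1][1]:
                # Window overlaps or touches the previous span: extend it.
                spans[-1] = (spans[-1][0], i + window)
            else:
                spans.append((i, i + window))
    return spans


if __name__ == "__main__":
    # Made-up sequence with a 20-residue poly-Q stretch in the middle.
    example = "MKTAYIAKQRQISFVK" + "Q" * 20 + "SHFSRQLEERLGLIEV"
    print(low_complexity_windows(example))
```

A homopolymer such as a poly-Q stretch has an entropy of 0 bits, while a window in which every residue is different reaches the maximum for its length, so a threshold between these extremes separates the two kinds of regions in this toy setting. Real sequences would need a more careful choice of parameters.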