Institute of Organismic and Molecular Evolution, Johannes Gutenberg University of Mainz, Mainz, Germany.
Department of Biomedical Science, University of Padova, Padova, Italy.
Brief Bioinform. 2020 Mar 23;21(2):458-472. doi: 10.1093/bib/bbz007.
There are multiple definitions for low complexity regions (LCRs) in protein sequences, with all of them broadly considering LCRs as regions with fewer amino acid types compared to an average composition. Following this view, LCRs can also be defined as regions showing composition bias. In this critical review, we focus on the definition of sequence complexity of LCRs and their connection with structure. We present statistics and methodological approaches that measure low complexity (LC) and related sequence properties. Composition bias is often associated with LC and disorder, but repeats, while compositionally biased, might also induce ordered structures. We illustrate this dichotomy, and more generally the overlaps between different properties related to LCRs, using examples. We argue that statistical measures alone cannot capture all structural aspects of LCRs and recommend the combined usage of a variety of predictive tools and measurements. While the methodologies available to study LCRs are already very advanced, we foresee that a more comprehensive annotation of sequences in the databases will enable the improvement of predictions and a better understanding of the evolution and the connection between structure and function of LCRs. This will require the use of standards for the generation and exchange of data describing all aspects of LCRs.
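As an illustration of the kind of statistic referred to above, the following minimal sketch scores sequence windows by the Shannon entropy of their amino acid composition. The choice of Shannon entropy, the function names, and the window size and threshold are assumptions made here for illustration; they are not taken from the review. The sketch also shows why a compositional measure alone cannot separate the two cases contrasted in the abstract: a homopolymeric tract and a short perfect repeat both score low, whatever their structural state.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of one window."""
    counts = Counter(window)
    n = len(window)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq: str, size: int = 12, threshold: float = 2.2):
    """Return (start, entropy) for every window whose entropy is <= threshold.

    `size` and `threshold` are illustrative defaults, not values
    prescribed by the review.
    """
    hits = []
    for start in range(len(seq) - size + 1):
        h = shannon_entropy(seq[start:start + size])
        if h <= threshold:
            hits.append((start, h))
    return hits


if __name__ == "__main__":
    # A homopolymer and a two-residue repeat both have low entropy,
    # while a window containing all 20 residue types has high entropy.
    for label, s in [("homopolymer", "QQQQQQQQ"),
                     ("repeat", "GSGSGSGS"),
                     ("diverse", "ACDEFGHIKLMNPQRSTVWY")]:
        print(label, round(shannon_entropy(s), 3))
```

A score of this kind flags composition bias only; in line with the recommendation above, it would need to be combined with repeat detection and disorder or structure prediction to say anything about order versus disorder.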