Mier Pablo, Andrade-Navarro Miguel A
Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, Hanns-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany.
Comput Struct Biotechnol J. 2022 Sep 18;20:5516-5523. doi: 10.1016/j.csbj.2022.09.011. eCollection 2022.
Low complexity regions (LCRs) differ in amino acid composition from the background of the corresponding proteome. The simplest LCRs are homorepeats (or polyX), regions composed mostly of a single amino acid type. Homorepeats have been extensively characterized, and their taxonomic, functional and structural features depend on the amino acid type and sequence context. Beyond homorepeats, the next step in the study of LCRs is regions composed of two types of amino acids, which we call polyXY. We classify polyXY into three categories based on the arrangement of the two amino acid types 'X' and 'Y': direpeats (e.g. 'XYXYXY'), joined (e.g. 'XXXYYY') and shuffled (e.g. 'XYYXXY'). We developed a script to search for polyXY and located them in a comprehensive set of 20,340 reference proteomes. These results are available in a dedicated web server called XYs, where users can also submit their own protein datasets to detect polyXY. We studied the distribution of polyXY types by amino acid pair XY and category, and show that polyXY in Eukaryota are mainly located within intrinsically disordered regions. Our study provides a first step towards the characterization of polyXY as protein motifs.
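The three polyXY categories named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' script: the abstract does not give their detection rules, window sizes or thresholds, so the function name `classify_polyxy` and the decision rules are assumptions made for illustration only. Here a region counts as a direpeat if the two residues strictly alternate, as joined if it is one run of one residue followed by one run of the other, and as shuffled otherwise.

```python
from itertools import groupby


def classify_polyxy(region: str) -> str:
    """Classify a region made of exactly two residue types.

    Hypothetical helper: the rules below are illustrative assumptions,
    not the criteria used by the published search script.
    """
    if len(set(region)) != 2:
        raise ValueError("region must contain exactly two amino acid types")

    # Collapse consecutive identical residues into runs,
    # e.g. 'XXXYYY' -> [3, 3] and 'XYYXXY' -> [1, 2, 2, 1].
    run_lengths = [len(list(group)) for _, group in groupby(region)]

    # Strict alternation: every run has length 1 (checked first, so a
    # two-residue region such as 'XY' is reported as a direpeat).
    if all(length == 1 for length in run_lengths):
        return "direpeat"
    # One block of one residue followed by one block of the other.
    if len(run_lengths) == 2:
        return "joined"
    # Any other arrangement of the two residues.
    return "shuffled"


if __name__ == "__main__":
    for example in ("XYXYXY", "XXXYYY", "XYYXXY"):
        print(example, classify_polyxy(example))
```

Applied to the abstract's own examples, the sketch labels 'XYXYXY' as a direpeat, 'XXXYYY' as joined and 'XYYXXY' as shuffled. A real search would also need to locate candidate regions within full protein sequences and tolerate impurities, which this sketch does not attempt.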