Necci Marco, Piovesan Damiano, Tosatto Silvio C E
Department of Biomedical Sciences and CRIBI Biotech Center, University of Padua, Padua, Italy.
CNR Institute of Neuroscience, Padua, Italy.
Protein Sci. 2016 Dec;25(12):2164-2174. doi: 10.1002/pro.3041. Epub 2016 Oct 25.
Intrinsic disorder (ID) in proteins has been extensively described over the last decade, but a large-scale classification of ID in proteins is still largely missing. Here, we provide an extensive analysis of ID in the protein universe, covering the UniProt database and derived from sequence-based predictions in MobiDB. Almost half of the sequences contain an ID region of at least five residues. About 9% of proteins have a long ID region of over 20 residues; such regions are more abundant in Eukaryotic organisms and most frequently cover less than 20% of the sequence. A small subset of about 67,000 proteins (out of over 80 million) is fully disordered and mostly found in Viruses. Most proteins have only one ID region, with short ID regions evenly distributed along the sequence and long ID regions overrepresented in the center. The charged residue composition scheme of Das and Pappu was used to classify ID proteins by structural propensity and corresponding functional enrichment. Swollen Coils seem to be used mainly as structural components and in biosynthesis in both Prokaryotes and Eukaryotes. In Bacteria, they are confined to the nucleoid, and in Viruses they provide DNA-binding function. Coils & Hairpins seem to be specialized in ribosome binding and methylation activities. Globules & Tadpoles bind antigens in Eukaryotes but are involved in killing other organisms and cytolysis in Bacteria. The Undefined class is used by Bacteria to bind toxic substances, and in Viruses it mediates transport and movement between and within organisms. Fully disordered proteins behave similarly, but are enriched for glycine residues and extracellular structures.
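The length thresholds quoted in the abstract (ID regions of at least five residues, long ID regions of over 20 residues, and the fraction of the sequence covered) can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The input encoding (a string with "D" for a predicted disordered residue and "S" for a structured one), the function names, and the rule that a protein counts as fully disordered only when every residue is "D" are assumptions made for illustration.

```python
from itertools import groupby

MIN_ID_LEN = 5    # from the abstract: ID region of at least five residues
LONG_ID_LEN = 20  # from the abstract: a long ID region has over 20 residues


def id_regions(states, min_len=MIN_ID_LEN):
    """Return 0-based, half-open (start, end) spans of disordered runs.

    `states` is a hypothetical per-residue encoding: "D" = disordered,
    any other character = structured. Runs shorter than `min_len` are ignored.
    """
    regions = []
    pos = 0
    for is_disordered, run in groupby(states, key=lambda c: c == "D"):
        length = sum(1 for _ in run)
        if is_disordered and length >= min_len:
            regions.append((pos, pos + length))
        pos += length
    return regions


def summarize(states):
    """Summarize ID content of one sequence under the assumptions above."""
    regions = id_regions(states)
    covered = sum(end - start for start, end in regions)
    return {
        "n_regions": len(regions),
        "has_long_id": any(end - start > LONG_ID_LEN for start, end in regions),
        # Coverage counts only residues inside regions of >= MIN_ID_LEN (assumption).
        "coverage": covered / len(states) if states else 0.0,
        # Assumed definition: every residue is predicted disordered.
        "fully_disordered": bool(states) and all(c == "D" for c in states),
    }


if __name__ == "__main__":
    example = "S" * 10 + "D" * 25 + "S" * 60 + "D" * 3 + "S" * 2
    print(id_regions(example))
    print(summarize(example))
```

In this example the 3-residue disordered run falls below the five-residue minimum and is ignored, so the sequence has a single long ID region.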