Siwach Pratibha, Pophaly Saurabh Dilip, Ganesh Subramaniam
Department of Biological Sciences and Bioengineering, Indian Institute of Technology, Kanpur, India.
Mol Biol Evol. 2006 Jul;23(7):1357-69. doi: 10.1093/molbev/msk022. Epub 2006 Apr 17.
Mutations causing expansion of amino acid repeats are responsible for 19 hereditary disorders. Repeats in several other proteins also show length variations. These observations prompted us to identify single amino acid repeat-containing proteins (SARPs) in humans and to understand their functional and evolutionary significance. We identified 8812 SARPs containing 17,146 repeat domains, each harboring 4 or more residues. In all, 5% of SARPs (471) showed repeat length variations, and nearly 84% of them (394) have repeats of 10 residues or fewer. We find that SARPs are involved in functions that require formation of multiprotein complexes. Nearly 78% (6859) of the SARPs have no paralogue in the human proteome, and we consider such proteins orphan SARPs. Orphan SARPs show longer repeat stretches, longer peptide length, and lower expression levels than SARPs belonging to protein families. Because the level of gene expression is known to correlate inversely with the rate of protein sequence evolution, our results suggest that orphan SARPs evolve faster than the familial forms and are therefore under weaker selection pressure. We also find that while GC-rich codons are favored for coding the repeat tracts of SARPs, specific codons, and not nucleotide motifs per se, are selected, suggesting functional constraints on codon usage. One such constraint could be mRNA stability, as clustering of rare codons is known to destabilize transcripts and rare codons are not favored for coding repeat tracts. Genes encoding polymorphic SARPs show preferential localization toward the telomeric segments. Further, the sex-specific recombination rate of the chromosomal locus correlates strongly with the parental gender that influences repeat instability in disorders caused by dynamic mutations. Therefore, instability associated with repeats might be driven by processes that are specific to sperm or oocyte development, and the recombination frequency might play a positive role in this process.
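The abstract defines a repeat domain as a tract of 4 or more residues of a single amino acid but does not describe how the tracts were detected. As a minimal sketch of that criterion only, and not the authors' pipeline, the following Python function scans a protein sequence for runs of one residue. The function name, the `min_len` parameter, and the example sequence are assumptions made for illustration.

```python
import re
from typing import List, Tuple


def find_single_residue_repeats(seq: str, min_len: int = 4) -> List[Tuple[str, int, int]]:
    """Return (residue, start, length) for each run of one residue of at least min_len.

    Hypothetical helper: it applies only the "4 or more residues" threshold
    stated in the abstract, with no tolerance for interruptions in the tract.
    """
    # One uppercase residue followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_len - 1))
    return [
        (m.group(1), m.start(), m.end() - m.start())
        for m in pattern.finditer(seq.upper())
    ]


# Made-up sequence: a 6-residue Q tract, a 4-residue A tract, and a
# 3-residue G tract that falls below the threshold.
repeats = find_single_residue_repeats("MAQQQQQQLPAAAAKGGG")
print(repeats)
```

Under this definition a protein would count as a SARP if the function returns at least one tract. Start positions are 0-based.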