Lavoie Hugo, Debeane Francois, Trinh Quoc-Dien, Turcotte Jean-Francois, Corbeil-Girard Louis-Philippe, Dicaire Marie-Josée, Saint-Denis Anik, Pagé Martin, Rouleau Guy A, Brais Bernard
Laboratoire de Neurogénétique, Centre de Recherche du Centre Hospitalier de l'Université de Montréal, Québec, Canada.
Hum Mol Genet. 2003 Nov 15;12(22):2967-79. doi: 10.1093/hmg/ddg329. Epub 2003 Sep 30.
Mutations causing expansions of polyalanine domains are responsible for nine hereditary diseases. Other GC-rich sequences coding for some polyalanine domains were found to be polymorphic in humans. These observations prompted us to identify all sequences in the human genome coding for polyalanine stretches longer than four alanines and to establish their degree of polymorphism. We identified 494 annotated human proteins containing 604 polyalanine domains. Thirty-two percent (31/98) of tested sequences coding for more than seven alanines were polymorphic. The length of the polyalanine-coding sequence and its GCG or GCC repeat content are the major predictors of polymorphism. GCG codons are over-represented in human polyalanine-coding sequences. Our data suggest that GCG and GCC codons play a key role in the appearance and polymorphism of polyalanine-coding sequences. Grouping polyalanine-containing proteins in Homo sapiens, Drosophila melanogaster and Caenorhabditis elegans by shared function shows that the majority are involved in transcriptional regulation. Phylogenetic analyses of the HOX, GATA and EVX protein families demonstrate that polyalanine domains arose independently in different members of these families, suggesting that convergent molecular evolution may have played a role. Finally, polyalanine domains in vertebrates are conserved between mammals and are rarer and shorter in Gallus gallus and Danio rerio. Together, our results show that the polymorphic nature of sequences coding for polyalanine domains makes them prime candidates for mutations in hereditary diseases and suggest that they have appeared in many different protein families through convergent evolution.
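As an illustration of the kind of screen the abstract describes, the following minimal Python sketch finds alanine runs longer than four residues in a protein sequence and tallies the alanine codons (GCG, GCC, GCA, GCT) that encode each run. It is not the authors' pipeline, which the abstract does not specify: the function names, the `min_len` parameter and the toy sequences are hypothetical and chosen only for this example.

```python
import re
from collections import Counter


def find_polyalanine_domains(protein, min_len=5):
    """Return (start, length) for each run of at least min_len alanines.

    min_len=5 corresponds to "longer than four alanines" (assumed reading).
    Positions are 0-based residue indices.
    """
    pattern = re.compile(r"A{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(protein)]


def alanine_codon_counts(cds, start_aa, length_aa):
    """Count the codons in the coding sequence that encode one alanine run.

    cds is assumed to be in frame with the protein, so residue i is
    encoded by cds[3*i : 3*i + 3].
    """
    begin = 3 * start_aa
    end = begin + 3 * length_aa
    return Counter(cds[i:i + 3] for i in range(begin, end, 3))


# Toy data (hypothetical, not taken from the study).
toy_protein = "MSAAAAAAAGKAAAAL"
toy_cds = (
    "ATG" + "TCG"
    + "GCG" * 4 + "GCC" * 2 + "GCA"   # seven alanines
    + "GGC" + "AAG"
    + "GCT" * 4                       # four alanines, below the threshold
    + "CTG"
)

domains = find_polyalanine_domains(toy_protein)
for start, length in domains:
    print(start, length, dict(alanine_codon_counts(toy_cds, start, length)))
```

In this toy case only the seven-alanine run is reported; the four-alanine run falls below the threshold. Raising `min_len` to 8 would restrict the scan to stretches of more than seven alanines, the class the abstract reports as tested for polymorphism.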