Dumontier Michel, Michalickova Katerina, Hogue Christopher W V
Department of Biochemistry, University of Toronto, Toronto, Ontario, M5S 1A8, Canada.
BMC Bioinformatics. 2002 Dec 17;3:39. doi: 10.1186/1471-2105-3-39.
An organism's ability to adapt to its particular environmental niche is of fundamental importance to its survival and proliferation. In the largest study of its kind, we sought to identify and exploit the amino-acid signatures that make species-specific protein adaptation possible across 100 complete genomes.
Correspondence analysis of the amino acid composition of more than 360,000 predicted open reading frames (ORFs) from 17 archaeal, 76 bacterial and 7 eukaryotic complete genomes identified environmental niche as a significant factor in variability. Clustering by amino acid composition also revealed groups of phylogenetically unrelated archaea and bacteria that share similar environments. Composition analyses of conservative, domain-based homology models suggested an enrichment of the small hydrophobic residues Ala, Gly and Val and the charged residues Asp, Glu, His and Arg across all genomes, whereas the larger aromatic residues Phe, Trp and Tyr are reduced in folds; these results were not affected by low-complexity biases. For each complete genome we derived two simple log-odds scoring functions, one from ORFs (CG) and one from folds (CF). CF achieved an average cross-validation success rate of 85 ± 8%, whereas CG detected 73 ± 9% of species-specific sequences when competing against all other non-redundant CG. Continuously updated results are available at http://genome.mshri.on.ca.
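To make the idea of a composition-based log-odds score concrete, the following is a minimal sketch rather than the scoring function used in the study, whose exact form is not given in this abstract. It assumes that a sequence is scored by summing, over its residues, the log of the ratio between a genome's amino acid frequency and a pooled background frequency, and that a sequence is assigned to the genome whose table scores it highest. The function names, the pseudocount, and the two toy "genomes" are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequences, pseudocount=1.0):
    """Relative amino acid frequencies over a set of sequences.

    The pseudocount (an assumption of this sketch) avoids zero
    frequencies, which would make the log ratio undefined.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def log_odds_table(genome_freqs, background_freqs):
    """Per-residue log-odds of a genome's composition against a background."""
    return {a: math.log(genome_freqs[a] / background_freqs[a]) for a in AMINO_ACIDS}


def score(sequence, table):
    """Sum of per-residue log-odds; non-standard symbols contribute zero."""
    return sum(table.get(ch, 0.0) for ch in sequence.upper())


def assign(sequence, tables):
    """Name of the genome whose table gives the sequence the highest score."""
    return max(tables, key=lambda name: score(sequence, tables[name]))


# Toy data (hypothetical, not taken from the study).
toy_genomes = {
    "toy_A": ["AAAGGVVAAE", "GAVAAGDEAR"],  # rich in Ala, Gly, Val
    "toy_B": ["KKNNIIKKYF", "NIKKFYNIKK"],  # rich in Lys, Asn, Ile
}

# Background: composition pooled over all toy genomes.
background = composition([s for seqs in toy_genomes.values() for s in seqs])

tables = {
    name: log_odds_table(composition(seqs), background)
    for name, seqs in toy_genomes.items()
}

print(assign("AAGVA", tables))
print(assign("KNIKK", tables))
```

Under these assumptions, a query sequence "competes" across all tables in the way the abstract describes for CG and CF; a fold-based table would be built in the same manner from the residues of modeled domains instead of whole ORFs.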
Our analysis of amino acid compositions from the complete genomes provides stronger evidence for species-specific and environmental residue preferences in genomic sequences as well as in folds. Scoring functions derived from this work will be useful in future protein engineering experiments and possibly in identifying horizontal transfer events.