European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, Cambridgeshire, United Kingdom.
PLoS Comput Biol. 2013;9(12):e1003382. doi: 10.1371/journal.pcbi.1003382. Epub 2013 Dec 12.
The 1000 Genomes Project data provide a natural background dataset for germline amino acid mutations in humans. Since the direction of mutation is known, the amino acid exchange matrix generated from the observed nucleotide variants is asymmetric, and the mutabilities of the individual amino acids differ widely. These differences predominantly reflect preferences for nucleotide mutations in the DNA (especially the high mutation rate of the CpG dinucleotide, which makes the mutability of arginine very much higher than that of other amino acids) rather than selection imposed by protein structure constraints, although there is evidence for the latter as well. The variants occur predominantly on the surface of proteins (82%), with a slight preference for sites that are more exposed and less well conserved than expected at random. Mutations to functional residues occur about half as often as expected by chance. The distributions of disease-associated amino acid variants in OMIM are radically different from those expected on the basis of the 1000 Genomes dataset. Disease-associated variants preferentially occur at more conserved sites than 1000 Genomes mutations do. Many of the amino acid exchange profiles appear to be anti-correlated, with exchanges that are common in one dataset being rare in the other. Disease-associated variants also exhibit more extreme differences in amino acid size and hydrophobicity. More modelling of the mutational processes at the nucleotide level is needed, but these observations should contribute to improved prediction of the effects of specific variants in humans.
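Two ideas in the abstract can be illustrated with a small sketch: why CpG hypermutability falls so heavily on arginine, and why a direction-aware exchange matrix is asymmetric. The Python below is an illustration only, not the authors' pipeline. It assumes the standard genetic code, the function names `cpg_codon_fraction` and `exchange_counts` are hypothetical, and the variant list is invented toy data rather than anything drawn from the 1000 Genomes Project. It also considers only CpG sites that lie within a single codon, although a CpG can span a codon boundary.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def cpg_codon_fraction():
    """Fraction of each amino acid's codons that contain a CG dinucleotide."""
    totals, with_cpg = Counter(), Counter()
    for codon, aa in CODON_TABLE.items():
        totals[aa] += 1
        if "CG" in codon:
            with_cpg[aa] += 1
    return {aa: with_cpg[aa] / totals[aa] for aa in totals}


def exchange_counts(variants):
    """Count directed amino acid exchanges from (ref_codon, alt_codon) pairs.

    Synonymous changes and changes involving a stop codon are skipped.
    The key (ref_aa, alt_aa) keeps the direction, so the table need not
    be symmetric.
    """
    counts = Counter()
    for ref, alt in variants:
        ref_aa, alt_aa = CODON_TABLE[ref], CODON_TABLE[alt]
        if ref_aa != alt_aa and "*" not in (ref_aa, alt_aa):
            counts[(ref_aa, alt_aa)] += 1
    return counts


# Invented toy variants (reference codon, variant codon); not real data.
toy_variants = [
    ("CGT", "CAT"),  # Arg -> His
    ("CGC", "CAC"),  # Arg -> His
    ("CGG", "CAG"),  # Arg -> Gln
    ("CAT", "CGT"),  # His -> Arg
    ("GCG", "GCA"),  # Ala -> Ala, synonymous, ignored
]

fractions = cpg_codon_fraction()
counts = exchange_counts(toy_variants)
print("Arg codons containing CG:", fractions["R"])
print("R->H:", counts[("R", "H")], " H->R:", counts[("H", "R")])
```

Under the standard code, four of the six arginine codons (CGT, CGC, CGA, CGG) contain a CG dinucleotide, whereas no other amino acid exceeds one codon in four, which is consistent with the abstract's point about arginine. Turning such counts into a mutability would additionally require dividing by how often each amino acid occurs in the background proteome, which this sketch does not attempt.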