Khaja Razi, Zhang Junjun, MacDonald Jeffrey R, He Yongshu, Joseph-George Ann M, Wei John, Rafiq Muhammad A, Qian Cheng, Shago Mary, Pantano Lorena, Aburatani Hiroyuki, Jones Keith, Redon Richard, Hurles Matthew, Armengol Lluis, Estivill Xavier, Mural Richard J, Lee Charles, Scherer Stephen W, Feuk Lars
Program in Genetics and Genomic Biology, The Hospital for Sick Children and Department of Molecular and Medical Genetics, University of Toronto and The Centre for Applied Genomics, MaRS Centre, Toronto, Ontario, M5G 1L7, Canada.
Nat Genet. 2006 Dec;38(12):1413-8. doi: 10.1038/ng1921. Epub 2006 Nov 22.
Numerous types of DNA variation exist, ranging from SNPs to larger structural alterations such as copy number variants (CNVs) and inversions. Alignment of DNA sequence from different sources has been used to identify SNPs and intermediate-sized variants (ISVs). However, only a small proportion of total heterogeneity is characterized, and little is known of the characteristics of most smaller-sized (<50 kb) variants. Here we show that genome assembly comparison is a robust approach for identification of all classes of genetic variation. Through comparison of two human assemblies (Celera's R27c compilation and the Build 35 reference sequence), we identified megabases of sequence (in the form of 13,534 putative non-SNP events) that were absent, inverted or polymorphic in one assembly. Database comparison and laboratory experimentation further demonstrated overlap or validation for 240 variable regions and confirmed >1.5 million SNPs. Some differences were simple insertions and deletions, but in regions containing CNVs, segmental duplication and repetitive DNA, they were more complex. Our results uncover substantial undescribed variation in humans, highlighting the need for comprehensive annotation strategies to fully interpret genome scanning and personalized sequencing projects.