Weisenfeld Neil I, Yin Shuangye, Sharpe Ted, Lau Bayo, Hegarty Ryan, Holmes Laurie, Sogoloff Brian, Tabbaa Diana, Williams Louise, Russ Carsten, Nusbaum Chad, Lander Eric S, MacCallum Iain, Jaffe David B
The Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.
Nat Genet. 2014 Dec;46(12):1350-5. doi: 10.1038/ng.3121. Epub 2014 Oct 19.
Complete knowledge of the genetic variation in individual human genomes is a crucial foundation for understanding the etiology of disease. Genetic variation is typically characterized by sequencing individual genomes and comparing reads to a reference. Existing methods do an excellent job of detecting variants in approximately 90% of the human genome; however, calling variants in the remaining 10% of the genome (largely low-complexity sequence and segmental duplications) is challenging. To improve variant calling, we developed a new algorithm, DISCOVAR, and examined its performance on improved, low-cost sequence data. Using a newly created reference set of variants from the finished sequence of 103 randomly chosen fosmids, we find that some standard variant call sets miss up to 25% of variants. We show that the combination of new methods and improved data increases sensitivity by several fold, with the greatest impact in challenging regions of the human genome.
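As a rough illustration of the evaluation described in the abstract, the sketch below computes the sensitivity of a variant call set against a reference ("truth") set, that is, the fraction of reference variants the call set recovers. It is not the DISCOVAR method or the authors' evaluation pipeline. The `Variant` tuple layout, the `sensitivity` function, and the toy variants are all hypothetical, and exact tuple matching is a simplifying assumption: real comparisons typically need to normalize how equivalent variants are represented.

```python
from typing import Iterable, Tuple

# Hypothetical variant representation (an assumption, not from the paper):
# (chromosome, 1-based position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def sensitivity(reference_set: Iterable[Variant], call_set: Iterable[Variant]) -> float:
    """Return the fraction of reference variants that also appear in the call set."""
    truth = set(reference_set)
    if not truth:
        raise ValueError("reference set is empty")
    # Exact-match intersection; assumes both sets use the same representation.
    found = truth & set(call_set)
    return len(found) / len(truth)


# Toy data, invented purely for illustration.
truth = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "GA"),
    ("chr2", 900, "T", "C"),
]
calls = [
    ("chr1", 100, "A", "G"),
    ("chr2", 40, "G", "GA"),
    ("chr2", 900, "T", "C"),
    ("chr3", 5, "A", "C"),  # not in the reference set; does not affect sensitivity
]

sens = sensitivity(truth, calls)
missed_fraction = 1.0 - sens
print(f"sensitivity = {sens:.2f}, missed = {missed_fraction:.2f}")
```

In this toy case one of the four reference variants is absent from the call set, so the missed fraction is 25%, the same order as the upper bound the abstract reports for some standard call sets.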