Mu John C, Tootoonchi Afshar Pegah, Mohiyuddin Marghoob, Chen Xi, Li Jian, Bani Asadi Narges, Gerstein Mark B, Wong Wing H, Lam Hugo Y K
Bina Technologies, Roche Sequencing, Redwood City, CA 94065, USA.
Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA.
Sci Rep. 2015 Sep 28;5:14493. doi: 10.1038/srep14493.
A high-confidence, comprehensive human variant set is critical for assessing the accuracy of sequencing algorithms, which are crucial to precision medicine based on high-throughput sequencing. Although recent work has attempted to provide such a resource, the available sets still do not encompass all major variant types, notably structural variants (SVs). We therefore leveraged the large body of high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set for a single individual, cross-validated with deep Illumina sequencing, population datasets, and well-established algorithms. A complete reanalysis of the HuRef genome was necessary because its previously published variants were mostly reported five years ago and suffer from compatibility, organization, and accuracy issues that prevent their direct use in benchmarking. Our extensive analysis and validation resulted in a gold set with high specificity and sensitivity. In contrast to the current gold sets for the NA12878 and HS1011 genomes, ours is the first to include small variants, deletion SVs, and insertion SVs up to a hundred thousand base pairs. We demonstrate the utility of our HuRef gold set by benchmarking several published SV detection tools.
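As an illustration of what benchmarking an SV detection tool against a gold set can involve, the following is a minimal sketch and not the procedure used in the paper. It assumes deletion calls are matched to gold-set deletions by 50% reciprocal overlap, a common convention that the abstract does not specify. The `Deletion` type, the `benchmark` function, the `min_overlap` threshold, and the coordinates are all hypothetical. It reports precision rather than specificity, since true negatives are not well defined for variant calls.

```python
from typing import List, NamedTuple, Tuple


class Deletion(NamedTuple):
    """Hypothetical record for one deletion SV (half-open coordinates)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def reciprocal_overlap(a: Deletion, b: Deletion) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if disjoint."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def benchmark(calls: List[Deletion], gold: List[Deletion],
              min_overlap: float = 0.5) -> Tuple[float, float]:
    """Return (sensitivity, precision) of calls against the gold set.

    Assumption: a call and a gold variant match when their reciprocal
    overlap is at least min_overlap.
    """
    gold_found = sum(
        any(reciprocal_overlap(g, c) >= min_overlap for c in calls)
        for g in gold
    )
    calls_confirmed = sum(
        any(reciprocal_overlap(c, g) >= min_overlap for g in gold)
        for c in calls
    )
    sensitivity = gold_found / len(gold) if gold else 0.0
    precision = calls_confirmed / len(calls) if calls else 0.0
    return sensitivity, precision


# Made-up coordinates, for illustration only.
gold = [
    Deletion("chr1", 1000, 2000),
    Deletion("chr1", 5000, 5600),
    Deletion("chr2", 300, 900),
]
calls = [
    Deletion("chr1", 1100, 2050),  # overlaps the first gold deletion
    Deletion("chr1", 7000, 7500),  # no counterpart in the gold set
]
sensitivity, precision = benchmark(calls, gold)
print(f"sensitivity={sensitivity:.3f} precision={precision:.3f}")
```

The all-pairs comparison keeps the sketch short. For genome-scale call sets, an interval index or a sorted sweep would be the usual choice.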