State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China.
National Institute of Metrology, Beijing, China.
Genome Biol. 2023 Nov 27;24(1):270. doi: 10.1186/s13059-023-03109-2.
Genomic DNA reference materials are widely recognized as essential for ensuring data quality in omics research. However, evaluating the accuracy of variant calls against reference datasets alone is incomplete, because those datasets are limited to benchmark regions. It is therefore important to develop DNA reference materials that enable the assessment of variant detection performance across the entire genome.
We established a DNA reference material suite from four immortalized cell lines derived from a family of two parents and monozygotic twins. Comprehensive reference datasets of 4.2 million small variants and 15,000 structural variants were integrated and certified for evaluating the reliability of germline variant calls inside the benchmark regions. Importantly, the genetic built-in truth of the Quartet family design enables estimation of the precision of variant calls outside the benchmark regions. When the Quartet reference materials are used alongside study samples, batch effects can be objectively monitored and alleviated by training a machine learning model on the Quartet reference datasets to remove potential artifact calls. Moreover, the matched RNA and protein reference materials and datasets from the Quartet project enable cross-omics validation of variant calls from multiomics data.
The Quartet DNA reference materials and reference datasets provide a unique resource for objectively assessing the quality of germline variant calls across the whole genome and for improving the reliability of large-scale genomic profiling.
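To illustrate how the genetic built-in truth of a parents-plus-monozygotic-twins design could be used to estimate precision without a benchmark, the following is a minimal sketch, not the authors' implementation. It assumes a simplified rule: a site counts as consistent when both twins carry the same genotype and that genotype can be formed from one allele of each parent. The function names, the tuple encoding of genotypes, and the example sites are all hypothetical.

```python
from itertools import product


def is_quartet_consistent(father, mother, twin1, twin2):
    """Return True if a site obeys the assumed Quartet inheritance rule.

    Genotypes are tuples of two allele codes, e.g. (0, 1) for a
    heterozygous call. Allele order within a genotype is ignored.
    """
    # Monozygotic twins are expected to share the same genotype.
    if sorted(twin1) != sorted(twin2):
        return False
    child = tuple(sorted(twin1))
    # Every genotype obtainable from one paternal and one maternal allele.
    possible = {tuple(sorted((a, b))) for a, b in product(father, mother)}
    return child in possible


def mendelian_consistency_rate(sites):
    """Fraction of sites that pass is_quartet_consistent.

    `sites` is a list of (father, mother, twin1, twin2) genotype tuples.
    """
    if not sites:
        raise ValueError("no sites supplied")
    consistent = sum(is_quartet_consistent(*site) for site in sites)
    return consistent / len(sites)


# Hypothetical calls at four sites: (father, mother, twin1, twin2).
example_sites = [
    ((0, 1), (0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 0), (0, 0), (0, 1), (0, 1)),  # twins agree, but not inheritable
    ((0, 1), (0, 1), (1, 1), (0, 1)),  # twins disagree
    ((1, 1), (0, 1), (1, 1), (1, 1)),  # consistent
]

if __name__ == "__main__":
    print(mendelian_consistency_rate(example_sites))
```

Under this assumed rule, sites that fail the check are candidates for calling errors (or, rarely, genuine de novo events), so the consistency rate can serve as a rough proxy for precision in regions where no reference dataset exists.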