Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA.
National Human Genome Research Institute, National Institutes of Health, Rockville, MD, USA.
Nat Biotechnol. 2020 Nov;38(11):1347-1355. doi: 10.1038/s41587-020-0538-8. Epub 2020 Jun 15.
New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution and comprehensiveness. To help translate these methods to routine research and clinical practice, we developed a sequence-resolved benchmark set for identification of both false-negative and false-positive germline large insertions and deletions. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle Consortium integrated 19 sequence-resolved variant calling methods from diverse technologies. The final benchmark set contains 12,745 isolated, sequence-resolved insertion (7,281) and deletion (5,464) calls ≥50 base pairs (bp). The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.51 Gbp and 5,262 insertions and 4,095 deletions supported by ≥1 diploid assembly. We demonstrate that the benchmark set reliably identifies false negatives and false positives in high-quality SV callsets from short-, linked- and long-read sequencing and optical mapping.
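As a rough illustration of how the counts in the abstract fit together, and of how false negatives and false positives against a benchmark are commonly summarized, a minimal Python sketch follows. The dictionary and function names are hypothetical, and the precision/recall formulas are the standard definitions, not necessarily the exact evaluation procedure used by the authors.

```python
# Counts quoted in the abstract (calls >= 50 bp).
BENCHMARK = {"insertions": 7281, "deletions": 5464}
# Tier 1 calls supported by >= 1 diploid assembly.
TIER1 = {"insertions": 5262, "deletions": 4095}


def total(calls):
    """Sum insertion and deletion counts."""
    return sum(calls.values())


def precision_recall(tp, fp, fn):
    """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN).

    tp, fp, fn are hypothetical counts from comparing a callset
    to the benchmark; they are not taken from the paper.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Share of all benchmark calls that fall in the Tier 1 set.
tier1_fraction = total(TIER1) / total(BENCHMARK)
```

Here `total(BENCHMARK)` reproduces the 12,745 calls stated in the abstract, and `tier1_fraction` gives the share of those calls that are in the Tier 1 set.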