DNAnexus, 1975 W El Camino Real #204, Mountain View, CA 94040, USA.
Department of Computer Science, 3400 N. Charles St. Johns Hopkins University, Baltimore, MD 21218, USA.
Gigascience. 2020 Dec 21;9(12). doi: 10.1093/gigascience/giaa145.
Structural variants (SVs) are critical contributors to genetic diversity and genomic disease. To predict the phenotypic impact of SVs, there is a need for better estimates of both the occurrence and frequency of SVs, preferably from large, ethnically diverse cohorts. The current standard approach at this scale relies on short paired-end reads, from which SVs remain challenging to detect, especially across hundreds to thousands of samples.
We present Parliament2, a consensus SV framework that leverages multiple best-in-class methods to identify high-quality SVs from short-read DNA sequence data at scale. Parliament2 incorporates pre-installed SV callers that are optimized for efficient parallel execution, reducing overall runtime and cost. We evaluate the accuracy of Parliament2 on data from the NovaSeq and HiSeq X platforms against the Genome in a Bottle (GIAB) SV call set across all size classes. The reported quality score per SV is calibrated across different SV types and size classes. Parliament2 has the highest F1 score (74.27%) when measured against the independent gold standard from GIAB. We illustrate the compute performance by processing all 1000 Genomes samples (2,691 samples) in <1 day on GRCh38. Parliament2 improves the runtime performance of individual methods and is open source (https://github.com/slzarate/parliament2); a Docker image and a WDL implementation are also available.
Parliament2 provides a highly accurate single-sample SV call set from short-read DNA sequence data and enables cost-efficient processing of thousands of samples in cloud or cluster environments.
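To make the two ideas above concrete (consensus across callers, and scoring by F1), the following is a minimal illustrative sketch, not Parliament2's actual merging or scoring code. The `SVCall` record, the `consensus_calls` and `f1_score` functions, the caller names, the 1,000 bp breakpoint window, and the two-caller support threshold are all assumptions introduced for illustration; the abstract does not specify how calls are merged or matched to the truth set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    """Hypothetical minimal SV record; not Parliament2's data model."""
    caller: str
    chrom: str
    start: int
    end: int
    svtype: str


def consensus_calls(calls, max_dist=1000, min_callers=2):
    """Greedily cluster calls of the same type whose breakpoints lie within
    max_dist of a cluster's first call, then keep clusters that are
    supported by at least min_callers distinct callers."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start)):
        for cluster in clusters:
            seed = cluster[0]
            if (seed.chrom == call.chrom
                    and seed.svtype == call.svtype
                    and abs(seed.start - call.start) <= max_dist
                    and abs(seed.end - call.end) <= max_dist):
                cluster.append(call)
                break
        else:
            # No existing cluster matched, so this call seeds a new one.
            clusters.append([call])
    return [c for c in clusters if len({x.caller for x in c}) >= min_callers]


def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy input with made-up caller names and coordinates.
example_calls = [
    SVCall("caller_a", "chr1", 1000, 2000, "DEL"),
    SVCall("caller_b", "chr1", 1050, 2010, "DEL"),
    SVCall("caller_a", "chr1", 5000, 6000, "DUP"),
    SVCall("caller_c", "chr2", 1000, 2000, "DEL"),
]
supported = consensus_calls(example_calls)
```

In this toy input only the chr1 deletion is reported by two callers, so it is the only cluster retained. Requiring agreement between callers trades some recall for precision, and F1 is one way to summarize that trade-off as a single number.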