Corbett Richard D, Eveleigh Robert, Whitney Joe, Barai Namrata, Bourgey Mathieu, Chuah Eric, Johnson Joanne, Moore Richard A, Moradin Neda, Mungall Karen L, Pereira Sergio, Reuter Miriam S, Thiruvahindrapuram Bhooma, Wintle Richard F, Ragoussis Jiannis, Strug Lisa J, Herbrick Jo-Anne, Aziz Naveed, Jones Steven J M, Lathrop Mark, Scherer Stephen W, Staffa Alfredo, Mungall Andrew J
Canada's Michael Smith Genome Sciences Centre, BC Cancer Research Institute, Provincial Health Services Authority, Vancouver, BC, Canada.
McGill Genome Centre, McGill University, Montreal, QC, Canada.
Front Genet. 2020 Dec 1;11:612515. doi: 10.3389/fgene.2020.612515. eCollection 2020.
Population sequencing often requires collaboration across a distributed network of sequencing centers for the timely processing of thousands of samples. In such large efforts, participating scientists need confidence that the accuracy of the sequence data does not depend on which center generates it. A study was conducted across three established sequencing centers, located in Montreal, Toronto, and Vancouver, constituting Canada's Genomics Enterprise (www.cgen.ca). Whole genome sequencing was performed at each center on three genomic DNA replicates from each of three well-characterized cell lines. The secondary analysis pipeline employed by each site was applied to the sequence data from every site, so that each of four variables (cell line, replicate, sequencing center, and analysis pipeline) had three conditions, for a total of 81 datasets (3 × 3 × 3 × 3). Each dataset was assessed against multiple quality metrics, including concordance with benchmark variant truth sets, to check for consistent quality across all three conditions of each variable. Three-way concordance analysis of variants across conditions for each variable was performed. Our results showed that the variant concordance between datasets differing only by sequencing center was similar to the concordance for datasets differing only by replicate, using the same analysis pipeline. We also showed that the statistically significant differences between datasets result from the analysis pipeline used, which can be unified and updated as new approaches become available. We conclude that genome sequencing projects can rely on the quality and reproducibility of aggregate data generated across a network of distributed sites.
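The study design described in the abstract can be illustrated with a short sketch. The Python below enumerates the full set of 81 datasets, groups them so that the members of each group differ in only one variable, and computes a simple three-way concordance between variant call sets. All labels, function names, and the concordance definition (variants shared by all three call sets divided by variants in their union) are assumptions made for illustration; the abstract does not specify how concordance was calculated, and this is not the authors' implementation.

```python
from itertools import product

# Placeholder labels (assumptions): the abstract names the three cities but
# not the cell lines, replicate identifiers, or pipelines.
CELL_LINES = ["cell_line_1", "cell_line_2", "cell_line_3"]
REPLICATES = ["rep_1", "rep_2", "rep_3"]
CENTERS = ["Montreal", "Toronto", "Vancouver"]
PIPELINES = ["pipeline_1", "pipeline_2", "pipeline_3"]


def enumerate_datasets():
    """Return every (cell line, replicate, center, pipeline) combination."""
    return list(product(CELL_LINES, REPLICATES, CENTERS, PIPELINES))


def groups_differing_only_by(index):
    """Group datasets that agree on every variable except the one at `index`.

    index 0 = cell line, 1 = replicate, 2 = center, 3 = pipeline.
    Each resulting group holds three datasets.
    """
    groups = {}
    for dataset in enumerate_datasets():
        key = dataset[:index] + dataset[index + 1:]
        groups.setdefault(key, []).append(dataset)
    return groups


def three_way_concordance(calls_a, calls_b, calls_c):
    """Fraction of variants in the union of three call sets found in all three.

    Each call set is a set of hashable variant keys, for example
    (chromosome, position, ref, alt) tuples. This Jaccard-style ratio is an
    assumed definition, chosen only to make the idea concrete.
    """
    union = calls_a | calls_b | calls_c
    if not union:
        return 1.0  # three empty call sets agree trivially
    return len(calls_a & calls_b & calls_c) / len(union)


if __name__ == "__main__":
    print(len(enumerate_datasets()))            # number of datasets
    print(len(groups_differing_only_by(2)))     # groups differing only by center
```

Under these assumptions, comparing the concordance values from `groups_differing_only_by(2)` (center) with those from `groups_differing_only_by(1)` (replicate) mirrors the comparison the abstract reports, with the pipeline held fixed within each group.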