Bioinformatics Institute, Saint Petersburg, Russia.
Department of Genomic Medicine, D. O. Ott Research Institute of Obstetrics, Gynecology, and Reproduction, Saint Petersburg, Russia.
Sci Rep. 2020 Feb 6;10(1):2057. doi: 10.1038/s41598-020-59026-y.
The advantages and diagnostic effectiveness of the two most widely used resequencing approaches, whole exome (WES) and whole genome (WGS) sequencing, are often debated. WES has dominated large-scale resequencing projects because of its lower cost and easier data storage and processing. The rapid development of third-generation sequencing methods and novel exome sequencing kits creates the need for a robust statistical framework that allows informative and easy performance comparison of the emerging methods. In our study we developed a set of statistical tools to systematically assess the coverage of coding regions provided by several modern WES platforms, as well as by PCR-free WGS. We identified a substantial problem in most previously published comparisons: they did not account for the mappability limitations of short reads. Using regression analysis and simple machine learning, as well as several novel metrics of coverage evenness, we analyzed the contributions of the major determinants of CDS coverage. Contrary to a common view, most of the observed bias in modern WES stems from the mappability limitations of short reads and from exome probe design rather than from sequence composition. We also identified a ~500 kb region of the human exome that cannot be effectively characterized using short-read technology and should receive special attention during variant analysis. Using our novel metrics of sequencing coverage, we identified the main determinants of WES and WGS performance. Overall, our study points out avenues for improving enrichment-based methods and for developing novel approaches that would maximize variant discovery at optimal cost.
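To make the idea of a coverage evenness summary concrete, here is a minimal sketch in Python. It is not the set of metrics developed in the study, which the abstract does not define. The function name `coverage_summary`, the 10x depth threshold, and the 0.2 x mean cutoff are all illustrative assumptions, and the input is assumed to be a plain list of per-base read depths over coding positions.

```python
from statistics import mean


def coverage_summary(depths, min_depth=10, uniformity_fraction=0.2):
    """Summarize per-base read depths over coding (CDS) positions.

    Hypothetical helper for illustration only; thresholds are assumptions,
    not values taken from the study.
    """
    if not depths:
        raise ValueError("depths must be non-empty")
    n = len(depths)
    avg = mean(depths)
    # Fraction of positions covered at or above the minimum depth.
    frac_at_min_depth = sum(d >= min_depth for d in depths) / n
    # Fraction of positions whose depth is at least a set share of the mean,
    # a simple generic proxy for how evenly reads are spread.
    frac_near_mean = sum(d >= uniformity_fraction * avg for d in depths) / n
    return {
        "mean_depth": avg,
        "frac_at_min_depth": frac_at_min_depth,
        "frac_near_mean": frac_near_mean,
    }


# Two toy regions with the same mean depth but very different evenness.
even = coverage_summary([30, 32, 28, 31, 29, 30])
uneven = coverage_summary([90, 85, 2, 0, 1, 2])
```

The two toy inputs share the same mean depth, yet only the first keeps every position above both thresholds. This shows why mean depth alone is a poor basis for comparing platforms.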