Briskine Roman V, Shimizu Kentaro K
Department of Evolutionary Biology and Environmental Studies, University of Zurich, Winterthurerstrasse 190, Zurich, CH-8057, Switzerland.
Functional Genomics Center Zurich, Winterthurerstrasse 190, Zurich, CH-8057, Switzerland.
BMC Genomics. 2017 Mar 28;18(1):263. doi: 10.1186/s12864-017-3637-2.
Whole-genome resequencing projects may perform variant calling against draft reference genomes assembled de novo from short-read libraries. Despite the lower quality of such assemblies, they have allowed researchers to extend a wide range of population genetic and genome-wide association analyses to non-model species. Because variant calling pipelines are complex and involve many software packages, it is important to understand the inherent biases and limitations at each step of the analysis.
In this article, we report a positional bias present in variant calling performed against draft reference assemblies constructed from de Bruijn or string overlap graphs. We assessed how frequently variants appeared at each position counted from the ends of a contig or scaffold sequence, and discovered an unexpectedly high number of variants at positions related to the length of either the k-mers or the reads used for the assembly. We detected the bias both in publicly available draft assemblies from the Assemblathon 2 competition and in assemblies we generated from our simulated short-read data. Simulations confirmed that the bias-causing variants are predominantly false positives induced by reads from spatially distant repeated sequences. The bias is particularly strong in contig assemblies. Scaffolding does not eliminate the bias but tends to mitigate it, because it changes the variants' relative positions and alters read alignments. The bias can be effectively reduced by filtering out variants that reside in repetitive elements.
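The positional assessment described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `end_distance_counts`, the `max_dist` cutoff, the toy contigs, and the choice to fold both contig ends into a single distance-to-nearest-end histogram are all assumptions made for illustration. Variants are assumed to be given as 1-based `(contig, position)` pairs.

```python
from collections import Counter

def end_distance_counts(variants, contig_lengths, max_dist=200):
    """Count variants by 1-based distance from the nearest contig end.

    variants: iterable of (contig_name, 1-based position) pairs.
    contig_lengths: dict mapping contig_name -> length in bases.
    max_dist: hypothetical cutoff; variants farther from both ends are ignored.
    """
    counts = Counter()
    for contig, pos in variants:
        length = contig_lengths[contig]
        # Distance 1 means the first or the last base of the contig.
        dist = min(pos, length - pos + 1)
        if dist <= max_dist:
            counts[dist] += 1
    return counts

# Toy data (hypothetical): three variants sit 31 bases from a contig end.
contig_lengths = {"ctg1": 1000, "ctg2": 500}
variants = [("ctg1", 31), ("ctg1", 970), ("ctg2", 31), ("ctg2", 250)]
counts = end_distance_counts(variants, contig_lengths)
for dist, n in sorted(counts.items()):
    print(dist, n)
```

A pronounced peak in such a histogram at a distance related to the k-mer or read length used for the assembly would be the kind of signal the abstract describes; a real analysis would read positions from a VCF file and contig lengths from the assembly.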
Draft genome sequences generated by several popular assemblers appear to be susceptible to the positional bias, which potentially affects many resequencing projects in non-model species. The bias is inherent to the assembly algorithms and arises from their particular handling of repeated sequences. We recommend reducing the bias by filtering, especially if a higher-quality genome assembly cannot be achieved. Our findings can help other researchers improve the quality of their variant data sets and reduce artefactual findings in downstream analyses.
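The recommended filtering step could look roughly like the following sketch, which drops variants that fall inside annotated repeat intervals. The helper names (`build_index`, `in_repeat`, `filter_variants`) and the toy coordinates are hypothetical, and repeats are assumed to be 1-based inclusive `(contig, start, end)` intervals, for example converted from a repeat annotation; the abstract does not specify the tool or format the authors used.

```python
from bisect import bisect_right

def build_index(repeats):
    """Merge overlapping or adjacent repeat intervals per contig.

    repeats: iterable of (contig, start, end), 1-based inclusive.
    Returns {contig: (sorted_starts, matching_ends)}.
    """
    by_contig = {}
    for contig, start, end in sorted(repeats):
        merged = by_contig.setdefault(contig, [])
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return {c: ([s for s, _ in iv], [e for _, e in iv])
            for c, iv in by_contig.items()}

def in_repeat(index, contig, pos):
    """Return True if the 1-based position lies inside a repeat interval."""
    if contig not in index:
        return False
    starts, ends = index[contig]
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and pos <= ends[i]

def filter_variants(variants, repeats):
    """Keep only variants located outside all repeat intervals."""
    index = build_index(repeats)
    return [v for v in variants if not in_repeat(index, v[0], v[1])]

# Toy data (hypothetical coordinates).
repeats = [("ctg1", 20, 40), ("ctg1", 35, 60), ("ctg2", 100, 120)]
variants = [("ctg1", 31), ("ctg1", 61), ("ctg2", 100), ("ctg2", 99), ("ctg3", 5)]
kept = filter_variants(variants, repeats)
print(kept)
```

Merging intervals first keeps each lookup to a single binary search, which matters when millions of variants are checked against a large repeat annotation.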