Human Evolution, Department of Organismal Biology, Uppsala University, Uppsala, Sweden.
Division of Scientific Computing, Department of Information Technology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden.
PLoS Genet. 2019 Jul 26;15(7):e1008302. doi: 10.1371/journal.pgen.1008302. eCollection 2019 Jul.
Haploid high-quality reference genomes are an important resource in genomic research projects. A consequence of mapping against such a reference is that DNA fragments carrying the reference allele are more likely to map successfully or to receive higher quality scores. This reference bias can affect downstream population genomic analyses when heterozygous sites are falsely considered homozygous for the reference allele. In palaeogenomic studies of human populations, mapping against the human reference genome is used to identify endogenous human sequences. Ancient DNA studies usually operate at low sequencing coverage, and fragmentation of DNA molecules causes a large proportion of the sequenced fragments to be shorter than 50 bp, which reduces the number of accepted mismatches and increases the probability of multiple matching sites in the genome. These ancient-DNA-specific properties potentially exacerbate the impact of reference bias on downstream analyses, especially since most studies of ancient human populations use pseudo-haploid data, i.e. they randomly sample only one sequencing read per site. We show that reference bias is pervasive in published ancient DNA sequence data of prehistoric humans, with some differences between individual genomic regions. We illustrate that the strength of reference bias is negatively correlated with fragment length. Most genomic regions we investigated show little to no mapping bias, but even a small proportion of sites with bias can impact analyses of those particular loci or slightly skew genome-wide estimates. Therefore, reference bias has the potential to cause minor but significant differences in the results of downstream analyses such as population allele sharing, heterozygosity estimates and estimates of archaic ancestry. These spurious results highlight how important it is to be aware of these technical artifacts and the need for strategies to mitigate their effect. We therefore suggest some post-mapping filtering strategies to resolve reference bias, which help to reduce its impact substantially.
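The interaction between reference bias and pseudo-haploid sampling described above can be illustrated with a small simulation. The following Python sketch is not the analysis used in the study; it assumes a toy model in which every site is truly heterozygous, both alleles are sequenced equally often, and a read maps with a probability that depends only on the allele it carries. The function names and the mapping probabilities (`p_map_ref`, `p_map_alt`) are hypothetical and chosen for illustration only.

```python
import random


def expected_ref_fraction(p_map_ref, p_map_alt):
    """Expected share of mapped reads carrying the reference allele at a
    truly heterozygous site, assuming both alleles are sequenced equally
    often and differ only in their (hypothetical) mapping probability."""
    total = p_map_ref + p_map_alt
    if total <= 0:
        raise ValueError("at least one allele must be mappable")
    return p_map_ref / total


def pseudo_haploid_call(mapped_alleles, rng):
    """Pseudo-haploid call: randomly sample the allele of one mapped read."""
    return rng.choice(mapped_alleles)


def simulate_ref_calls(n_sites, depth, p_map_ref, p_map_alt, seed=0):
    """Fraction of pseudo-haploid calls that return the reference allele
    across n_sites heterozygous sites, each sequenced with `depth` reads."""
    rng = random.Random(seed)
    ref_calls = 0
    called_sites = 0
    for _ in range(n_sites):
        mapped = []
        for _ in range(depth):
            # Each read carries either allele with equal probability.
            allele = "ref" if rng.random() < 0.5 else "alt"
            p_map = p_map_ref if allele == "ref" else p_map_alt
            # The read is kept only if it maps successfully.
            if rng.random() < p_map:
                mapped.append(allele)
        if not mapped:
            continue  # no mapped read, so no call at this site
        called_sites += 1
        if pseudo_haploid_call(mapped, rng) == "ref":
            ref_calls += 1
    return ref_calls / called_sites if called_sites else float("nan")


if __name__ == "__main__":
    # Hypothetical mapping probabilities; not estimates from the study.
    print(expected_ref_fraction(1.0, 0.9))
    print(simulate_ref_calls(n_sites=10000, depth=2, p_map_ref=1.0, p_map_alt=0.9))
```

Under this toy model, any gap between the two mapping probabilities pushes the share of reference calls at heterozygous sites above one half, which is the kind of skew that can propagate into allele sharing and heterozygosity estimates.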