Systematic benchmark of ancient DNA read mapping.

Affiliations

Australian Centre for Ancient DNA, School of Biological Sciences, The University of Adelaide, South Australia, 5005, Australia.

South Australian Museum, Adelaide, SA 5005, Australia.

Publication information

Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbab076.

DOI:10.1093/bib/bbab076

PMID:33834210

Abstract

The current standard practice for assembling individual genomes involves mapping millions of short DNA sequences (also known as DNA 'reads') against a pre-constructed reference genome. Mapping vast amounts of short reads in a timely manner is a computationally challenging task that inevitably produces artefacts, including biases against alleles not found in the reference genome. This reference bias and other mapping artefacts are expected to be exacerbated in ancient DNA (aDNA) studies, which rely on the analysis of low quantities of damaged and very short DNA fragments (~30-80 bp). Nevertheless, the current gold-standard mapping strategies for aDNA studies have effectively remained unchanged for nearly a decade, during which time new software has emerged. In this study, we used simulated aDNA reads from three different human populations to benchmark the performance of 30 distinct mapping strategies implemented across four different read mapping software packages (BWA-aln, BWA-mem, NovoAlign and Bowtie2) and quantified the impact of reference bias in downstream population genetic analyses. We show that specific NovoAlign, BWA-aln and BWA-mem parameterizations achieve high mapping precision with low levels of reference bias, particularly after filtering out reads with low mapping qualities. However, unbiased NovoAlign results required the use of an IUPAC reference genome. While relevant only to aDNA projects where reference population data are available, the benefit of using an IUPAC reference demonstrates the value of incorporating population genetic information into the aDNA mapping process, echoing recent results based on graph genome representations.

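The two quantities the abstract turns on, mapping precision and reference bias before and after a mapping-quality filter, can be illustrated with a minimal sketch. The abstract does not give the study's exact metric definitions, so everything here is an assumption made for illustration: the `MappedRead` record, its field names, the MAPQ threshold of 25, and the toy reads are all hypothetical. Precision is taken as the share of mapped reads placed at their simulated origin, and reference bias is summarized as the share of reads carrying the reference allele at a heterozygous site, where a value near 0.5 would be expected without bias.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class MappedRead:
    """Hypothetical record for one simulated read after mapping."""
    mapq: int          # mapping quality reported by the aligner
    true_pos: int      # simulated origin (known only because reads are simulated)
    mapped_pos: int    # position assigned by the aligner
    carries_ref: bool  # True if the read shows the reference allele at a het site


def filter_by_mapq(reads: Sequence[MappedRead], min_mapq: int) -> List[MappedRead]:
    """Keep only reads whose mapping quality is at least min_mapq."""
    return [r for r in reads if r.mapq >= min_mapq]


def precision(reads: Sequence[MappedRead]) -> Optional[float]:
    """Share of mapped reads placed at their true origin (None if no reads)."""
    if not reads:
        return None
    correct = sum(r.mapped_pos == r.true_pos for r in reads)
    return correct / len(reads)


def reference_allele_fraction(reads: Sequence[MappedRead]) -> Optional[float]:
    """Share of reads carrying the reference allele (None if no reads).

    At a truly heterozygous site, values above 0.5 suggest reference bias.
    """
    if not reads:
        return None
    return sum(r.carries_ref for r in reads) / len(reads)


# Toy data (invented for illustration, not taken from the study).
toy_reads = [
    MappedRead(mapq=37, true_pos=100, mapped_pos=100, carries_ref=True),
    MappedRead(mapq=37, true_pos=200, mapped_pos=200, carries_ref=False),
    MappedRead(mapq=30, true_pos=300, mapped_pos=300, carries_ref=True),
    MappedRead(mapq=5, true_pos=400, mapped_pos=900, carries_ref=True),
    MappedRead(mapq=0, true_pos=500, mapped_pos=120, carries_ref=True),
]

kept = filter_by_mapq(toy_reads, min_mapq=25)  # threshold is an assumption
print("precision before/after:", precision(toy_reads), precision(kept))
print("ref fraction before/after:",
      reference_allele_fraction(toy_reads), reference_allele_fraction(kept))
```

In this toy set the two mismapped reads both have low mapping quality and both carry the reference allele, so removing them raises precision and moves the reference-allele share toward 0.5. That mirrors the direction of the effect the abstract reports, but the numbers themselves carry no empirical meaning.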