Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland.
Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland.
Bioinformatics. 2020 Feb 1;36(3):828-841. doi: 10.1093/bioinformatics/btz660.
Contamination by present-day human DNA fragments is one of the defining challenges of ancient DNA (aDNA) research. The problem is especially acute in the ancient human DNA field, where endogenous molecules are difficult to distinguish from human contaminants because the two are genetically so similar. Recently, with the advent of high-throughput sequencing and new aDNA protocols, hundreds of ancient human genomes have become available. Contamination in those genomes has been measured with computational methods that were often developed specifically for the empirical studies in question. Consequently, some of these methods have not been implemented and tested for general use, and few are aimed at low-depth nuclear data, a common feature of aDNA datasets.
We develop a new X-chromosome-based maximum likelihood method for estimating present-day human contamination in low-depth sequencing data from male individuals. We implement our method for general use, assess its performance under conditions typical of ancient human DNA research, and compare it to previous nuclear data-based methods through extensive simulations. For low-depth data, we show that existing methods can produce unusable estimates or substantially underestimate contamination. In contrast, our method provides accurate estimates at a depth of coverage as low as 0.5× on the X-chromosome when contamination is below 25%. Moreover, our method still yields meaningful estimates in very challenging situations, i.e. when the contaminant and the target come from closely related populations or when error rates are increased. With a running time below 5 min, our method is applicable to large-scale aDNA genomic studies.
The method is implemented in C++ and R and is available at github.com/sapfo/contaminationX and popgen.dk/angsd.
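To make the idea behind an X-chromosome-based maximum likelihood estimate more concrete, the following is a minimal Python sketch, not the authors' implementation. It relies on the fact that males carry a single X chromosome, so at a known polymorphic site any read that disagrees with the sample's own allele must come from either contamination or sequencing error. Everything specific here is an assumption made for illustration: the function names, the input format (read depth, mismatching reads, and reference-panel frequency of the non-endogenous allele per site), a single known error rate, a two-allele error model, and a grid search in place of a proper optimizer. The published method uses a more detailed model.

```python
import math


def log_likelihood(c, sites, error_rate):
    """Binomial log-likelihood of contamination fraction c (constant terms dropped).

    sites: iterable of (n_reads, n_mismatch, alt_freq) tuples, where alt_freq is the
    assumed frequency, in the contaminating population, of the allele that the
    haploid male sample does NOT carry. This input layout is hypothetical.
    """
    total = 0.0
    for n_reads, n_mismatch, alt_freq in sites:
        # A read truly carries the non-endogenous allele only if it comes from
        # the contaminant (probability c) and the contaminant carries that allele.
        p_true = c * alt_freq
        # Observed mismatch: a true mismatch read correctly, or a true match read in error.
        q = p_true * (1.0 - error_rate) + (1.0 - p_true) * error_rate
        total += n_mismatch * math.log(q) + (n_reads - n_mismatch) * math.log(1.0 - q)
    return total


def estimate_contamination(sites, error_rate, grid_size=1000):
    """Return the grid value of c in [0, 1] that maximizes the log-likelihood."""
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must lie strictly between 0 and 0.5")
    candidates = [i / grid_size for i in range(grid_size + 1)]
    return max(candidates, key=lambda c: log_likelihood(c, sites, error_rate))


# Toy input: counts chosen to match 10% contamination at a 1% error rate.
toy_sites = [(1000, 59, 0.5), (10000, 296, 0.2)]
print(estimate_contamination(toy_sites, error_rate=0.01))
```

In this toy input the mismatch counts are set to their expected values under 10% contamination, so the estimate lands at about 0.1. With real low-depth data each site contributes only a handful of reads, which is why the likelihood is summed over many polymorphic sites rather than evaluated site by site.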