Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, 04103 Leipzig, Germany.
Bioinformatics. 2022 Aug 2;38(15):3768-3777. doi: 10.1093/bioinformatics/btac390.
Human ancient DNA (aDNA) studies have surged in recent years, revolutionizing the study of the human past. aDNA is typically poorly preserved, which makes such data prone to contamination from other human DNA. It is therefore important to rule out substantial contamination before proceeding to downstream analysis. As most aDNA samples can only be sequenced to low coverage (<1× average depth), computational methods that can robustly estimate contamination in the low-coverage regime are needed. However, the ultra-low-coverage regime (0.1× and below) remains challenging for existing approaches.
We present a new method to estimate contamination in aDNA for male modern humans. It uses a Li & Stephens haplotype copying model for haploid X chromosomes, with mismatches modeled as errors or contamination. We assessed this new approach, hapCon, on simulated and down-sampled empirical aDNA data. Our experiments demonstrate that hapCon outperforms a commonly used tool for estimating male X contamination (ANGSD), with substantially lower variance and narrower confidence intervals, especially in the low-coverage regime. We found that hapCon provides useful contamination estimates for coverages as low as 0.1× for SNP capture data (1240k) and 0.02× for whole genome sequencing data, substantially extending the coverage limit of previous male X chromosome-based contamination estimation methods. The experiments also show that hapCon has little bias for contamination up to 25-30% as long as the contaminating source is specified within continental genetic variation, and that its application range extends to human aDNA as old as ∼45 000 years and to various global ancestries.
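The idea of modeling mismatches as errors or contamination can be illustrated with a deliberately simplified sketch. It is not the hapCon implementation: hapCon integrates over the unknown endogenous haplotype with a Li & Stephens copying model, whereas this toy version assumes the endogenous haploid genotype at each site is already known. The function names, the grid search, the default error rate and the simulation settings are all assumptions made for illustration.

```python
import math
import random


def read_derived_prob(genotype, contam_freq, c, err):
    """Probability that one read shows the derived allele.

    genotype    : endogenous haploid genotype (0 or 1), assumed known here
    contam_freq : derived-allele frequency in the contaminating population
    c           : contamination rate (fraction of reads from the contaminant)
    err         : symmetric per-read error rate
    """
    endo = genotype * (1 - err) + (1 - genotype) * err
    cont = contam_freq * (1 - err) + (1 - contam_freq) * err
    return (1 - c) * endo + c * cont


def log_likelihood(c, sites, err=0.01):
    """Binomial log-likelihood of c, up to a constant that does not depend on c."""
    if not 0 < err < 0.5:
        raise ValueError("err must lie strictly between 0 and 0.5")
    ll = 0.0
    for genotype, contam_freq, n_derived, n_total in sites:
        p = read_derived_prob(genotype, contam_freq, c, err)
        ll += n_derived * math.log(p) + (n_total - n_derived) * math.log(1 - p)
    return ll


def estimate_contamination(sites, err=0.01, grid_size=101, c_max=0.5):
    """Grid-search maximum-likelihood estimate of c on [0, c_max]."""
    grid = [c_max * i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda c: log_likelihood(c, sites, err))


def simulate_sites(n_sites, c, err=0.01, depth=1, seed=0):
    """Simulate (genotype, contam_freq, n_derived, n_total) tuples.

    Toy assumption: the endogenous allele is drawn from the same
    frequency as the contaminating population.
    """
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        freq = rng.uniform(0.05, 0.95)
        genotype = 1 if rng.random() < freq else 0
        p = read_derived_prob(genotype, freq, c, err)
        n_derived = sum(rng.random() < p for _ in range(depth))
        sites.append((genotype, freq, n_derived, depth))
    return sites


if __name__ == "__main__":
    data = simulate_sites(5000, c=0.1, seed=1)
    print("estimated contamination:", estimate_contamination(data))
```

On a haploid X chromosome every endogenous read should carry the same allele at a given site, so reads that disagree with it can only come from sequencing error or from a contaminant. The mixture probability above encodes that reasoning.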
hapCon is distributed as part of the Python package hapROH, which is hosted on the Python Package Index (https://pypi.org/project/hapROH) and can be installed via pip. The documentation provides example use cases as blueprints for custom applications (https://haproh.readthedocs.io/en/latest/hapCon.html). The program can analyze either BAM files or pileup files produced with samtools. An implementation of our software (hapCon) using Python and C is deposited at https://github.com/hyl317/hapROH.
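As an illustration of the pileup input mentioned above, the following sketch counts the bases observed at one site from a single line of samtools mpileup text output. It is not hapCon's own parser, and the helper names `count_bases` and `parse_pileup_line` are hypothetical. It relies only on the standard pileup conventions: `.` and `,` denote matches to the reference base, `^` is followed by one mapping-quality character, and `+N`/`-N` introduce indels of length N.

```python
from collections import Counter


def count_bases(ref, bases):
    """Count A/C/G/T observations in the read-bases column of a pileup line."""
    ref = ref.upper()
    counts = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # Read-start marker: skip it and the mapping-quality character.
            i += 2
            continue
        if ch in "+-":
            # Indel: skip the length digits and the inserted/deleted bases.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
            continue
        if ch in ".,":
            counts[ref] += 1          # match to the reference base
        elif ch.upper() in "ACGT":
            counts[ch.upper()] += 1   # mismatch on either strand
        # Anything else ($, *, <, >, N, ...) is ignored.
        i += 1
    return counts


def parse_pileup_line(line):
    """Return (chromosome, 1-based position, base counts) for one pileup line."""
    chrom, pos, ref, _depth, bases = line.rstrip("\n").split("\t")[:5]
    return chrom, int(pos), count_bases(ref, bases)


if __name__ == "__main__":
    # Made-up example line, not taken from real data.
    example = "X\t2700157\tG\t4\t^].,A$\tIIII"
    print(parse_pileup_line(example))
```

Per-site allele counts of this kind are what a read-based contamination model consumes. For actual analyses, the hapCon documentation linked above describes the supported invocation.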
Supplementary data are available at Bioinformatics online.