Racimo Fernando, Renaud Gabriel, Slatkin Montgomery
Department of Integrative Biology, University of California, Berkeley, Berkeley, California, United States of America.
Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany.
PLoS Genet. 2016 Apr 6;12(4):e1005972. doi: 10.1371/journal.pgen.1005972. eCollection 2016 Apr.
When sequencing an ancient DNA sample from a hominin fossil, DNA from present-day humans involved in excavation and extraction will be sequenced along with the endogenous material. This type of contamination is problematic for downstream analyses as it will introduce a bias towards the population of the contaminating individual(s). Quantifying the extent of contamination is a crucial step as it allows researchers to account for possible biases that may arise in downstream genetic analyses. Here, we present an MCMC algorithm to co-estimate the contamination rate, sequencing error rate and demographic parameters (including drift times and admixture rates) for an ancient nuclear genome obtained from human remains, when the putative contaminating DNA comes from present-day humans. We assume we have a large panel representing the putative contaminant population (e.g. European, East Asian or African). The method is implemented in a C++ program called 'Demographic Inference with Contamination and Error' (DICE). We applied it to simulations and genome data from ancient Neanderthals and modern humans. With reasonable levels of genome sequence coverage (>3X), we find we can recover accurate estimates of all these parameters, even when the contamination rate is as high as 50%.
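The co-estimation idea can be illustrated with a toy sketch. It is not DICE's implementation, and every function name, parameter and default in it is hypothetical. The assumed model is that each read comes from the contaminant with probability `contam`, in which case it carries the derived allele at the panel frequency, and otherwise from the endogenous genome with genotype 0, 1 or 2. A sequencing error then flips the observed allele with probability `err`. The demographic model is replaced by a fixed, user-supplied genotype prior, and a plain Metropolis sampler with flat priors explores `contam` and `err` only.

```python
import math
import random

def read_derived_prob(genotype, panel_freq, contam, err):
    """Probability that one read shows the derived allele (assumed toy model)."""
    # Mixture of contaminant reads (panel frequency) and endogenous reads.
    p = contam * panel_freq + (1.0 - contam) * genotype / 2.0
    # A sequencing error flips the observed allele.
    return p * (1.0 - err) + (1.0 - p) * err

def site_log_lik(derived, total, panel_freq, contam, err, geno_prior):
    """Binomial log-likelihood of one site, summed over genotypes 0, 1, 2."""
    lik = 0.0
    for g, prior in enumerate(geno_prior):
        q = read_derived_prob(g, panel_freq, contam, err)
        lik += prior * math.comb(total, derived) * q**derived * (1.0 - q)**(total - derived)
    return math.log(lik)

def metropolis(sites, geno_prior, n_iter=2000, step=0.02, seed=1):
    """Sample (contam, err) with flat priors on (0, 1) and (0, 0.5).

    sites: list of (derived_reads, total_reads, panel_freq) tuples.
    geno_prior: placeholder for the demographic model (hypothetical).
    """
    rng = random.Random(seed)

    def log_post(c, e):
        if not (0.0 < c < 1.0 and 0.0 < e < 0.5):
            return -math.inf  # outside the support of the flat priors
        return sum(site_log_lik(d, n, f, c, e, geno_prior) for d, n, f in sites)

    contam, err = 0.1, 0.01  # arbitrary starting values
    cur = log_post(contam, err)
    samples = []
    for _ in range(n_iter):
        # Symmetric Gaussian random-walk proposal.
        c_new = contam + rng.gauss(0.0, step)
        e_new = err + rng.gauss(0.0, step / 4.0)
        new = log_post(c_new, e_new)
        # Metropolis acceptance rule.
        if rng.random() < math.exp(min(0.0, new - cur)):
            contam, err, cur = c_new, e_new, new
        samples.append((contam, err))
    return samples
```

In this sketch, information about `contam` comes from how the derived-read counts track the panel frequency across sites. In the method described above, the genotype prior would instead depend on the drift times and admixture rates, which are sampled jointly with the other parameters.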