Ravishankar Shyamsundar, Perez Vilma, Davidson Roberta, Roca-Rada Xavier, Lan Divon, Souilmi Yassine, Llamas Bastien
Australian Centre for Ancient DNA (ACAD) and The Environment Institute, The School of Biological Sciences, University of Adelaide, Adelaide, SA, Australia.
Centre of Excellence for Australian Biodiversity and Heritage, University of Adelaide, Adelaide, SA, Australia.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae646.
Contamination with exogenous DNA presents a significant challenge in ancient DNA (aDNA) studies of single organisms. Failure to address contamination from microbes, reagents, and present-day sources can impact the interpretation of results. Although field and laboratory protocols exist to limit contamination, there is still a need to accurately distinguish between endogenous and exogenous data computationally. Here, we propose a workflow to reduce exogenous contamination based on a metagenomic classifier. Unlike previous methods that relied exclusively on the mapping specificity of DNA sequencing reads to a single reference genome to remove contaminating reads, our approach uses Kraken2-based filtering before mapping to the reference genome. Using both simulated and empirical shotgun aDNA data, we show that this workflow presents a simple and efficient method that can be used in a wide range of computational environments, including personal machines. We propose strategies for building the specific databases used to profile sequencing data; these strategies take into consideration available computational resources and prior knowledge about the target taxa and likely contaminants. Our workflow significantly reduces the overall computational resources required during the mapping process and reduces the total runtime by up to ~94%. The most significant impacts are observed in samples with low endogenous DNA content. Importantly, contaminants that would otherwise map to the reference are filtered out using our strategy, reducing false positive alignments. We also show that our method results in a negligible loss of endogenous data with no measurable impact on downstream population genetics analyses.
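The filtering step described above can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes Kraken2's default per-read output (tab-separated: `C`/`U` status, read ID, assigned taxonomy ID, read length, LCA mapping) produced without `--use-names`. The function names `reads_to_keep` and `filter_fastq` are hypothetical, and so is the retention policy (keep reads assigned to a user-supplied set of target taxon IDs, optionally keeping unclassified reads), since the abstract does not state which reads the workflow passes on to mapping.

```python
from typing import Iterable, Iterator, Set, Tuple


def reads_to_keep(
    kraken_lines: Iterable[str],
    keep_taxids: Set[str],
    keep_unclassified: bool = True,
) -> Set[str]:
    """Collect read IDs to retain from Kraken2 per-read output lines.

    Assumed (hypothetical) policy: retain reads assigned to a taxon ID in
    keep_taxids and, optionally, reads that Kraken2 left unclassified.
    Kraken2 reports the lowest common ancestor, so keep_taxids may need to
    include ancestral taxa of the target as well as the target itself.
    """
    kept: Set[str] = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "U":
            if keep_unclassified:
                kept.add(read_id)
        elif taxid in keep_taxids:
            kept.add(read_id)
    return kept


def filter_fastq(
    fastq_lines: Iterable[str], kept_ids: Set[str]
) -> Iterator[Tuple[str, str, str, str]]:
    """Yield four-line FASTQ records whose read ID is in kept_ids."""
    stripped = (line.rstrip("\n") for line in fastq_lines)
    # zip over the same iterator four times groups lines into records;
    # a trailing incomplete record is silently dropped.
    for header, seq, plus, qual in zip(stripped, stripped, stripped, stripped):
        read_id = header[1:].split()[0]  # drop '@' and any description
        if read_id in kept_ids:
            yield header, seq, plus, qual
```

The retained records would then be written out and passed to the read mapper, so that only pre-filtered reads are aligned to the reference genome. The choice of `keep_taxids` and of whether to keep unclassified reads depends on how the Kraken2 database was built (target taxa, likely contaminants, or both).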