Yılmaz Mehmet Alper, Ceylan Ahmet Arda, Kaynar Gun, Çiçek A Ercüment
Department of Computer Engineering, Bilkent University, Ankara 06800, Türkiye.
Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States.
Bioinformatics. 2025 Jul 1;41(Supplement_1):i285-i293. doi: 10.1093/bioinformatics/btaf244.
Copy number variants (CNVs) are pivotal in driving phenotypic variation that facilitates species adaptation. They are significant contributors to various disorders, making ancient genomes crucial for uncovering the genetic origins of disease susceptibility across populations. However, detecting CNVs in ancient DNA (aDNA) samples poses substantial challenges due to several factors: (i) aDNA is often highly degraded; (ii) contamination from microbial DNA and DNA from closely related species introduces additional noise into sequencing data; and finally, (iii) the typically low coverage of aDNA renders accurate CNV detection particularly difficult. Conventional CNV calling algorithms, which are optimized for high-coverage read-depth signals, underperform under such conditions.
To address these limitations, we introduce LYCEUM, the first machine learning-based CNV caller for aDNA. To overcome challenges related to data quality and scarcity, we employ a two-step training strategy. First, the model is pre-trained on whole genome sequencing data from the 1000 Genomes Project, teaching it CNV-calling capabilities similar to conventional methods. Next, the model is fine-tuned using high-confidence CNV calls derived from only a few existing high-coverage aDNA samples. During this stage, the model adapts to making CNV calls based on the downsampled read-depth signals of the same aDNA samples. LYCEUM achieves accurate detection of CNVs even in typically low-coverage ancient genomes. We also observe that the segmental deletion calls made by LYCEUM correlate with the demographic history of the samples and exhibit patterns of negative selection in line with natural selection.
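The fine-tuning stage described above pairs labels obtained at high coverage with read-depth signals that have been downsampled to mimic low coverage. The following is a minimal sketch of that idea only, not LYCEUM's actual pipeline: it assumes read depth is stored as per-window read counts and thins them by keeping each read independently with a fixed probability. The function names, the label strings, and the window representation are all hypothetical.

```python
import random


def downsample_read_depth(depth, keep_fraction, seed=0):
    """Thin per-window read counts (hypothetical helper).

    Each read is kept independently with probability `keep_fraction`,
    which is one simple way to simulate a lower-coverage sample.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    rng = random.Random(seed)
    return [
        sum(1 for _ in range(count) if rng.random() < keep_fraction)
        for count in depth
    ]


def make_finetuning_pairs(depth, labels, keep_fraction, seed=0):
    """Pair downsampled depth with labels called at full coverage.

    `labels` are assumed to come from a caller run on the original
    high-coverage data; they are left unchanged.
    """
    if len(depth) != len(labels):
        raise ValueError("depth and labels must have the same length")
    low_depth = downsample_read_depth(depth, keep_fraction, seed)
    return list(zip(low_depth, labels))


if __name__ == "__main__":
    # Toy per-window read counts and made-up labels, for illustration only.
    high_coverage_depth = [30, 31, 14, 15, 29, 60, 58]
    high_confidence_labels = ["NONE", "NONE", "DEL", "DEL", "NONE", "DUP", "DUP"]
    pairs = make_finetuning_pairs(
        high_coverage_depth, high_confidence_labels, keep_fraction=0.05
    )
    for low_depth, label in pairs:
        print(low_depth, label)
```

At a low keep fraction most windows retain only a handful of reads, so the depth differences that separate deletions from duplications become noisy. That is the regime in which a model trained this way would have to recover the high-coverage labels.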
LYCEUM is available at https://github.com/ciceklab/LYCEUM.