Arnab Sandipan Paul, Dos Santos Andre Luiz Campelo, Fumagalli Matteo, DeGiorgio Michael
Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA.
School of Biological and Behavioural Sciences, Queen Mary University of London, London, UK.
bioRxiv. 2025 Mar 6:2025.03.05.641710. doi: 10.1101/2025.03.05.641710.
Natural selection leaves detectable patterns of altered spatial diversity within genomes, and identifying the affected regions is crucial for understanding how species evolve. Recently, machine learning approaches applied to raw population genomic data have been developed to uncover these adaptive signatures. Convolutional neural networks (CNNs) are particularly effective for this task, as they handle large data arrays while preserving correlations among elements. However, shallow CNNs may miss complex patterns because of their limited capacity, while deep CNNs can capture these patterns but require extensive training data and computational power. Transfer learning addresses these challenges by using a deep CNN pre-trained on a large dataset as a feature extractor for downstream classification and evolutionary parameter prediction. This approach reduces both the amount of training data that must be generated and the computational cost, while maintaining high performance. In this study, we developed a tool that uses transfer learning to enhance detection of adaptive genomic regions from image representations of multilocus variation. We evaluated the tool across various genetic, demographic, and adaptive settings, as well as on unphased data and in the presence of other confounding factors. It showed improved detection of adaptive regions compared to recent methods that use similar data representations. We further explored model interpretability through class activation maps and adapted the tool to infer selection parameters for identified adaptive candidates. Applied to whole-genome haplotype data from European and African populations, it recapitulated known sweep candidates and identified novel cancer-associated and other disease-associated genes as potential sweeps.
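The transfer-learning idea described in the abstract (a frozen, pre-trained extractor feeding a small trainable classifier) can be illustrated with a minimal sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the three hand-written 3x3 filters stand in for the weights of a real pre-trained deep CNN, the toy "sweep" images simply remove all variation in the central SNP columns, and all function names are hypothetical. Only the logistic-regression head is trained; the filters are never updated.

```python
import math
import random

# Hypothetical stand-ins for pre-trained convolutional filters. In real transfer
# learning these weights would come from a deep CNN trained on a large dataset.
PRETRAINED_FILTERS = [
    [[1 / 9] * 3 for _ in range(3)],       # local density of derived alleles
    [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],  # change along the SNP axis
    [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],  # change along the haplotype axis
]

def extract_features(image):
    """Frozen extractor: 3x3 valid convolution, ReLU, global average pooling."""
    rows, cols = len(image) - 2, len(image[0]) - 2
    features = []
    for kernel in PRETRAINED_FILTERS:
        total = 0.0
        for r in range(rows):
            for c in range(cols):
                act = sum(kernel[i][j] * image[r + i][c + j]
                          for i in range(3) for j in range(3))
                total += max(act, 0.0)
        features.append(total / (rows * cols))
    return features

def simulate_image(rng, sweep, n_hap=16, n_snp=32):
    """Toy haplotype-by-SNP image (assumption: a sweep erases central variation)."""
    image = []
    for _ in range(n_hap):
        row = []
        for s in range(n_snp):
            in_core = sweep and n_snp // 4 <= s < 3 * n_snp // 4
            row.append(0 if in_core else int(rng.random() < 0.3))
        image.append(row)
    return image

def make_dataset(rng, n_per_class):
    """Return (features, labels) with label 0 = neutral and 1 = sweep."""
    data = [(simulate_image(rng, sweep), int(sweep))
            for sweep in (False, True) for _ in range(n_per_class)]
    return [extract_features(img) for img, _ in data], [y for _, y in data]

def fit_scaler(X):
    """Per-feature mean and standard deviation, estimated on the training set."""
    n, d = len(X), len(X[0])
    means = [sum(x[j] for x in X) / n for j in range(d)]
    stds = [math.sqrt(sum((x[j] - means[j]) ** 2 for x in X) / n) or 1.0
            for j in range(d)]
    return means, stds

def scale(X, means, stds):
    return [[(x[j] - means[j]) / stds[j] for j in range(len(x))] for x in X]

def predict(x, w, b):
    """Sweep probability from the logistic-regression head."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(X, y, w, b):
    eps = 1e-12
    total = 0.0
    for x, yi in zip(X, y):
        p = predict(x, w, b)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)

def train_head(X, y, steps=300, lr=0.5):
    """Full-batch gradient descent on the head only; the extractor stays frozen."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        errors = [predict(x, w, b) - yi for x, yi in zip(X, y)]
        for j in range(len(w)):
            w[j] -= lr * sum(e * x[j] for e, x in zip(errors, X)) / len(X)
        b -= lr * sum(errors) / len(X)
    return w, b

rng = random.Random(0)
X_train, y_train = make_dataset(rng, 20)
X_test, y_test = make_dataset(rng, 10)
means, stds = fit_scaler(X_train)
X_train, X_test = scale(X_train, means, stds), scale(X_test, means, stds)

initial_loss = log_loss(X_train, y_train, [0.0] * len(X_train[0]), 0.0)
w, b = train_head(X_train, y_train)
final_loss = log_loss(X_train, y_train, w, b)
test_accuracy = sum((predict(x, w, b) > 0.5) == bool(yi)
                    for x, yi in zip(X_test, y_test)) / len(y_test)
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}, test accuracy {test_accuracy:.2f}")
```

Because only the few head parameters are fitted, a small number of labeled images is enough here, which mirrors the reduced training-data requirement the abstract attributes to transfer learning. A realistic setup would replace `PRETRAINED_FILTERS` with the convolutional base of an actual pre-trained network and the toy simulator with population-genetic simulations.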