Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA.
Mol Biol Evol. 2023 Oct 4;40(10). doi: 10.1093/molbev/msad216.
Inferences of adaptive events are important for learning about traits, such as human digestion of lactose after infancy and the rapid spread of viral variants. Early efforts toward identifying footprints of natural selection from genomic data involved the development of summary-statistic and likelihood methods. However, such techniques are grounded in simple patterns or theoretical models that limit the complexity of the settings they can explore. Owing to the renaissance in artificial intelligence, machine learning methods have taken center stage in recent efforts to detect natural selection, with strategies such as convolutional neural networks applied to images of haplotypes. Yet limitations of such techniques include the estimation of large numbers of model parameters under nonconvex settings and feature identification without regard to location within an image. An alternative approach is to use tensor decomposition to extract features from multidimensional data while preserving the latent structure of the data, and to feed these features to machine learning models. Here, we adopt this framework and present a novel approach termed T-REx, which extracts features from images of haplotypes across sampled individuals using tensor decomposition, and then makes predictions from these features using classical machine learning methods. As a proof of concept, we explore the performance of T-REx on simulated neutral and selective sweep scenarios and find that it has high power and accuracy to discriminate sweeps from neutrality, is robust to common technical hurdles, and allows easy visualization of feature importance. Therefore, T-REx is a powerful addition to the toolkit for detecting adaptive processes from genomic data.
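The two-stage idea described above (decompose a stack of haplotype images, then classify the extracted features) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the T-REx implementation: the abstract does not say which decomposition or classifier is used, so the sketch uses a rank-3 CP decomposition fitted by alternating least squares and a least-squares linear classifier as stand-ins. The toy images are random binary matrices in which the hypothetical "sweep" class simply has fewer derived alleles in the central columns; they are not population-genetic simulations. All function names are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize a 3-way tensor so that rows index the chosen mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def khatri_rao(a, b):
    """Column-wise Kronecker product of a (I x R) and b (J x R), shape (I*J, R)."""
    return np.einsum("ir,jr->ijr", a, b).reshape(-1, a.shape[1])

def cp_als(tensor, rank, n_iter=50, seed=0):
    """Rank-`rank` CP decomposition of a 3-way tensor by alternating least squares."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in tensor.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            # Least-squares update of one factor with the other two held fixed.
            solution, *_ = np.linalg.lstsq(kr, unfold(tensor, mode).T, rcond=None)
            factors[mode] = solution.T
    return factors

def project_samples(images, row_factor, col_factor):
    """Per-image coefficients on the fixed row/column components (the features)."""
    kr = khatri_rao(row_factor, col_factor)
    flat = images.reshape(images.shape[0], -1)  # same ordering as unfold(images, 0)
    coeffs, *_ = np.linalg.lstsq(kr, flat.T, rcond=None)
    return coeffs.T

def simulate_toy_images(n, sweep, rng, n_hap=20, n_snp=30):
    """Toy stand-in for haplotype images: 0/1 matrices, NOT a real simulation."""
    p = np.full(n_snp, 0.5)
    if sweep:
        p[10:20] = 0.1  # assumed signature: reduced diversity near the center
    return (rng.random((n, n_hap, n_snp)) < p).astype(float)

def make_set(n_per_class, rng):
    x = np.concatenate([simulate_toy_images(n_per_class, False, rng),
                        simulate_toy_images(n_per_class, True, rng)])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return x, y

def add_intercept(features):
    return np.column_stack([np.ones(len(features)), features])

rng = np.random.default_rng(1)
x_train, y_train = make_set(30, rng)
x_test, y_test = make_set(20, rng)

# Stage 1: learn row/column components on training images, then extract features.
_, row_f, col_f = cp_als(x_train, rank=3)
f_train = project_samples(x_train, row_f, col_f)
f_test = project_samples(x_test, row_f, col_f)

# Stage 2: a simple linear classifier on the extracted features.
weights, *_ = np.linalg.lstsq(add_intercept(f_train), y_train, rcond=None)
pred = (add_intercept(f_test) @ weights > 0.5).astype(float)
accuracy = float((pred == y_test).mean())
print(f"toy test accuracy: {accuracy:.2f}")
```

Fitting the components on training images only and then projecting held-out images onto them keeps the feature extraction separate from the test data. The column-mode factor (`col_f` here) is the kind of quantity one could plot to see which image positions drive each feature, which is one plausible reading of the "easy visualization of feature importance" mentioned above.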