Department of Computer Science, Faculty of EEMCS, University of Twente, 7522NB Enschede, The Netherlands.
Bioinformatics. 2024 Sep 1;40(Suppl 2):ii29-ii36. doi: 10.1093/bioinformatics/btae385.
Selective sweeps can successfully be distinguished from neutral genetic data using summary statistics and likelihood-based methods that analyze single nucleotide polymorphisms (SNPs). However, these methods are sensitive to confounding factors, such as severe population bottlenecks and old migration. With machine learning, and specifically convolutional neural networks (CNNs), new accurate classification models that are robust to confounding factors have recently been proposed. However, such methods are more computationally expensive than summary-statistic-based ones, rendering them impractical for processing large-scale genomic data. Moreover, SNP data are frequently preprocessed to improve classification accuracy, which further lengthens already long analysis times.
To address these limitations, we propose a 1D CNN-based model, dubbed FAST-NN, that does not require any preprocessing and uses only derived allele frequencies instead of summary statistics or raw SNP data, thereby yielding a sample-size-invariant, scalable solution. We evaluated several data fusion approaches to account for the variance of the density of genetic diversity across genomic regions (a selective sweep signature), and performed an extensive neural architecture search based on a state-of-the-art reference network architecture (SweepNet). The resulting model, FAST-NN, outperforms the reference architecture by up to 12% in inference accuracy across all the challenging evolutionary scenarios with confounding factors that were evaluated. Moreover, FAST-NN is between 30× and 259× faster on a single CPU core, and between 2.0× and 6.2× faster on a GPU, for sample sizes between 128 and 1000. Our work paves the way for the practical use of CNNs in large-scale selective sweep detection.
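To illustrate why an input built from derived allele frequencies is sample-size invariant, the following minimal sketch reduces a binary SNP matrix to one frequency per site. It is not the FAST-NN input pipeline: the 0/1 encoding (0 = ancestral, 1 = derived), the samples-by-sites layout, and the function name `derived_allele_frequencies` are assumptions made only for this example.

```python
import numpy as np


def derived_allele_frequencies(snp_matrix: np.ndarray) -> np.ndarray:
    """Return the per-site derived allele frequency.

    Assumes a (samples x sites) matrix where 0 marks the ancestral
    allele and 1 marks the derived allele (hypothetical encoding).
    """
    snp_matrix = np.asarray(snp_matrix)
    if snp_matrix.ndim != 2:
        raise ValueError("expected a 2D (samples x sites) matrix")
    # Averaging over the sample axis collapses it, leaving one value per site.
    return snp_matrix.mean(axis=0)


# Toy data: 4 samples, 4 segregating sites.
small = np.array([
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
])

# The same allele proportions observed in 1000 samples.
large = np.repeat(small, 250, axis=0)

daf_small = derived_allele_frequencies(small)
daf_large = derived_allele_frequencies(large)

print(daf_small.shape, daf_large.shape)
print(daf_small)
```

In this sketch the vector length depends only on the number of sites, so a 1D network consuming it would see the same input shape for 4 samples as for 1000, whereas a raw SNP matrix grows with every added sample.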