Winder Johanna C, Poulton Simon, Wu Taoyang, Mock Thomas, van Oosterhout Cock
School of Environmental Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK.
School of Biological Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK.
BMC Biol. 2025 Aug 11;23(1):252. doi: 10.1186/s12915-025-02361-1.
Deep learning has emerged as a powerful tool for analysing biological data, including large metagenomic datasets. However, its application remains limited by high computational costs, model complexity, and the difficulty of extracting biological insights from artificial neural networks (ANNs). In this study, we applied a transfer learning approach, using the ESM-2 protein structure prediction model and our own smaller ANN, to classify proteins containing the domain of unknown function 3494 (DUF3494) by their source environments. DUF3494 is found in a diverse group of putative ice-binding and substrate-binding proteins from prokaryotic and eukaryotic microorganisms across a range of environments. These proteins present a compelling test case for exploring the balance between prediction accuracy and interpretability in sequence classification.
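The transfer learning setup described above (a frozen pretrained model supplying features, with a small trainable classifier on top) can be illustrated with a minimal sketch. This is not the authors' implementation: `frozen_embed` is a hypothetical stand-in that returns amino-acid composition rather than ESM-2 embeddings, the classifier head is plain softmax regression rather than the study's ANN, and the sequences and the two class labels are invented toy data.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frozen_embed(seq):
    """Hypothetical stand-in for a frozen pretrained embedder (NOT ESM-2).

    Returns the amino-acid composition of the sequence as a 20-dim vector.
    It is never updated during training, mimicking a frozen backbone.
    """
    n = max(len(seq), 1)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(W, b, x):
    return [sum(w * xi for w, xi in zip(W[k], x)) + b[k] for k in range(len(b))]


def train_head(X, y, n_classes, epochs=200, lr=0.5):
    """Train only the small classifier head (softmax regression) with SGD."""
    dim = len(X[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, label in zip(X, y):
            probs = softmax(class_scores(W, b, x))
            for k in range(n_classes):
                grad = probs[k] - (1.0 if k == label else 0.0)
                b[k] -= lr * grad
                for j in range(dim):
                    W[k][j] -= lr * grad * x[j]
    return W, b


def predict(W, b, x):
    scores = class_scores(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Invented toy sequences for two made-up environment classes (0 and 1).
sequences = [
    "ATTATATGTA", "TATTAATTSA", "ATATTTAAGT", "TTAATATATV",
    "KEKKEEKGEK", "EKKEKEKKSE", "KEEKKEKEGK", "EKEKKKEEKV",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X = [frozen_embed(s) for s in sequences]   # features from the frozen model
W, b = train_head(X, labels, n_classes=2)  # only the head is trained
predictions = [predict(W, b, x) for x in X]
```

In a real pipeline the composition vector would be replaced by embeddings from the pretrained model, and performance would be assessed on held-out sequences rather than on the training set.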
Our ANN analysed 50,669 DUF3494 sequences from publicly available metagenomes and successfully classified a large proportion of them by source environment (polar marine, glacier ice, frozen sediment, rock, subsurface). We identified environment-specific features that appear to drive classification. Our best-performing ANN correctly classified between 75.9% and 97.8% of sequences. To enhance the biological interpretability of these predictions, we compared this model with a genetic algorithm (GA), which had lower predictive ability but provided transparent classification rules and predictors. Further in silico mutagenesis of key residues uncovered a vertically aligned column of amino acids on the b-face of the protein that was important for environmental differentiation, suggesting that both methods captured distinct evolutionary and ecological aspects of the sequences. Feature importance analysis indicated that steric and electronic properties of the protein were associated with predictive ability.
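The idea behind in silico mutagenesis can be sketched as a substitution scan: each residue is replaced in turn by every other amino acid, and the change in the model's output is recorded per position. The sketch below is illustrative only. `toy_score` is a hypothetical stand-in for a trained classifier's score for one environment, constructed so that only two positions matter, and the sequence is invented.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq):
    """Hypothetical stand-in for a trained classifier's score for one class.

    By construction, only positions 2 and 5 (0-based) influence the score.
    """
    score = 0.2
    if seq[2] == "T":
        score += 0.5
    if seq[5] == "N":
        score += 0.3
    return score


def mutagenesis_scan(seq, score_fn):
    """Return the mean absolute score change per position over all
    19 single-residue substitutions at that position."""
    baseline = score_fn(seq)
    effects = []
    for i, wild_type in enumerate(seq):
        deltas = [
            abs(score_fn(seq[:i] + aa + seq[i + 1:]) - baseline)
            for aa in AMINO_ACIDS
            if aa != wild_type
        ]
        effects.append(sum(deltas) / len(deltas))
    return effects


seq = "GATAGNVA"  # invented example sequence
effects = mutagenesis_scan(seq, toy_score)

# Positions sorted from most to least influential on the score.
ranked = sorted(range(len(seq)), key=lambda i: effects[i], reverse=True)
```

Here the scan recovers the two positions that the toy scorer depends on. With a real model, positions with large effects could then be mapped onto the protein structure to check whether they cluster spatially.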
Our findings highlight the utility of deep learning for classification of diverse biological sequences and provide a framework for combining methods to improve model interpretability and ecological insights.