Dhibar Saikat, Basak Sumon, Jana Biman
School of Chemical Sciences, Indian Association for the Cultivation of Science, Jadavpur, Kolkata-700032, India
Chem Sci. 2025 Sep 1. doi: 10.1039/d5sc04513d.
Accurate prediction of enzyme function, particularly for newly discovered uncharacterized sequences, is immensely important for modern biological research. Recently, machine learning (ML) based methods have shown promise. However, such tools often rely on complex feature extraction and offer limited interpretability and generalization ability. In this study, we construct a dataset for enzyme functions and present an interpretable ML method, SOLVE (Soft-Voting Optimized Learning for Versatile Enzymes), that addresses these issues by using only combinations of tokenized subsequences from the protein's primary sequence for classification. SOLVE utilizes an ensemble learning framework integrating random forest (RF), light gradient boosting machine (LightGBM) and decision tree (DT) models with an optimized weighted strategy, which enhances prediction accuracy, distinguishes enzymes from non-enzymes, and predicts enzyme commission (EC) numbers for mono- and multi-functional enzymes. The focal loss penalty in SOLVE effectively mitigates class imbalance, refining functional annotation accuracy. Additionally, SOLVE provides interpretability through Shapley analyses, identifying functional motifs at catalytic and allosteric sites of enzymes. By leveraging only primary sequence data, SOLVE streamlines high-throughput enzyme function prediction for functionally uncharacterized sequences and outperforms existing tools across all evaluation metrics on independent datasets. With its high prediction accuracy and ability to identify functional regions, SOLVE can become a promising tool in different fields of biology and therapeutic drug design.
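The three ingredients named in the abstract (tokenized subsequences, weighted soft voting, and a focal loss penalty) can be illustrated with a minimal Python sketch. This is not the SOLVE implementation. The subsequence length `k=3`, the function names, the model weights, and the per-model probabilities below are all hypothetical values chosen for illustration, and the focal loss is written in its standard form, which the abstract does not confirm as the exact form used.

```python
from collections import Counter
import math


def kmer_counts(sequence, k=3):
    """Count overlapping length-k subsequences (tokens) of a protein sequence.

    k=3 is an assumed value; the abstract does not state the token length.
    """
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def soft_vote(prob_vectors, weights):
    """Weighted soft voting: weighted average of per-model class probabilities."""
    total = sum(weights)
    n_classes = len(prob_vectors[0])
    return [
        sum(w * p[c] for w, p in zip(weights, prob_vectors)) / total
        for c in range(n_classes)
    ]


def focal_loss(p_true, gamma=2.0):
    """Standard focal loss for the probability assigned to the true class.

    Confident correct predictions (p_true near 1) are down-weighted by
    (1 - p_true) ** gamma, so hard or rare examples dominate the loss.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


# Hypothetical class probabilities from three models for one sequence.
rf_probs = [0.6, 0.3, 0.1]    # stand-in for a random forest
lgbm_probs = [0.5, 0.4, 0.1]  # stand-in for LightGBM
dt_probs = [0.0, 1.0, 0.0]    # stand-in for a single decision tree

# Hypothetical weights; an optimized weighting would be tuned on validation data.
combined = soft_vote([rf_probs, lgbm_probs, dt_probs], weights=[2.0, 2.0, 1.0])
predicted_class = max(range(len(combined)), key=combined.__getitem__)

tokens = kmer_counts("MKVMKV", k=3)
```

With these made-up numbers the single decision tree's confident vote tips the ensemble toward the second class even though both other models prefer the first, which shows why the choice of weights matters in a soft-voting scheme.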