Prediction of enzyme function using an interpretable optimized ensemble learning framework.

Author information

Dhibar Saikat, Basak Sumon, Jana Biman

Affiliation

School of Chemical Sciences, Indian Association for the Cultivation of Science, Jadavpur, Kolkata-700032, India

Publication information

Chem Sci. 2025 Sep 1. doi: 10.1039/d5sc04513d.

DOI:10.1039/d5sc04513d

PMID:40951780

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12424445/

Abstract

Accurate prediction of enzyme function, particularly for newly discovered uncharacterized sequences, is immensely important for modern biological research. Recently, machine learning (ML) based methods have shown promise. However, such tools often suffer from complexity in feature extraction, interpretability, and generalization ability. In this study, we construct a dataset for enzyme functions and present an interpretable ML method, SOLVE (Soft-Voting Optimized Learning for Versatile Enzymes), that addresses these issues by using only combinations of tokenized subsequences from the protein's primary sequence for classification. SOLVE utilizes an ensemble learning framework integrating random forest (RF), light gradient boosting machine (LightGBM) and decision tree (DT) models with an optimized weighted strategy, which enhances prediction accuracy, distinguishes enzymes from non-enzymes, and predicts enzyme commission (EC) numbers for mono- and multi-functional enzymes. The focal loss penalty in SOLVE effectively mitigates class imbalance, refining functional annotation accuracy. Additionally, SOLVE provides interpretability through Shapley analyses, identifying functional motifs at catalytic and allosteric sites of enzymes. By leveraging only primary sequence data, SOLVE streamlines high-throughput enzyme function prediction for functionally uncharacterized sequences and outperforms existing tools across all evaluation metrics on independent datasets. With its high prediction accuracy and ability to identify functional regions, SOLVE can become a promising tool in different fields of biology and therapeutic drug design.
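The abstract names three ingredients without giving their details: tokenized subsequences as features, a weighted soft vote over several classifiers, and a focal loss penalty for class imbalance. The sketch below illustrates each idea in plain Python under stated assumptions. The subsequence length `k=3`, the per-model probabilities, the voting weights, and `gamma=2.0` are all made-up illustrative values, and the function names are hypothetical; none of them come from SOLVE itself. The focal loss is written in its commonly used form, -(1 - p)^gamma * log(p), which may differ from the variant the authors use.

```python
import math
from collections import Counter


def kmer_counts(sequence, k=3):
    """Count overlapping length-k subsequences (k-mers) in a protein sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def soft_vote(prob_vectors, weights):
    """Weighted average of per-model class-probability vectors."""
    total = sum(weights)
    n_classes = len(prob_vectors[0])
    return [
        sum(w * p[c] for w, p in zip(weights, prob_vectors)) / total
        for c in range(n_classes)
    ]


def focal_loss(p_true, gamma=2.0):
    """Focal loss for the probability a model assigns to the true class.

    The (1 - p_true) ** gamma factor shrinks the loss of confidently
    correct examples so that hard, often minority-class, examples dominate.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


# Toy sequence; the k-mer counts would serve as the feature vector.
counts = kmer_counts("MKVLAAGMKV", k=3)

# Assumed class probabilities (enzyme, non-enzyme) from three models.
rf, lgbm, dt = [0.7, 0.3], [0.6, 0.4], [0.2, 0.8]

# Assumed weights; an optimized scheme would tune these on validation data.
combined = soft_vote([rf, lgbm, dt], weights=[2.0, 2.0, 1.0])
predicted = max(range(len(combined)), key=combined.__getitem__)
```

With these toy numbers the two higher-weighted models outvote the third, so the combined vector favors class 0. Because the features are plain k-mer counts, an attribution method such as Shapley analysis can point back to specific subsequences, which is the kind of interpretability the abstract describes.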
