Identifying representative sequences of protein families using submodular optimization.

Authors

Nguyen Ha, Nguyen Hung, Nguyen Phuong, Luu Anh N, Cantu David C, Nguyen Tin

Affiliations

Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, 36849, USA.

Kenneth P. Dietrich School of Arts & Sciences, University of Pittsburgh, Pittsburgh, PA, 15260, USA.

Publication information

Sci Rep. 2025 Jan 7;15(1):1069. doi: 10.1038/s41598-025-85165-1.

DOI:10.1038/s41598-025-85165-1

PMID:39774134

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11707169/

Abstract

Identifying representative sequences for groups of functionally similar proteins and enzymes poses significant computational challenges. In this study, we applied submodular optimization, a method effective in data summarization, to select representative sequences for thioesterase enzyme families. We introduced and validated two algorithms, Greedy and Bidirectional Greedy, using curated protein sequence data from the ThYme (Thioester-active enzYmes) database. Both algorithms generated sequence subsets that preserved completeness (inclusion of all known family sequences) and specificity (accurate family representation). The Greedy algorithm outperformed the Bidirectional Greedy algorithm and other methods, particularly in reducing redundancy. Our study offers an efficient approach for identifying representative protein sequences within families that have significant sequence similarity, likely to deliver results close to theoretical optima in polynomial time, with the potential to improve the selection and optimization of representative sequences in protein databases.
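To make the greedy idea concrete, the following is a minimal sketch of greedy selection under a monotone submodular objective. The abstract does not state which objective the authors optimize, so the facility-location function used here, the function names, and the toy similarity matrix are all illustrative assumptions rather than the paper's method. For monotone submodular objectives with a cardinality limit, this style of greedy selection is known to reach at least a (1 - 1/e) fraction of the optimal value, which is consistent with the abstract's remark about near-optimal results in polynomial time.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], chosen: List[int]) -> float:
    """Assumed objective: each item is scored by its best match among the chosen set."""
    if not chosen:
        return 0.0
    return sum(max(row[j] for j in chosen) for row in sim)


def greedy_representatives(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Pick up to k indices, each round adding the one with the largest marginal gain."""
    chosen: List[int] = []
    current = 0.0
    for _ in range(min(k, len(sim))):
        best_gain, best_j = 0.0, None
        for j in range(len(sim)):
            if j in chosen:
                continue
            gain = facility_location(sim, chosen + [j]) - current
            if gain > best_gain:
                best_gain, best_j = gain, j
        if best_j is None:  # no remaining candidate improves the objective
            break
        chosen.append(best_j)
        current += best_gain
    return chosen


# Hypothetical pairwise similarities: items 0-2 form one group, items 3-4 another.
SIM = [
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.2],
    [0.8, 0.9, 1.0, 0.2, 0.1],
    [0.1, 0.1, 0.2, 1.0, 0.9],
    [0.1, 0.2, 0.1, 0.9, 1.0],
]

reps = greedy_representatives(SIM, 2)
print(reps)
```

On this toy input the first pick is item 1, the most central member of the larger group, and the second pick comes from the other group, because a second member of the first group would add little that item 1 does not already cover. That diminishing return is the property that lets a submodular objective penalize redundancy. In a real setting the similarity matrix would come from pairwise sequence comparison, which this sketch does not attempt.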
