Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization.

Authors

Libbrecht Maxwell W, Bilmes Jeffrey A, Noble William Stafford

Affiliations

Department of Genome Sciences, University of Washington, Seattle, Washington.

Department of Electrical Engineering, University of Washington, Seattle, Washington.

Publication information

Proteins. 2018 Apr;86(4):454-466. doi: 10.1002/prot.25461. Epub 2018 Feb 1.

DOI:10.1002/prot.25461

PMID:29345009

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5835207/

Abstract

Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
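As a rough illustration of the kind of procedure the abstract refers to, the sketch below greedily maximizes a facility-location objective over a pairwise similarity matrix. The abstract names neither the objective functions nor the optimizer, so the facility-location function, the greedy algorithm, the helper names (`facility_location`, `greedy_select`), and the toy similarity values are all assumptions made for illustration, not the authors' implementation. For monotone submodular objectives such as this one, greedy selection is known to reach at least a (1 - 1/e) fraction of the optimal value, which is the usual sense in which such methods are called the best achievable in polynomial time.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], subset: Sequence[int]) -> float:
    """f(S) = sum over all items i of max_{j in S} sim[i][j]; f(empty) = 0.

    Assumes non-negative similarities (hypothetical input format).
    """
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Pick up to k representatives by repeatedly adding the item with the
    largest marginal gain in the facility-location objective."""
    n = len(sim)
    selected: List[int] = []
    # best_cover[i] = similarity of item i to its closest selected representative
    best_cover = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: how much each item's coverage improves.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Toy similarity matrix (made-up values): items 0-2 form one cluster of
# near-duplicates, items 3-4 form another.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.85, 0.1, 0.1],
    [0.8, 0.85, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
]
chosen = greedy_select(sim, 2)
print(chosen, facility_location(sim, chosen))
```

With a budget of two, the selection ends up with one representative from each cluster rather than two near-duplicates, which is the non-redundancy property the abstract describes. Unlike a threshold-based method, the subset size is set directly by `k` rather than indirectly through an identity cutoff.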
