Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization.

Author information

Libbrecht Maxwell W, Bilmes Jeffrey A, Noble William Stafford

Affiliations

Department of Genome Sciences, University of Washington, Seattle, Washington.

Department of Electrical Engineering, University of Washington, Seattle, Washington.

Publication information

Proteins. 2018 Apr;86(4):454-466. doi: 10.1002/prot.25461. Epub 2018 Feb 1.

Abstract

Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
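As a rough illustration of the kind of procedure the abstract refers to, the sketch below runs greedy maximization of a facility-location objective over a pairwise similarity matrix. The abstract does not name its objective functions, so the facility-location function, the names `facility_location` and `greedy_select`, and the toy similarity values are all assumptions made here for illustration. For a monotone submodular objective under a size constraint, greedy selection is known to reach at least a (1 - 1/e) fraction of the optimal value, which is one common reading of a "best possible in polynomial time" guarantee.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], subset: List[int]) -> float:
    """Illustrative objective: each item is scored by its similarity to the
    closest selected representative; the empty set scores 0.
    Assumes all similarities are non-negative."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Pick up to k representatives, each time adding the item with the
    largest marginal gain in the facility-location objective."""
    n = len(sim)
    selected: List[int] = []
    # best_cover[i] = similarity of item i to its closest selected representative
    best_cover = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: total improvement in coverage.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Hypothetical similarities for five sequences forming two groups:
# items 0-2 resemble each other, as do items 3-4.
toy_sim = [
    [1.0, 0.8, 0.7, 0.0, 0.0],
    [0.8, 1.0, 0.8, 0.0, 0.0],
    [0.7, 0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.0, 0.6, 1.0],
]
chosen = greedy_select(toy_sim, 2)
print(chosen, facility_location(toy_sim, chosen))
```

On the toy matrix, the two selected items come from different groups, because a second pick from the first group adds almost nothing to the objective. A non-negative weighted sum of submodular functions is itself submodular, so a mixture objective could in principle be plugged into the same greedy loop.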
