Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization.

Authors

Libbrecht Maxwell W, Bilmes Jeffrey A, Noble William Stafford

Affiliations

Department of Genome Sciences, University of Washington, Seattle, Washington.

Department of Electrical Engineering, University of Washington, Seattle, Washington.

Publication information

Proteins. 2018 Apr;86(4):454-466. doi: 10.1002/prot.25461. Epub 2018 Feb 1.

Abstract

Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
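The selection procedure can be illustrated with a minimal sketch. The abstract does not name the objective functions used, so this example assumes a facility-location objective, a standard monotone submodular function, maximized by the classic greedy algorithm, which achieves a (1 - 1/e) approximation for monotone submodular functions under a cardinality constraint. The similarity matrix `sim` and the function names are hypothetical and are not taken from the paper; similarities are assumed to be non-negative.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def facility_location(sim: Matrix, subset: List[int]) -> float:
    """f(S) = sum over all items i of the best similarity between i and S.

    Assumption: sim[i][j] >= 0 is a pairwise sequence similarity.
    """
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim: Matrix, k: int) -> List[int]:
    """Greedily pick up to k representatives by largest marginal gain."""
    n = len(sim)
    selected: List[int] = []
    # best_cover[i] = similarity of item i to its closest chosen representative
    best_cover = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: how much each item's coverage improves
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Toy data (hypothetical): items 0-2 form one similar group, items 3-4 another.
TOY_SIM = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.85, 0.1, 0.1],
    [0.8, 0.85, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
]

if __name__ == "__main__":
    reps = greedy_select(TOY_SIM, 2)
    print(reps, facility_location(TOY_SIM, reps))
```

With two representatives requested, the greedy procedure picks one item from each group rather than two near-duplicates, because a second item from an already covered group adds little marginal gain. A threshold-based method, by contrast, fixes a similarity cutoff up front instead of a target subset size.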
