Libbrecht Maxwell W, Bilmes Jeffrey A, Noble William Stafford
Department of Genome Sciences, University of Washington, Seattle, Washington.
Department of Electrical Engineering, University of Washington, Seattle, Washington.
Proteins. 2018 Apr;86(4):454-466. doi: 10.1002/prot.25461. Epub 2018 Feb 1.
Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
选择非冗余的代表性序列子集是许多生物信息学工作流程中的常见步骤,例如为序列和结构模型创建非冗余训练集,或从宏基因组学数据中选择“操作分类单元”。以前用于此任务的方法,如CD-HIT、PISCES和UCLUST,应用的是一种基于启发式阈值的算法,没有理论保证。我们提出了一种基于次模优化的新方法。次模优化是连续凸优化的离散类似物,已在其他代表性集选择问题上取得了巨大成功。我们证明,以蛋白质结构域结构的SCOPe库作为金标准,次模优化方法产生的代表性蛋白质序列子集比现有方法选择的子集具有更大的结构多样性。在这种情况下,次模优化始终能产生蛋白质序列子集,这些子集包含的SCOPe结构域家族比竞争方法选择的相同大小的子集更多。我们还展示了优化框架如何使我们能够设计一个混合目标函数,该函数对大、小代表性集都表现良好。我们描述的框架在多项式时间内是最优的(在某些假设下),并且它灵活直观,因为它应用了一套通用方法来优化各种目标函数中的一个。