Neuwald Andrew F
Institute for Genome Sciences and Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, Baltimore, Maryland.
J Comput Biol. 2014 Mar;21(3):269-86. doi: 10.1089/cmb.2013.0099. Epub 2014 Feb 4.
The process of identifying and modeling functionally divergent subgroups for a specific protein domain class and arranging these subgroups hierarchically has, thus far, largely been done via manual curation. How to accomplish this automatically and optimally is an unsolved statistical and algorithmic problem that is addressed here via Markov chain Monte Carlo sampling. Taking as input a (typically very large) multiple-sequence alignment, the sampler creates and optimizes a hierarchy by adding and deleting leaf nodes, by moving nodes and subtrees up and down the hierarchy, by inserting or deleting internal nodes, and by redefining the sequences and conserved patterns associated with each node. All such operations are based on a probability distribution that models the conserved and divergent patterns defining each subgroup. When we view these patterns as sequence determinants of protein function, each node or subtree in such a hierarchy corresponds to a subgroup of sequences with similar biological properties. The sampler can be applied either de novo or to an existing hierarchy. When applied to 60 protein domains from multiple starting points in this way, it converged on similar solutions with nearly identical log-likelihood ratio scores, suggesting that it typically finds the optimal peak in the posterior probability distribution. Similarities and differences between independently generated, nearly optimal hierarchies for a given domain help distinguish robust from statistically uncertain features. Thus, a future application of the sampler is to provide confidence measures for various features of a domain hierarchy.
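As a rough illustration of the sampling idea only, and not the author's model or software, the sketch below applies a Metropolis acceptance rule to a flat partition of aligned sequences into subgroups. Everything in it is an assumption made for the example: the names `score` and `sample`, the toy log-score (a per-column maximum-likelihood term minus a fixed `penalty` per subgroup), and the restriction to a single level. Assigning a sequence to an unused label stands in for adding a leaf node, and emptying a label stands in for deleting one; moving subtrees and inserting or deleting internal nodes are not modeled.

```python
import math
import random
from collections import Counter


def score(alignment, labels, penalty=2.0):
    """Toy log-score (an assumption, not the published likelihood):
    per-group, per-column maximum-likelihood log-likelihood,
    minus a fixed penalty for every non-empty group."""
    groups = {}
    for seq, label in zip(alignment, labels):
        groups.setdefault(label, []).append(seq)
    total = 0.0
    for members in groups.values():
        size = len(members)
        for column in zip(*members):  # iterate over alignment columns
            for count in Counter(column).values():
                total += count * math.log(count / size)
        total -= penalty
    return total


def sample(alignment, steps=5000, seed=0, penalty=2.0):
    """Metropolis sampling over group labels; returns the best state seen."""
    rng = random.Random(seed)
    n = len(alignment)
    labels = [0] * n  # start with every sequence in one group
    current = score(alignment, labels, penalty)
    best_labels, best = list(labels), current
    for _ in range(steps):
        # Symmetric proposal: one random sequence, one random label out of n.
        i = rng.randrange(n)
        new_label = rng.randrange(n)
        old_label = labels[i]
        if new_label == old_label:
            continue
        labels[i] = new_label
        proposed = score(alignment, labels, penalty)
        # Accept uphill moves always, downhill moves with prob exp(delta).
        if proposed >= current or rng.random() < math.exp(proposed - current):
            current = proposed
            if current > best:
                best_labels, best = list(labels), current
        else:
            labels[i] = old_label  # reject: restore the previous label
    return best_labels, best


if __name__ == "__main__":
    toy_alignment = ["AAAA"] * 4 + ["CCCC"] * 4
    found_labels, found_score = sample(toy_alignment)
    print(found_labels, found_score)
```

Running the sampler from several seeds and comparing the best scores it reaches loosely mirrors the convergence check described in the abstract, although this toy score recomputes everything at each step and would not scale to very large alignments.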