Neuwald Andrew F
The University of Maryland, MD, USA.
Stat Appl Genet Mol Biol. 2011;10(1):Article 36. doi: 10.2202/1544-6115.1666. Epub 2011 Aug 4.
Certain residues have no known function yet are co-conserved across distantly related protein families and diverse organisms, suggesting that they perform critical roles associated with as-yet-unidentified molecular properties and mechanisms. This raises the question of how to obtain additional clues regarding these mysterious biochemical phenomena with a view to formulating experimentally testable hypotheses. One approach is to access the implicit biochemical information encoded within the vast amount of genomic sequence data now becoming available. Here, a new Gibbs sampling strategy is formulated and implemented that can partition hundreds of thousands of sequences within a major protein class into multiple, functionally divergent categories based on those pattern residues that best discriminate between categories. The sampler precisely defines the partition and pattern for each category by explicitly modeling unrelated, non-functional and related-yet-divergent proteins that would otherwise obscure the analysis. To aid biological interpretation, auxiliary routines can characterize pattern residues within available crystal structures and identify those structures most likely to shed light on the roles of pattern residues. This approach can be used to automatically define and annotate subgroup-specific conserved domain profiles based on statistically rigorous empirical criteria rather than on the subjective and labor-intensive process of manual curation. Incorporating such profiles into domain database search sites (such as the NCBI BLAST site) will provide biologists with previously inaccessible molecular information useful for hypothesis generation and experimental design. Analyses of P-loop GTPases and of AAA+ ATPases illustrate the sampler's ability to obtain such information.
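The core idea of partitioning aligned sequences by the residues that best discriminate between categories can be illustrated with a toy collapsed Gibbs sampler. The sketch below is not the sampler described in the abstract: it assumes a fixed number of categories, ungapped alignments of equal length, and a symmetric Dirichlet prior per column, and it omits the explicit modeling of unrelated, non-functional and related-yet-divergent proteins. The function names (`gibbs_partition`, `pattern_columns`), the pseudocount `alpha`, and the synthetic sequences are all hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def gibbs_partition(seqs, n_categories=2, sweeps=50, alpha=0.01, seed=0):
    """Toy collapsed Gibbs sampler assigning each aligned sequence to a category."""
    rng = random.Random(seed)
    n_cols = len(seqs[0])
    labels = [rng.randrange(n_categories) for _ in seqs]
    # counts[k][c] tallies the residues observed in column c of category k.
    counts = [[Counter() for _ in range(n_cols)] for _ in range(n_categories)]
    sizes = [0] * n_categories
    for seq, k in zip(seqs, labels):
        sizes[k] += 1
        for c, res in enumerate(seq):
            counts[k][c][res] += 1

    for _ in range(sweeps):
        for i, seq in enumerate(seqs):
            # Remove sequence i from its current category.
            k_old = labels[i]
            sizes[k_old] -= 1
            for c, res in enumerate(seq):
                counts[k_old][c][res] -= 1
            # Log predictive probability of the sequence under each category
            # (Dirichlet-multinomial per column, uniform prior over categories).
            log_p = []
            for k in range(n_categories):
                denom = sizes[k] + alpha * len(ALPHABET)
                log_p.append(sum(
                    math.log((counts[k][c][res] + alpha) / denom)
                    for c, res in enumerate(seq)
                ))
            # Sample a new category in proportion to those probabilities.
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            k_new = rng.choices(range(n_categories), weights=weights)[0]
            labels[i] = k_new
            sizes[k_new] += 1
            for c, res in enumerate(seq):
                counts[k_new][c][res] += 1
    return labels


def pattern_columns(seqs, labels):
    """Columns whose consensus residue differs between categories 0 and 1."""
    cols = []
    for c in range(len(seqs[0])):
        consensus = []
        for k in (0, 1):
            column = [s[c] for s, lab in zip(seqs, labels) if lab == k]
            consensus.append(Counter(column).most_common(1)[0][0] if column else None)
        if consensus[0] != consensus[1]:
            cols.append(c)
    return cols


# Synthetic data (an assumption for illustration): two subgroups share
# columns 0-3 and differ at columns 4-9.
seqs = ["GKSTDLVRAE"] * 20 + ["GKSTNIYQWH"] * 20
labels = gibbs_partition(seqs)
print(labels)
print(pattern_columns(seqs, labels))
```

On this clean synthetic input the two subgroups are expected to settle into different categories, with the discriminating columns reported as the pattern positions. Real protein families are far noisier, which is why the approach described above also models background and divergent sequences explicitly.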