Strategies for high-throughput comparative modeling: applications to leverage analysis in structural genomics and protein family organization.

Author information

Mirkovic Nebojsa, Li Zhaohui, Parnassa Andrew, Murray Diana

Affiliation

Department of Microbiology and Immunology, Weill Medical College of Cornell University, New York, New York 10021, USA.

Publication information

Proteins. 2007 Mar 1;66(4):766-77. doi: 10.1002/prot.21191.

DOI:10.1002/prot.21191

PMID:17154423

Abstract

The technological breakthroughs in structural genomics were designed to facilitate the solution of a sufficient number of structures, so that as many protein sequences as possible can be structurally characterized with the aid of comparative modeling. The leverage of a solved structure is the number and quality of the models that can be produced using the structure as a template for modeling, and it may be viewed as the "currency" with which the success of a structural genomics endeavor can be measured. Moreover, the models obtained in this way should be valuable to all biologists. To this end, at the Northeast Structural Genomics Consortium (NESG), a modular computational pipeline for automated high-throughput leverage analysis was devised and used to assess the leverage of the 186 unique NESG structures solved during the first phase of the Protein Structure Initiative (January 2000 to July 2005). Here, the results of this analysis are presented. The number of sequences in the nonredundant protein sequence database covered by quality models produced by the pipeline is approximately 39,000, so that the average leverage is approximately 210 models per structure. Interestingly, only 7,900 of these models fulfill the stringent modeling criterion of being at least 30% sequence-identical to the corresponding NESG structures. This study shows how high-throughput modeling increases the efficiency of structure determination efforts by providing enhanced coverage of protein structure space. In addition, the approach is useful in refining the boundaries of structural domains within larger protein sequences, subclassifying sequence-diverse protein families, and defining structure-based strategies specific to a particular family.
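The leverage figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the NESG pipeline: the function name `average_leverage` and the constant names are assumptions introduced here, and the inputs are the rounded counts from the abstract (approximately 39,000 modeled sequences, 7,900 of them at 30% sequence identity or higher, and 186 structures).

```python
# Illustrative arithmetic only; names are hypothetical, not from the paper.
# Inputs are the approximate counts reported in the abstract.

def average_leverage(n_models: int, n_structures: int) -> float:
    """Return the mean number of quality models per solved structure."""
    if n_structures <= 0:
        raise ValueError("n_structures must be positive")
    return n_models / n_structures


N_STRUCTURES = 186         # unique NESG structures, PSI phase 1
N_MODELS = 39_000          # sequences covered by quality models (approximate)
N_MODELS_GE_30PCT = 7_900  # models with >= 30% identity to the template

avg_all = average_leverage(N_MODELS, N_STRUCTURES)
avg_strict = average_leverage(N_MODELS_GE_30PCT, N_STRUCTURES)
strict_share = N_MODELS_GE_30PCT / N_MODELS

print(f"Average leverage, all quality models: {avg_all:.0f} per structure")
print(f"Average leverage, >=30% identity only: {avg_strict:.0f} per structure")
print(f"Share of models at >=30% identity: {strict_share:.1%}")
```

With these rounded inputs the overall average comes out at about 210 models per structure, matching the abstract, while roughly one model in five meets the 30% identity criterion.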
