Parton Daniel L, Grinaway Patrick B, Hanson Sonya M, Beauchamp Kyle A, Chodera John D
Computational Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, New York, United States of America.
Graduate Program in Physiology, Biophysics, and Systems Biology, Weill Cornell Medical College, New York, New York, United States of America.
PLoS Comput Biol. 2016 Jun 23;12(6):e1004728. doi: 10.1371/journal.pcbi.1004728. eCollection 2016 Jun.
The rapidly expanding body of available genomic and protein structural data provides a rich resource for understanding protein dynamics with biomolecular simulation. While computational infrastructure has grown rapidly, simulations on an omics scale are not yet widespread, primarily because software infrastructure to enable simulations at this scale has not kept pace. It should now be possible to study protein dynamics across entire (super)families, exploiting both available structural biology data and conformational similarities across homologous proteins. Here, we present a new tool for enabling high-throughput simulation in the genomics era. Ensembler takes any set of sequences, from a single sequence to an entire superfamily, and shepherds them through various stages of modeling and refinement to produce simulation-ready structures. This includes comparative modeling to all relevant PDB structures (which may span multiple conformational states of interest), reconstruction of missing loops, addition of missing atoms, culling of nearly identical structures, assignment of appropriate protonation states, solvation in explicit solvent, and refinement and filtering with molecular simulation to ensure stable simulation. The output of this pipeline is an ensemble of structures ready for subsequent molecular simulations using computer clusters, supercomputers, or distributed computing projects like Folding@home. Ensembler thus automates much of the time-consuming process of preparing protein models suitable for simulation, while allowing scalability up to entire superfamilies. A particular advantage of this approach can be found in the construction of kinetic models of conformational dynamics, such as Markov state models (MSMs), which benefit from a diverse array of initial configurations that span the accessible conformational states to aid sampling.
We demonstrate the power of this approach by constructing models for all catalytic domains in the human tyrosine kinase family, using all available kinase catalytic domain structures from any organism as structural templates. Ensembler is free and open source software licensed under the GNU General Public License (GPL) v2. It is compatible with Linux and OS X. The latest release can be installed via the conda package manager, and the latest source can be downloaded from https://github.com/choderalab/ensembler.
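One of the pipeline stages named above, the culling of nearly identical structures, can be illustrated with a minimal sketch. This is not Ensembler's own implementation: the function names `rmsd` and `cull_near_duplicates`, the toy coordinates, and the 0.5 cutoff are all hypothetical, and the models are assumed to be pre-aligned, so no superposition is performed.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points.

    Assumption: the two structures are already superposed; no fitting is done here.
    """
    if len(a) != len(b):
        raise ValueError("structures must have the same number of atoms")
    squared = sum((u - v) ** 2 for p, q in zip(a, b) for u, v in zip(p, q))
    return math.sqrt(squared / len(a))


def cull_near_duplicates(models, cutoff):
    """Greedy culling: keep a model only if its RMSD to every
    previously kept model exceeds `cutoff` (a hypothetical threshold)."""
    kept = []
    for name, coords in models:
        if all(rmsd(coords, kept_coords) > cutoff for _, kept_coords in kept):
            kept.append((name, coords))
    return [name for name, _ in kept]


# Toy two-atom "models" (hypothetical data, arbitrary length units).
models = [
    ("model_A", [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),
    ("model_B", [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]),  # nearly identical to model_A
    ("model_C", [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0)]),  # clearly distinct
]
unique = cull_near_duplicates(models, cutoff=0.5)
print(unique)
```

In this toy case `model_B` lies within the cutoff of `model_A` and is dropped, while `model_C` is retained. The result of a greedy pass depends on input order; a production workflow would more likely use an established clustering routine on superposed structures.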