Computational Science and Engineering Division, Oak Ridge National Laboratory, One Bethel Valley Road, MS6085, Oak Ridge, TN, USA.
BMC Bioinformatics. 2018 Dec 21;19(Suppl 18):484. doi: 10.1186/s12859-018-2507-5.
We examine the problem of clustering biomolecular simulations using deep learning techniques. Because biomolecular simulation datasets are inherently high-dimensional, it is often necessary to build low-dimensional representations from which quantitative insights can be extracted into the atomistic mechanisms that underlie complex biological processes.
We use a convolutional variational autoencoder (CVAE) to learn low-dimensional, biophysically relevant latent features from long time-scale protein folding simulations in an unsupervised manner. We demonstrate our approach on three model protein folding systems, namely Fs-peptide (14 μs aggregate sampling), villin headpiece (single trajectory of 125 μs) and β-β-α (BBA) protein (223 + 102 μs sampling across two independent trajectories). In these systems, we show that the latent features learned by the CVAE correspond to distinct conformational substates along the protein folding pathways. The CVAE model predicts, on average, nearly 89% of all contacts within the folding trajectories correctly, while being able to extract folded, unfolded and potentially misfolded states in an unsupervised manner. Further, the latent features of protein folding learned by the CVAE model can be applied to other independent trajectories, which makes the model particularly attractive for identifying intrinsic features of conformational substates that share similar structural features.
Together, we show that the CVAE model can quantitatively describe complex biophysical processes such as protein folding.
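To make the "contacts predicted correctly" figure concrete, the following is a minimal sketch of how a contact map could be built from residue coordinates and how a reconstructed map could be scored against it. It is not the authors' code: the function names `contact_map` and `contact_accuracy`, the 8 Å distance cutoff, the 0.5 decision threshold, and the toy coordinates are all assumptions made for illustration, and the score shown (fraction of matching matrix entries) is only one possible reading of the reported metric.

```python
import numpy as np


def contact_map(coords, cutoff=8.0):
    """Binary contact map from an (n_residues, 3) array of coordinates.

    The 8.0 (angstrom) cutoff is an assumed value, not taken from the abstract.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise difference vectors via broadcasting: shape (n, n, 3).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return (dist < cutoff).astype(np.uint8)


def contact_accuracy(true_map, decoded_probs, threshold=0.5):
    """Fraction of matrix entries where the thresholded reconstruction
    agrees with the input contact map (hypothetical scoring choice)."""
    predicted = np.asarray(decoded_probs) >= threshold
    return float(np.mean(predicted == np.asarray(true_map, dtype=bool)))


# Toy chain of four residues spaced 3.8 apart along the x axis.
coords = np.array(
    [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [11.4, 0.0, 0.0]]
)
cmap = contact_map(coords)

# Stand-in for a decoder output: the true map with one entry made wrong.
decoded = cmap.astype(float)
decoded[0, 3] = 0.9  # decoder "predicts" a contact that is not there
accuracy = contact_accuracy(cmap, decoded)
```

In a real pipeline the `decoded` array would come from the decoder of a trained autoencoder applied to each trajectory frame, and the per-frame scores would be averaged over the trajectory.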