Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas 75275, United States.
Department of Statistical Science, Southern Methodist University, Dallas, Texas 75275, United States.
J Phys Chem B. 2021 May 20;125(19):5022-5034. doi: 10.1021/acs.jpcb.1c02081. Epub 2021 May 11.
Proteins are the molecular machines of life. The multitude of possible conformations that proteins can adopt determines their free-energy landscapes. However, the inherently high dimensionality of a protein free-energy landscape poses a challenge to deciphering how proteins perform their functions. For this reason, dimensionality reduction is an active field of research for molecular biologists. Uniform manifold approximation and projection (UMAP) is a dimensionality reduction method based on a fuzzy topological analysis of data. In the present study, the performance of UMAP is compared with that of other popular dimensionality reduction methods, namely t-distributed stochastic neighbor embedding (t-SNE), principal component analysis (PCA), and time-structure-based independent component analysis (tICA), in the context of analyzing molecular dynamics simulations of the circadian clock protein VIVID. A good dimensionality reduction method should accurately represent the structure of the data on the projected components. Comparing the raw high-dimensional data with the projections produced by each method, using several metrics, showed that UMAP outperforms the linear reduction methods (PCA and tICA) and offers competitive performance at a scalable computational cost.
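The abstract does not name the metrics used to compare raw data with projections, so the following is only a minimal sketch of one plausible check: the fraction of each point's k nearest neighbors that survives the projection. The function names (`knn_indices`, `neighbor_preservation`), the choice of metric, and the synthetic data are all assumptions made for illustration; they are not the authors' implementation and do not involve the VIVID trajectories. The two "projections" here are simple coordinate selections standing in for the output of a real method such as PCA, tICA, t-SNE, or UMAP.

```python
import math
import random


def knn_indices(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to point i.
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(set(order[:k]))
    return neighbors


def neighbor_preservation(high_dim, low_dim, k=5):
    """Mean fraction of each point's k nearest neighbors kept after projection.

    Hypothetical metric for illustration: 1.0 means every neighborhood is
    preserved, values near k / (n - 1) are what random placement would give.
    """
    high_nn = knn_indices(high_dim, k)
    low_nn = knn_indices(low_dim, k)
    overlaps = [len(h & l) / k for h, l in zip(high_nn, low_nn)]
    return sum(overlaps) / len(overlaps)


# Synthetic stand-in for simulation frames (assumption, not real MD data):
# the first two coordinates carry the structure, the last four are small noise.
rng = random.Random(0)
frames = [
    [rng.gauss(0.0, 5.0) for _ in range(2)] + [rng.gauss(0.0, 0.1) for _ in range(4)]
    for _ in range(40)
]

# Two toy "projections" onto two components each.
informative = [f[:2] for f in frames]   # keeps the high-variance coordinates
noise_only = [f[-2:] for f in frames]   # keeps only low-variance noise

score_identity = neighbor_preservation(frames, frames)
score_informative = neighbor_preservation(frames, informative)
score_noise = neighbor_preservation(frames, noise_only)

print("identity:", score_identity)
print("informative projection:", score_informative)
print("noise-only projection:", score_noise)
```

A projection that keeps the structure-carrying coordinates should score higher than one that keeps only noise. In practice, the low-dimensional coordinates would come from the fitted reduction method rather than from coordinate selection, and a brute-force neighbor search like this one would need to be replaced by something faster for full trajectories.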