Wu Jiayi, Ma Yong-Bei, Congdon Charles, Brett Bevin, Chen Shuobing, Xu Yaofang, Ouyang Qi, Mao Youdong
State Key Laboratory for Artificial Microstructure and Mesoscopic Physics, Institute of Condensed Matter Physics, School of Physics, Center for Quantitative Biology, Peking University, Beijing, China.
Intel Parallel Computing Center for Structural Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America.
PLoS One. 2017 Aug 7;12(8):e0182130. doi: 10.1371/journal.pone.0182130. eCollection 2017.
Structural heterogeneity in single-particle cryo-electron microscopy (cryo-EM) data represents a major challenge for high-resolution structure determination. Unsupervised classification may serve as the first step in assessing structural heterogeneity. However, traditional algorithms for unsupervised classification, such as K-means clustering and maximum likelihood optimization, tend to misclassify images as the signal-to-noise ratio (SNR) of the image data decreases, while also demanding greater computational cost. Overcoming these limitations requires further development of clustering algorithms for high-performance cryo-EM data processing. Here we introduce an unsupervised single-particle clustering algorithm derived from a statistical manifold learning framework called generative topographic mapping (GTM). We show that, for data with lower SNRs and in the absence of input references, unsupervised GTM clustering improves classification accuracy by about 40%. Applications to several experimental datasets suggest that our algorithm can detect subtle structural differences among classes via a hierarchical clustering strategy. After code optimization for a high-performance computing (HPC) environment, our software implementation was able to generate thousands of reference-free class averages within hours in a massively parallel fashion, which significantly improves ab initio 3D reconstruction and assists in the computational purification of homogeneous datasets for high-resolution visualization.
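To make the idea of GTM-based clustering more concrete, the following is a minimal sketch of a textbook GTM fitted by expectation-maximization, with each data vector assigned to the latent grid node that has the highest posterior responsibility. It is not the authors' software: the function name `fit_gtm`, its parameters, the PCA-based initialization, and the crude initial value of `beta` are all assumptions made for illustration, and the sketch includes neither image alignment, nor the hierarchical strategy, nor any HPC optimization.

```python
import numpy as np


def _responsibilities(proj, data, beta):
    """E-step: posterior probability of each latent node for each data vector."""
    # Squared distances between node projections (K, D) and data (N, D) -> (K, N).
    dist2 = (
        (proj ** 2).sum(axis=1)[:, None]
        - 2.0 * proj @ data.T
        + (data ** 2).sum(axis=1)[None, :]
    )
    dist2 = np.maximum(dist2, 0.0)  # guard against small negative round-off
    log_r = -0.5 * beta * dist2
    log_r -= log_r.max(axis=0, keepdims=True)  # stabilize the exponentials
    resp = np.exp(log_r)
    resp /= resp.sum(axis=0, keepdims=True)
    return resp, dist2


def fit_gtm(data, grid_size=4, n_basis=3, n_iter=30, reg=1e-3):
    """Fit a small 2D-latent GTM (hypothetical helper, illustration only).

    Returns (labels, resp): labels[n] is the index of the most responsible
    latent node for data[n]; resp has shape (grid_size**2, n_samples).
    """
    n_samples, n_dims = data.shape

    # Regular grid of latent nodes and of RBF centers in [-1, 1]^2.
    g = np.linspace(-1.0, 1.0, grid_size)
    latent = np.array([(a, b) for a in g for b in g])    # (K, 2)
    c = np.linspace(-1.0, 1.0, n_basis)
    centers = np.array([(a, b) for a in c for b in c])   # (M, 2)
    width = 2.0 / (n_basis - 1)

    # Basis matrix: Gaussian RBFs plus a constant bias column.
    d2 = ((latent[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    phi = np.hstack([np.exp(-d2 / (2.0 * width ** 2)), np.ones((len(latent), 1))])

    # Initialize the mapping so the grid spans the first two principal axes.
    mean = data.mean(axis=0)
    _, s, vt = np.linalg.svd(data - mean, full_matrices=False)
    scale = s[:2] / np.sqrt(n_samples)
    target = mean + (latent * scale) @ vt[:2]
    weights = np.linalg.lstsq(phi, target, rcond=None)[0]
    beta = 1.0 / max(np.var(data - mean), 1e-12)  # crude initial inverse variance

    for _ in range(n_iter):
        resp, _ = _responsibilities(phi @ weights, data, beta)
        # M-step: ridge-regularized weighted least squares for the weights.
        g_diag = resp.sum(axis=1)
        lhs = phi.T @ (g_diag[:, None] * phi) + (reg / beta) * np.eye(phi.shape[1])
        weights = np.linalg.solve(lhs, phi.T @ (resp @ data))
        # M-step: update the inverse noise variance with the new weights.
        _, dist2 = _responsibilities(phi @ weights, data, beta)
        beta = n_samples * n_dims / max((resp * dist2).sum(), 1e-12)

    resp, _ = _responsibilities(phi @ weights, data, beta)
    return resp.argmax(axis=0), resp
```

In a cryo-EM setting each row of `data` would be a flattened, pre-aligned particle image, and averaging the images that share a label would give one reference-free class average per occupied node. That usage is an assumption about how such a sketch could be applied, not a description of the published pipeline.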