Department of Biological Sciences, Purdue University.
J Struct Biol. 2023 Dec;215(4):108041. doi: 10.1016/j.jsb.2023.108041. Epub 2023 Nov 7.
Many macromolecules in biological systems exist as helical polymers. However, the inherent polymorphism and heterogeneity of samples complicate the reconstruction of helical polymers from cryo-EM images. Currently available 2D classification methods are effective at separating particles of interest from contaminants, but they do not effectively differentiate between polymorphs, which leaves the 2D classes heterogeneous. It is therefore crucial to develop a method that can computationally divide a dataset of polymorphic helical structures into homogeneous subsets. In this work, we used deep-learning language models to embed the filaments as vectors in a high-dimensional space and group them into clusters. Tests with both simulated and experimental datasets demonstrate that our method, HLM (Helical classification with Language Model), can effectively distinguish different types of filaments in the presence of many contaminants and at low signal-to-noise ratios. We also demonstrate that HLM can isolate homogeneous subsets of particles from a publicly available dataset, leading to the discovery of a previously unreported filament variant with an extra density around the tau filaments.
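The embed-then-cluster idea can be illustrated with a toy sketch. This is not the HLM implementation: the abstract does not say how filaments are tokenized or which language model is used. The sketch assumes each filament is a sequence of 2D class IDs assigned to its segments, and it swaps in a normalized bag-of-words count vector for the learned embedding and plain k-means for the clustering step. All function names and data below are hypothetical.

```python
from collections import Counter
import math

def embed(filament, vocab_size):
    """Toy embedding (assumption): L2-normalized counts of 2D class IDs in one filament."""
    counts = Counter(filament)
    vec = [counts.get(i, 0) for i in range(vocab_size)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [vectors[0]]
    while len(centers) < k:
        # Next center: the vector farthest from all centers chosen so far.
        centers.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers)))
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(v, centers[j])) for v in vectors]
        # Move each center to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical data: each filament is the list of 2D class IDs of its segments.
filaments = [
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0],
    [2, 3, 2, 3, 3],
    [3, 2, 3, 2],
    [0, 0, 1, 1, 0],
    [2, 2, 3, 3],
]
vectors = [embed(f, vocab_size=4) for f in filaments]
labels = kmeans(vectors, k=2)
print(labels)
```

In this toy input, filaments 0, 1 and 4 draw only on classes 0 and 1 and end up in one cluster, while filaments 2, 3 and 5 end up in the other. A learned embedding would presumably also capture which classes co-occur along a filament, which a count vector ignores.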