Gao Shan, Zeng Xiangrui, Xu Min, Zhang Fa
High Performance Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.
University of Chinese Academy of Sciences, Beijing, China.
Front Mol Biosci. 2022 Jul 5;9:931949. doi: 10.3389/fmolb.2022.931949. eCollection 2022.
Cryo-electron tomography (Cryo-ET) is an emerging technology for three-dimensional (3D) visualization of macromolecular structures in their near-native state. To recover the structures of macromolecules, the millions of diverse macromolecules captured in tomograms must be accurately classified into structurally homogeneous subsets. Although existing supervised deep learning-based methods have improved classification accuracy, the trained models have limited ability to classify novel macromolecules that were not seen during training. Adapting a trained model to a novel class requires a large number of labeled macromolecules of that class, and data labeling is time-consuming and labor-intensive. In this work, we propose a few-shot learning method for the classification of novel macromolecules, named FSCC. FSCC uses a two-stage training strategy to enhance the model's generalization to novel macromolecules. First, FSCC uses contrastive learning to pre-train the model on a sufficient number of labeled macromolecules. Second, FSCC uses distribution calibration to re-train the classifier, enabling the model to classify macromolecules of novel classes (classes unseen during pre-training). Distribution calibration transfers the knowledge learned in the pre-training stage to novel classes for which only a few labeled macromolecules are available. Experiments were performed on both synthetic and real datasets. On the synthetic datasets, FSCC achieves performance competitive with the state-of-the-art (SOTA) method based on supervised deep learning while needing only five labeled macromolecules per novel class, whereas the SOTA method needs 1,100 to 1,500 labeled macromolecules per novel class. On the real datasets, FSCC improves accuracy by 5% to 16% compared with the baseline model.
These results demonstrate the good generalization ability of contrastive learning and distribution calibration for classifying novel macromolecules with very few labeled macromolecules.
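The abstract does not spell out how distribution calibration works, so the following is only a minimal sketch of the general idea, under stated assumptions: a novel class with few labeled samples borrows mean and covariance statistics from the nearest base classes, and synthetic features are then sampled to re-train a classifier. The function names, the parameters `k` and `alpha`, and the use of `alpha` as a diagonal spread term are all hypothetical choices for illustration, not the FSCC implementation.

```python
import numpy as np

def calibrate(support_feature, base_means, base_covs, k=2, alpha=0.2):
    """Estimate a Gaussian for a novel class from one support feature.

    Hypothetical helper: borrows statistics from the k base classes whose
    means are closest to the support feature.
    """
    # Euclidean distance from the support feature to every base-class mean.
    dists = np.linalg.norm(base_means - support_feature, axis=1)
    nearest = np.argsort(dists)[:k]
    # Calibrated mean: average of the nearest base means and the support feature.
    mean = (base_means[nearest].sum(axis=0) + support_feature) / (k + 1)
    # Calibrated covariance: average of the nearest base covariances plus an
    # assumed diagonal spread term controlled by alpha.
    cov = base_covs[nearest].mean(axis=0) + alpha * np.eye(support_feature.shape[0])
    return mean, cov

def sample_features(mean, cov, n, rng):
    """Draw n synthetic feature vectors from the calibrated Gaussian."""
    return rng.multivariate_normal(mean, cov, size=n)

# Toy data standing in for features from a pre-trained encoder (assumption).
rng = np.random.default_rng(0)
dim, n_base = 8, 5
base_means = rng.normal(size=(n_base, dim))
base_covs = np.stack([np.eye(dim) * (0.5 + 0.1 * i) for i in range(n_base)])
support = rng.normal(size=dim)  # one labeled sample of a novel class

mean, cov = calibrate(support, base_means, base_covs)
synthetic = sample_features(mean, cov, n=100, rng=rng)
```

The sampled `synthetic` features, together with the few real support features, could then be used to fit any simple classifier for the novel classes.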