Dhakal Ashwin, Gyawali Rajan, Wang Liguo, Cheng Jianlin
Department of Electrical Engineering and Computer Science, NextGen Precision Health, University of Missouri, Columbia, MO 65211, USA. Fax: 573-882-8318.
Laboratory for BioMolecular Structure (LBMS), Brookhaven National Laboratory, Upton, NY 11973, USA.
bioRxiv. 2023 Feb 22:2023.02.21.529443. doi: 10.1101/2023.02.21.529443.
Cryo-electron microscopy (cryo-EM) is currently the most powerful technique for determining the structures of large protein complexes and assemblies. Picking single-protein particles from cryo-EM micrographs (images) is a key step in reconstructing protein structures. However, the widely used template-based particle picking process is labor-intensive and time-consuming. Although emerging machine learning-based particle picking can potentially automate the process, its development is severely hindered by the lack of large, high-quality, manually labelled training data. Here, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for single protein particle picking and analysis, to address this bottleneck. It consists of manually labelled cryo-EM micrographs from 32 non-redundant, representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). It includes 9,089 diverse, high-resolution micrographs (∼300 cryo-EM images per EMPIAR dataset) in which the coordinates of protein particles were labelled by human experts. The protein particle labelling process was rigorously validated by both 2D particle class validation and 3D density map validation with the gold standard. The dataset is expected to greatly facilitate the development of machine learning and artificial intelligence methods for automated cryo-EM protein particle picking. The dataset and data processing scripts are available at https://github.com/BioinfoMachineLearning/cryoppp.