Department of Electrical Engineering and Computer Science, NextGen Precision Health, University of Missouri, Columbia, MO, 65211, USA.
Laboratory for BioMolecular Structure (LBMS), Brookhaven National Laboratory, Upton, NY, 11973, USA.
Sci Data. 2023 Jun 22;10(1):392. doi: 10.1038/s41597-023-02280-2.
Cryo-electron microscopy (cryo-EM) is a powerful technique for determining the structures of biological macromolecular complexes. Picking single-protein particles from cryo-EM micrographs is a crucial step in reconstructing protein structures. However, the widely used template-based particle picking process is labor-intensive and time-consuming. Although particle picking based on machine learning and artificial intelligence (AI) could automate the process, its development is hindered by the lack of large, high-quality labelled training data. To address this bottleneck, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for protein particle picking and analysis. It consists of labelled cryo-EM micrographs (images) from 34 representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). The dataset is 2.6 terabytes in size and includes 9,893 high-resolution micrographs with labelled protein particle coordinates. The labelling was rigorously validated through 2D particle class validation and 3D density map validation against the gold standard. The dataset is expected to greatly facilitate the development of both AI and classical methods for automated cryo-EM protein particle picking.