CryoPPP: A Large Expert-Labelled Cryo-EM Image Dataset for Machine Learning Protein Particle Picking.

Authors

Dhakal Ashwin, Gyawali Rajan, Wang Liguo, Cheng Jianlin

Affiliations

Department of Electrical Engineering and Computer Science, NextGen Precision Health, University of Missouri, Columbia, MO 65211, USA. Fax: 573-882-8318.

Laboratory for BioMolecular Structure (LBMS), Brookhaven National Laboratory, Upton, NY 11973, USA.

Publication information

bioRxiv. 2023 Feb 22:2023.02.21.529443. doi: 10.1101/2023.02.21.529443.

Abstract

Cryo-electron microscopy (cryo-EM) is currently the most powerful technique for determining the structures of large protein complexes and assemblies. Picking single-protein particles from cryo-EM micrographs (images) is a key step in reconstructing protein structures. However, the widely used template-based particle picking process is labor-intensive and time-consuming. Though the emerging machine learning-based particle picking can potentially automate the process, its development is severely hindered by lack of large, high-quality, manually labelled training data. Here, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for single protein particle picking and analysis to address this bottleneck. It consists of manually labelled cryo-EM micrographs of 32 non-redundant, representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). It includes 9,089 diverse, high-resolution micrographs (∼300 cryo-EM images per EMPIAR dataset) in which the coordinates of protein particles were labelled by human experts. The protein particle labelling process was rigorously validated by both 2D particle class validation and 3D density map validation with the gold standard. The dataset is expected to greatly facilitate the development of machine learning and artificial intelligence methods for automated cryo-EM protein particle picking. The dataset and data processing scripts are available at https://github.com/BioinfoMachineLearning/cryoppp.
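
As a rough illustration of how expert-labelled particle coordinates could be used to score an automated picker, the sketch below greedily matches predicted particle centres to labelled ones on a single micrograph and reports precision and recall. This is not the authors' evaluation protocol: the function name `match_picks`, the (x, y) pixel-tuple representation, the greedy matching rule, and the 10-pixel tolerance are all assumptions made for illustration, and the coordinates are made-up toy values rather than data from CryoPPP.

```python
import math


def match_picks(predicted, labelled, max_dist):
    """Greedily match predicted particle centres to labelled centres.

    Hypothetical helper, not part of the CryoPPP scripts. Each labelled
    particle can be matched at most once; a prediction counts as a true
    positive if an unmatched label lies within max_dist pixels of it.
    Returns (true_positives, precision, recall).
    """
    unmatched = list(labelled)
    true_pos = 0
    for px, py in predicted:
        best_i, best_d = None, max_dist
        # Find the closest still-unmatched label within the tolerance.
        for i, (lx, ly) in enumerate(unmatched):
            d = math.hypot(px - lx, py - ly)
            if d <= best_d:
                best_i, best_d = i, d
        if best_i is not None:
            unmatched.pop(best_i)
            true_pos += 1
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(labelled) if labelled else 0.0
    return true_pos, precision, recall


# Toy coordinates (assumed, in pixels) for one micrograph.
labelled = [(100, 100), (300, 250), (520, 80)]
predicted = [(103, 98), (298, 255), (700, 700), (105, 101)]

tp, precision, recall = match_picks(predicted, labelled, max_dist=10)
print(tp, precision, round(recall, 3))
```

Greedy matching depends on the order of the predictions, so a one-to-one assignment solver may be preferable when particles are densely packed; the tolerance would normally be chosen relative to the particle diameter.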
