Suppr超能文献

CEM500K,一个用于深度学习的大规模异质无标签细胞电子显微镜图像数据集。

CEM500K, a large-scale heterogeneous unlabeled cellular electron microscopy image dataset for deep learning.

机构信息

Center for Molecular Microscopy, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, United States.

Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Frederick, United States.

出版信息

Elife. 2021 Apr 8;10:e65894. doi: 10.7554/eLife.65894.

Abstract

Automated segmentation of cellular electron microscopy (EM) datasets remains a challenge. Supervised deep learning (DL) methods that rely on region-of-interest (ROI) annotations yield models that fail to generalize to unrelated datasets. Newer unsupervised DL algorithms require relevant pre-training images, however, pre-training on currently available EM datasets is computationally expensive and shows little value for unseen biological contexts, as these datasets are large and homogeneous. To address this issue, we present CEM500K, a nimble 25 GB dataset of 0.5 × 10 unique 2D cellular EM images curated from nearly 600 three-dimensional (3D) and 10,000 two-dimensional (2D) images from >100 unrelated imaging projects. We show that models pre-trained on CEM500K learn features that are biologically relevant and resilient to meaningful image augmentations. Critically, we evaluate transfer learning from these pre-trained models on six publicly available and one newly derived benchmark segmentation task and report state-of-the-art results on each. We release the CEM500K dataset, pre-trained models and curation pipeline for model building and further expansion by the EM community. Data and code are available at https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10592/ and https://git.io/JLLTz.

摘要

细胞电子显微镜 (EM) 数据集的自动分割仍然是一个挑战。依赖于感兴趣区域 (ROI) 注释的监督深度学习 (DL) 方法生成的模型无法推广到不相关的数据集。较新的无监督 DL 算法需要相关的预训练图像,但是,在当前可用的 EM 数据集上进行预训练在计算上是昂贵的,并且对于看不见的生物背景几乎没有价值,因为这些数据集很大且同质。为了解决这个问题,我们提出了 CEM500K,这是一个灵活的 25GB 数据集,包含 0.5×10 个独特的 2D 细胞 EM 图像,这些图像是从近 600 个 3D 和 10000 个 2D 图像中提取出来的,这些图像来自于 >100 个不相关的成像项目。我们表明,在 CEM500K 上预训练的模型学习到的特征是具有生物学意义的,并且对有意义的图像增强具有弹性。至关重要的是,我们评估了这些预训练模型在六个公开可用的和一个新衍生的基准分割任务中的迁移学习,并在每个任务上报告了最先进的结果。我们发布了 CEM500K 数据集、预训练模型和模型构建的策展管道,以供 EM 社区进一步扩展。数据和代码可在 https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10592/https://git.io/JLLTz 获得。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f4ec/8032397/aa7095b2ec0d/elife-65894-fig1.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验