Department of Information Technology, Uppsala University, Uppsala, Sweden.
Department of Information Technology, Uppsala University, Uppsala, Sweden; Vironova AB, Gävlegatan 22, Stockholm, Sweden.
Comput Methods Programs Biomed. 2021 Sep;209:106318. doi: 10.1016/j.cmpb.2021.106318. Epub 2021 Jul 29.
BACKGROUND AND OBJECTIVE: To realize the full potential of deep learning (DL) models, which includes understanding the interplay between model size, training strategy, and amount of training data, researchers and developers need access to new dedicated image datasets; i.e., annotated collections of images representing real-world problems with all their variations, complexity, limitations, and noise. Here, we present, describe, and make freely available an annotated transmission electron microscopy (TEM) image dataset. It constitutes an interesting challenge for many practical applications in virology and epidemiology; e.g., virus detection, segmentation, classification, and novelty detection. We also present benchmarking results for virus detection and recognition using some of the top-performing (large and small) networks as well as a handcrafted very small network. We compare and evaluate transfer learning and training from scratch, hypothesizing that with a limited dataset, transfer learning is crucial for good performance of a large network, whereas our handcrafted small network performs relatively well when trained from scratch. This is one step towards understanding how much training data is needed for a given task.
METHODS: The benchmark dataset contains 1245 images of 22 virus classes. We propose a representative split of this dataset into training, validation, and test sets. Moreover, we compare different established DL networks and present a baseline DL solution for classifying a subset consisting of the 14 most-represented virus classes in the dataset.
RESULTS: Our best model, DenseNet201 pre-trained on ImageNet and fine-tuned on the training set, achieved a 0.921 F1-score and 93.1% accuracy on the proposed representative test set.
CONCLUSIONS: Public, real-world biomedical datasets are an important contribution and a necessity for better understanding the shortcomings, requirements, and potential improvements of deep learning solutions for biomedical problems, and for deploying such solutions in clinical settings. We compared transfer learning to learning from scratch on this dataset and hypothesize that, for datasets of limited size, transfer learning is crucial for large models to achieve good performance. Last but not least, we demonstrate the importance of application knowledge in creating datasets for training DL models and in analyzing their results.
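The results report both an F1-score (0.921) and an accuracy (93.1%). The two can differ when classes are unevenly represented. The following is a minimal sketch of how they can be computed from predicted and true labels. It assumes a macro-averaged F1 (the abstract does not state which averaging was used), and the function name `accuracy_and_macro_f1` and the toy labels are hypothetical illustrations, not data or code from the study.

```python
from collections import Counter


def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged F1) for two equal-length label sequences.

    Macro averaging is an assumption made for illustration; the abstract
    does not specify how its F1-score was averaged.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    # Per-class true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall.
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)

    accuracy = sum(tp.values()) / len(y_true)
    macro_f1 = sum(f1_scores) / len(classes)  # every class weighted equally
    return accuracy, macro_f1


# Toy, made-up labels: class "A" is well represented, class "B" is not.
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "A", "A", "B"]
acc, f1 = accuracy_and_macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, macro F1={f1:.3f}")
```

In this toy case a single error on the small class lowers the macro F1 more than the accuracy, because macro averaging gives every class the same weight regardless of how many images it has.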