Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool.

Authors

Englert Nils, Schwab Constantin, Legnar Maximilian, Weis Cleo-Aron

Affiliations

Section Computational Pathology Heidelberg, Institute of Pathology Heidelberg, University Hospital Heidelberg, University of Heidelberg, Heidelberg, Germany.

Institute of Pathology Heidelberg, University Hospital Heidelberg, University of Heidelberg, Heidelberg, Germany.

Publication information

J Pathol Inform. 2024 Oct 23;15:100402. doi: 10.1016/j.jpi.2024.100402. eCollection 2024 Dec.

Abstract

INTRODUCTION

Extracting metadata from digitized slides or whole slide image files is a frequent, laborious, and tedious task. In this work, we present a tool that automatically extracts all relevant slide information, such as case number, year, slide number, block number, and staining, from the macro-images of scanned slides. We named the tool Babel fish because it helps translate the relevant information printed on the slide. The tool encodes certain basic assumptions, for example about where on the slide a given piece of information is printed; these assumptions can be adapted to the conventions of the respective site. The extracted metadata can then be used to sort digital slides into databases or to link them with associated case IDs from laboratory information systems.

MATERIAL AND METHODS

The tool is based on optical character recognition (OCR). For most information, the easyOCR tool is used. For the block number, and for cases with insufficient results in the first OCR round, a second OCR pass with pytesseract is applied. Two datasets are used: one for tool development, with 342 slides, and another for testing, with 110 slides.
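The fallback part of this two-stage design can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how an "insufficient" first-round result is detected, so the regular-expression check, the `BLOCK_PATTERN` layout, and all function names here are assumptions. The OCR backends are passed in as plain callables so the control flow can be tried without either library installed.

```python
import re
from typing import Callable, Optional

# Hypothetical block-label layout (e.g. "A1" or "12"); real labels are site-specific.
BLOCK_PATTERN = re.compile(r"[A-Z]?\d{1,2}")


def is_plausible(text: Optional[str], pattern: re.Pattern) -> bool:
    """Assumed sufficiency check: the stripped text must match the expected layout."""
    return bool(text) and pattern.fullmatch(text.strip()) is not None


def read_with_fallback(
    image,
    primary: Callable[[object], str],
    secondary: Callable[[object], str],
    pattern: re.Pattern,
) -> Optional[str]:
    """Run the primary OCR; call the secondary OCR only if the result is implausible."""
    first = primary(image)
    if is_plausible(first, pattern):
        return first.strip()
    second = secondary(image)
    if is_plausible(second, pattern):
        return second.strip()
    return None  # neither engine produced a usable reading


def easyocr_backend(languages=("en",)) -> Callable[[object], str]:
    """Wrap EasyOCR as a callable; imported lazily so the sketch runs without it."""
    import easyocr

    reader = easyocr.Reader(list(languages))
    # detail=0 makes readtext return only the recognized strings.
    return lambda image: " ".join(reader.readtext(image, detail=0))


def pytesseract_backend(config: str = "--psm 7") -> Callable[[object], str]:
    """Wrap pytesseract as a callable; '--psm 7' (single text line) is an assumed setting."""
    import pytesseract

    return lambda image: pytesseract.image_to_string(image, config=config)
```

With both libraries installed, a cropped label region could be read with `read_with_fallback(crop, easyocr_backend(), pytesseract_backend(), BLOCK_PATTERN)`, where `crop` is a hypothetical image array or file path.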

RESULTS

For the testing set, the overall accuracy for retrieving all relevant information per slide is 0.982. Of note, the accuracy for most individual items is 1.000, whereas the accuracy for block number detection is 0.982.
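The relation between the per-slide and per-item figures can be made concrete with a short sketch. A slide counts as correct only if every item matches, so the overall accuracy cannot exceed the weakest per-item accuracy. On 110 test slides, a value of 0.982 is consistent with 108 fully correct slides (108/110 ≈ 0.982); that count is an inference from the rounded figure, not stated in the abstract. The field names and the toy records below are synthetic and purely illustrative.

```python
# Hypothetical field names; the abstract lists these items but not their identifiers.
FIELDS = ("case_number", "year", "slide_number", "block_number", "staining")


def field_accuracy(predicted, truth, field):
    """Fraction of slides on which a single field was read correctly."""
    hits = sum(p[field] == t[field] for p, t in zip(predicted, truth))
    return hits / len(truth)


def slide_accuracy(predicted, truth):
    """Fraction of slides on which every field was read correctly."""
    hits = sum(all(p[f] == t[f] for f in FIELDS) for p, t in zip(predicted, truth))
    return hits / len(truth)


# Synthetic data: 110 slides, two of them with a wrong block number.
truth = [
    {"case_number": str(i), "year": "24", "slide_number": "1",
     "block_number": "A1", "staining": "HE"}
    for i in range(110)
]
predicted = [dict(t) for t in truth]
predicted[0]["block_number"] = "A7"
predicted[1]["block_number"] = "4"

print(round(field_accuracy(predicted, truth, "staining"), 3))      # 1.0
print(round(field_accuracy(predicted, truth, "block_number"), 3))  # 0.982
print(round(slide_accuracy(predicted, truth), 3))                  # 0.982
```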

CONCLUSION

The Babel fish tool can be used to rename vast amounts of whole slide image files in an image analysis pipeline. Furthermore, it could be an essential part of DICOM conversion pipelines, as it extracts relevant metadata like case number, year, block ID, and staining.
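As an illustration of the renaming use case, the sketch below builds a file name from extracted metadata and renames a slide file. The naming scheme (`year_case_block_slide_staining`), the `SlideMetadata` class, and the function names are hypothetical; the abstract does not describe the tool's actual output format.

```python
import re
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SlideMetadata:
    """Hypothetical container for the items the abstract says are extracted."""
    case_number: str
    year: str
    block: str
    slide_number: str
    staining: str


def build_filename(meta: SlideMetadata, suffix: str) -> str:
    """Join the metadata into one name, replacing unsafe characters with '-'."""
    parts = [meta.year, meta.case_number, meta.block, meta.slide_number, meta.staining]
    safe = [re.sub(r"[^A-Za-z0-9]+", "-", p.strip()).strip("-") for p in parts]
    return "_".join(safe) + suffix


def rename_slide(path: Path, meta: SlideMetadata, dry_run: bool = True) -> Path:
    """Return the target path; rename on disk only when dry_run is False."""
    target = path.with_name(build_filename(meta, path.suffix))
    if target.exists() and target != path:
        raise FileExistsError(f"refusing to overwrite {target}")
    if not dry_run:
        path.rename(target)
    return target


meta = SlideMetadata(case_number="12345", year="24", block="A1",
                     slide_number="2", staining="HE")
print(rename_slide(Path("scan_0001.svs"), meta).name)  # 24_12345_A1_2_HE.svs
```

Defaulting to a dry run and refusing to overwrite existing files is a deliberate choice here, since a misread label would otherwise silently replace another slide.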

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4d98/11616518/2b88e00d3ec9/gr1.jpg
