Englert Nils, Schwab Constantin, Legnar Maximilian, Weis Cleo-Aron
Section Computational Pathology Heidelberg, Institute of Pathology Heidelberg, University Hospital Heidelberg, University of Heidelberg, Heidelberg, Germany.
Institute of Pathology Heidelberg, University Hospital Heidelberg, University of Heidelberg, Heidelberg, Germany.
J Pathol Inform. 2024 Oct 23;15:100402. doi: 10.1016/j.jpi.2024.100402. eCollection 2024 Dec.
Metadata extraction from digitized slides or whole slide image files is a frequent, laborious, and tedious task. In this work, we present a tool to automatically extract all relevant slide information, such as case number, year, slide number, block number, and staining, from the macro-images of the scanned slide. We named the tool Babel fish as it helps translate relevant information printed on the slide. The tool encodes certain basic assumptions, for example about where on the slide each piece of information is located. These assumptions can be adapted where the layout differs. The extracted metadata can then be used to sort digital slides into databases or to link them with associated case IDs from laboratory information systems.
The tool is based on optical character recognition (OCR). For most information, the easyOCR tool is used. For the block number and for cases with insufficient results in the first OCR round, a second OCR with pytesseract is applied. Two datasets are used: one with 342 slides for tool development and another with 110 slides for testing.
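The two-round OCR strategy described above can be sketched as follows. This is a minimal illustration, not the published implementation: the function names, the `BLOCK_PATTERN` regular expression, and the `--psm 7` setting are assumptions introduced here, and the abstract does not state how a result is judged "insufficient".

```python
import re
from typing import Callable, Optional

# Hypothetical format for a block number (e.g. "A1" or "12").
# The real label format is not given in the abstract.
BLOCK_PATTERN = re.compile(r"^[A-Z]?\d{1,2}$")


def read_with_fallback(
    image_path: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    is_valid: Callable[[str], bool],
) -> Optional[str]:
    """Run the primary OCR; if its output fails validation, try the fallback."""
    text = primary(image_path).strip()
    if is_valid(text):
        return text
    text = fallback(image_path).strip()
    return text if is_valid(text) else None


def easyocr_reader(image_path: str) -> str:
    """First OCR round with easyOCR (imported lazily so the sketch loads without it)."""
    import easyocr

    reader = easyocr.Reader(["en"], gpu=False)
    # detail=0 returns only the recognized strings, without boxes or scores.
    return " ".join(reader.readtext(image_path, detail=0))


def tesseract_reader(image_path: str) -> str:
    """Second OCR round with pytesseract, treating the crop as a single text line."""
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), config="--psm 7")
```

A call such as `read_with_fallback(path, easyocr_reader, tesseract_reader, lambda t: bool(BLOCK_PATTERN.match(t)))` would then return the first reading that matches the assumed format, or `None` if both rounds fail.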
For the testing set, the overall accuracy for retrieving all relevant information per slide is 0.982. Of note, the accuracy for most information parts is 1.000, whereas the accuracy for the block number detection is 0.982.
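With 110 test slides, an accuracy of 0.982 is consistent with 108 correctly read slides (108/110 ≈ 0.982), although the abstract does not report the raw counts. The sketch below shows one way per-field and per-slide accuracy could be computed. The field names and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, Sequence

# Hypothetical field names, following the metadata items listed in the abstract.
FIELDS = ("case_number", "year", "slide_number", "block_number", "staining")

Record = Dict[str, str]


def field_accuracy(
    predictions: Sequence[Record], ground_truth: Sequence[Record], field: str
) -> float:
    """Fraction of slides for which a single field was read correctly."""
    hits = sum(p[field] == g[field] for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)


def slide_accuracy(
    predictions: Sequence[Record], ground_truth: Sequence[Record]
) -> float:
    """Fraction of slides for which every field was read correctly."""
    hits = sum(
        all(p[f] == g[f] for f in FIELDS)
        for p, g in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)
```

Under this definition, a slide counts as correct only if all fields match, so the overall accuracy can never exceed the accuracy of the weakest field.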
The Babel fish tool can be used to rename vast amounts of whole slide image files in an image analysis pipeline. Furthermore, it could be an essential part of DICOM conversion pipelines, as it extracts relevant metadata like case number, year, block ID, and staining.