The impact of OCR accuracy on automated cancer classification of pathology reports.

Author information

Zuccon Guido, Nguyen Anthony N, Bergheim Anton, Wickman Sandra, Grayson Narelle

Affiliation

The Australian e-Health Research Centre, CSIRO ICT Centre, Brisbane, Australia.

Publication information

Stud Health Technol Inform. 2012;178:250-6.

PMID:22797049

Abstract

OBJECTIVE

To evaluate the effects of Optical Character Recognition (OCR) on the automatic cancer classification of pathology reports.

METHOD

Scanned images of pathology reports were converted to electronic free text using a commercial OCR system. A state-of-the-art cancer classification system, the Medical Text Extraction (MEDTEX) system, was used to automatically classify the OCR reports. The classifications produced by MEDTEX on the OCR versions of the reports were compared with the classifications obtained from a human-amended version of the OCR reports.
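
The comparison described above can be pictured as measuring how often the two versions of a report receive the same label. The following is a minimal sketch only, not the evaluation code used in the study: the function name `agreement_rate`, the report identifiers, and the label strings are hypothetical, and the abstract does not state which agreement measure the authors used.

```python
from typing import Mapping


def agreement_rate(ocr_labels: Mapping[str, str],
                   amended_labels: Mapping[str, str]) -> float:
    """Fraction of reports given the same label in both versions.

    Both arguments map a report identifier to a classification label.
    Only reports present in both mappings are compared.
    """
    shared = ocr_labels.keys() & amended_labels.keys()
    if not shared:
        raise ValueError("the two label sets have no reports in common")
    matches = sum(ocr_labels[r] == amended_labels[r] for r in shared)
    return matches / len(shared)


# Hypothetical labels, for illustration only (not data from the study).
ocr = {"r1": "notifiable", "r2": "non-notifiable", "r3": "notifiable"}
amended = {"r1": "notifiable", "r2": "notifiable", "r3": "notifiable"}
rate = agreement_rate(ocr, amended)  # two of the three reports agree
```

A disagreement on a report, such as the hypothetical `r2` here, would indicate that an OCR error changed the classification outcome.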

RESULTS

The OCR system recognised scanned pathology reports with up to 99.12% character accuracy and up to 98.95% word accuracy. OCR errors had minimal impact on the automatic classification of scanned pathology reports into notifiable groups. However, their impact is not negligible for the extraction of cancer notification items such as primary site and histological type.
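
Character and word accuracy are commonly derived from the edit distance between the OCR output and a corrected reference text. The sketch below assumes the definition accuracy = 1 - (edit distance / reference length); the abstract does not give the exact formula used in the study, and the helper names and sample strings are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref: Sequence, hyp: Sequence) -> float:
    """Assumed definition: 1 - edit_distance / len(ref), floored at 0."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


# Hypothetical example: the OCR misreads the letter "l" as the digit "1".
reference = "invasive ductal carcinoma of the breast"
ocr_output = "invasive ducta1 carcinoma of the breast"

char_acc = accuracy(reference, ocr_output)                  # 1 error in 39 characters
word_acc = accuracy(reference.split(), ocr_output.split())  # 1 error in 6 words
```

Under this definition a single misread character costs one character but a whole word, which is one reason word accuracy tends to be lower than character accuracy on the same text.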

CONCLUSIONS

MEDTEX, the automatic cancer classification system used in this work, proved robust to the errors introduced when free-text pathology reports are acquired from scanned images through OCR software. However, issues emerge when extracting cancer notification items.

Similar articles

Classification of pathology reports for cancer registry notifications.

Stud Health Technol Inform. 2012;178:150-6.

Automatic extraction of cancer characteristics from free-text pathology reports for cancer notifications.

Stud Health Technol Inform. 2011;168:117-24.

Automated Cancer Registry Notifications: Validation of a Medical Text Analytics System for Identifying Patients with Cancer from a State-Wide Pathology Repository.

AMIA Annu Symp Proc. 2017 Feb 10;2016:964-973. eCollection 2016.

Automatic classification of scanned electronic health record documents.

Int J Med Inform. 2020 Dec;144:104302. doi: 10.1016/j.ijmedinf.2020.104302. Epub 2020 Oct 17.

Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model.

J Biomed Inform. 2009 Oct;42(5):937-49. doi: 10.1016/j.jbi.2008.12.005. Epub 2008 Dec 27.

Facilitating clinical research through automation: Combining optical character recognition with natural language processing.

Clin Trials. 2022 Oct;19(5):504-511. doi: 10.1177/17407745221093621. Epub 2022 May 24.

Design of an automatic coding algorithm for a multi-axial classification in pathology.

Stud Health Technol Inform. 2008;136:815-20.

Automated categorisation of clinical incident reports using statistical text classification.

Qual Saf Health Care. 2010 Dec;19(6):e55. doi: 10.1136/qshc.2009.036657. Epub 2010 Aug 19.

Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports.

Gastrointest Endosc. 2021 Mar;93(3):750-757. doi: 10.1016/j.gie.2020.08.038. Epub 2020 Sep 3.

Cited by

Salience of Medical Concepts of Inside Clinical Texts and Outside Medical Records for Referred Cardiovascular Patients.

J Healthc Inform Res. 2019 Jan 28;3(2):200-219. doi: 10.1007/s41666-019-00044-5. eCollection 2019 Jun.

Generating high-quality data abstractions from scanned clinical records: text-mining-assisted extraction of endometrial carcinoma pathology features as proof of principle.

BMJ Open. 2020 Jun 11;10(6):e037740. doi: 10.1136/bmjopen-2020-037740.

A review of medical terminology standards and structured reporting.

J Vet Diagn Invest. 2018 Jan;30(1):17-25. doi: 10.1177/1040638717738276. Epub 2017 Oct 15.

Classification of cancer-related death certificates using machine learning.

Australas Med J. 2013 May 30;6(5):292-9. doi: 10.4066/AMJ.2013.1654. Print 2013.
