An automated data verification approach for improving data quality in a clinical registry.

Author information

College of Biomedical Engineering and Instrument Science, Zhejiang University, Zheda Road, 310027 Hangzhou, China; Key Laboratory for Biomedical Engineering, Ministry of Education, China.

College of Biomedical Engineering and Instrument Science, Zhejiang University, Zheda Road, 310027 Hangzhou, China; Key Laboratory for Biomedical Engineering, Ministry of Education, China; School of Industrial Engineering, Eindhoven University of Technology, Eindhoven, the Netherlands.

Publication information

Comput Methods Programs Biomed. 2019 Nov;181:104840. doi: 10.1016/j.cmpb.2019.01.012. Epub 2019 Jan 31.

DOI:10.1016/j.cmpb.2019.01.012

PMID:30777618

Abstract

BACKGROUND AND OBJECTIVE

Data quality is crucial for clinical registry studies because it affects their credibility. In most such studies, a vulnerability arises because researchers record data on paper-based case report forms (CRFs) and then transcribe them into registry databases. Ensuring data quality therefore requires verifying the data in the registry. However, traditional manual data verification is time-consuming, labor-intensive, and of limited effect. Because paper-based CRFs and electronic medical records (EMRs) are two sources for verification, we propose an automated data verification approach based on optical character recognition (OCR) and information retrieval to identify data errors in a registry more efficiently.

METHODS

The automated verification approach is developed in three steps. First, we analyze scanned images of paper-based CRFs with machine-learning-enhanced OCR to recognize checkbox marks and handwriting. Then, we retrieve the related patient information from the EMRs using natural language processing (NLP) techniques. Finally, we compare the information retrieved in the previous two steps with the data in the registry and synthesize the results accordingly. The proposed automated method was applied in a Chinese registry study, and the difference between the automated and manual approaches was evaluated.
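The third step, comparing the CRF-derived and EMR-derived values with the registry, might be sketched as follows. This is a minimal illustration only: the abstract does not give the actual decision rule, so the function names (`verify_field`, `verify_record`), the labels, and the rule "the registry is acceptable if it agrees with at least one available source" are all assumptions made for the example.

```python
from typing import Dict, Optional


def _norm(value: Optional[str]) -> Optional[str]:
    """Normalize a raw value; empty or missing values become None."""
    if value is None or not value.strip():
        return None
    return value.strip().lower()


def verify_field(registry: Optional[str],
                 crf: Optional[str],
                 emr: Optional[str]) -> str:
    """Classify one registry field against the two source values.

    Hypothetical rule (an assumption, not the paper's method):
    returns "ok", "incomplete", "incorrect", or "unverifiable".
    """
    reg = _norm(registry)
    sources = {s for s in (_norm(crf), _norm(emr)) if s is not None}
    if not sources:
        return "unverifiable"   # neither source has a value to check against
    if reg is None:
        return "incomplete"     # a source has a value, the registry does not
    if reg in sources:
        return "ok"             # at least one source agrees with the registry
    return "incorrect"          # registry disagrees with every available source


def verify_record(registry: Dict[str, str],
                  crf: Dict[str, str],
                  emr: Dict[str, str]) -> Dict[str, str]:
    """Apply verify_field to every field seen in any of the three inputs."""
    fields = set(registry) | set(crf) | set(emr)
    return {f: verify_field(registry.get(f), crf.get(f), emr.get(f))
            for f in sorted(fields)}


if __name__ == "__main__":
    # Made-up field names and values, for illustration only.
    registry = {"sex": "Male", "smoker": "", "lvef": "55"}
    crf = {"sex": "male", "smoker": "yes", "lvef": "45"}
    emr = {"sex": "male", "lvef": "45"}
    print(verify_record(registry, crf, emr))
```

In practice the OCR and NLP outputs would carry confidence scores, and a real rule would probably weigh the two sources differently; the sketch ignores both.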

RESULTS

The automated approach was implemented in the Chinese Coronary Artery Disease Registry. For CRF data recognition, the recognition accuracies for checkbox marks and handwriting are 0.93 and 0.74, respectively. For EMR data extraction, the accuracy of information retrieval from textual electronic medical records is 0.97. The accuracy, recall, and time consumption of the automated approach are 0.93, 0.96, and 0.5 h, respectively, better than the corresponding values of the manual approach (0.92, 0.71, and 7.5 h).
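For readers less familiar with the metrics, the sketch below shows how accuracy and recall for an error-detection task are conventionally computed from confusion-matrix counts. The abstract does not state its exact definitions, so the standard ones are assumed here, and the counts in the example are hypothetical; only the two time figures (7.5 h and 0.5 h) come from the reported results.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all checked fields that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp: int, fn: int) -> float:
    """Fraction of the true data errors that the check actually flagged."""
    return tp / (tp + fn)


# Hypothetical counts for illustration only (not taken from the study):
# 8 real errors flagged, 2 real errors missed, 1 false alarm, 89 clean fields passed.
example_accuracy = accuracy(tp=8, fp=1, tn=89, fn=2)
example_recall = recall(tp=8, fn=2)

# Time ratio from the reported figures: manual 7.5 h versus automated 0.5 h.
speedup = 7.5 / 0.5

if __name__ == "__main__":
    print(f"accuracy={example_accuracy:.2f} recall={example_recall:.2f} speedup={speedup:.0f}x")
```

Under these definitions, the gap between the reported recalls (0.96 versus 0.71) means the manual check missed a considerably larger share of the true errors, even though the two accuracies are nearly equal.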

CONCLUSIONS

Compared with the manual data verification approach, the automated approach improves recall in identifying data errors, achieves higher accuracy, and takes far less time. The results show that the automated approach is more effective and efficient for identifying incomplete and incorrect data in a registry. The proposed approach has the potential to improve the quality of registry data.

Similar articles

[A customized method for information extraction from unstructured text data in the electronic medical records].

Beijing Da Xue Xue Bao Yi Xue Ban. 2018 Apr 18;50(2):256-263.

Facilitating clinical research through automation: Combining optical character recognition with natural language processing.

Clin Trials. 2022 Oct;19(5):504-511. doi: 10.1177/17407745221093621. Epub 2022 May 24.

Natural language processing and machine learning to enable automatic extraction and classification of patients' smoking status from electronic medical records.

Ups J Med Sci. 2020 Nov;125(4):316-324. doi: 10.1080/03009734.2020.1792010. Epub 2020 Jul 22.

Automating Ischemic Stroke Subtype Classification Using Machine Learning and Natural Language Processing.

J Stroke Cerebrovasc Dis. 2019 Jul;28(7):2045-2051. doi: 10.1016/j.jstrokecerebrovasdis.2019.02.004. Epub 2019 May 15.

Data for registry and quality review can be retrospectively collected using natural language processing from unstructured charts of arthroplasty patients.

Bone Joint J. 2020 Jul;102-B(7_Supple_B):99-104. doi: 10.1302/0301-620X.102B7.BJJ-2019-1574.R1.

Information extraction from multi-institutional radiology reports.

Artif Intell Med. 2016 Jan;66:29-39. doi: 10.1016/j.artmed.2015.09.007. Epub 2015 Oct 3.

A method for cohort selection of cardiovascular disease records from an electronic health record system.

Int J Med Inform. 2017 Jun;102:138-149. doi: 10.1016/j.ijmedinf.2017.03.015. Epub 2017 Mar 30.

Layout-aware information extraction from semi-structured medical images.

Comput Biol Med. 2019 Apr;107:235-247. doi: 10.1016/j.compbiomed.2019.02.016. Epub 2019 Feb 25.

Obtaining Knowledge in Pathology Reports Through a Natural Language Processing Approach With Classification, Named-Entity Recognition, and Relation-Extraction Heuristics.

JCO Clin Cancer Inform. 2019 Aug;3:1-8. doi: 10.1200/CCI.19.00008.

Cited by

Validity and accuracy of swespine data on surgery for central lumbar spinal stenosis and lumbar disc herniation: a cohort study of 796 patients.

Eur Spine J. 2025 Jun 12. doi: 10.1007/s00586-025-09049-8.

Leveraging healthcare professionals' insights to enhance data quality in medical big data platforms: A qualitative study.

Digit Health. 2025 Mar 17;11:20552076251326697. doi: 10.1177/20552076251326697. eCollection 2025 Jan-Dec.

Data Safety Monitoring Boards: Overview of Structure and Role in Spinal Cord Injury Studies.

Top Spinal Cord Inj Rehabil. 2024 Summer;30(3):67-75. doi: 10.46292/sci23-00084. Epub 2024 Aug 8.

Data Quality in Health Research: Integrative Literature Review.

J Med Internet Res. 2023 Oct 31;25:e41446. doi: 10.2196/41446.

Reliability and Efficiency of the CAPRI-3 Metastatic Prostate Cancer Registry Driven by Artificial Intelligence.

Cancers (Basel). 2023 Jul 27;15(15):3808. doi: 10.3390/cancers15153808.

Deep learning-based NLP data pipeline for EHR-scanned document information extraction.

JAMIA Open. 2022 Jun 11;5(2):ooac045. doi: 10.1093/jamiaopen/ooac045. eCollection 2022 Jul.

The role of machine learning in clinical research: transforming the future of evidence generation.

Trials. 2021 Aug 16;22(1):537. doi: 10.1186/s13063-021-05489-x.

Design and development of a web-based registry for Coronavirus (COVID-19) disease.

Med J Islam Repub Iran. 2020 Jun 25;34:68. doi: 10.34171/mjiri.34.68. eCollection 2020.
