College of Biomedical Engineering and Instrument Science, Zhejiang University, Zheda Road, 310027 Hangzhou, China; Key Laboratory for Biomedical Engineering, Ministry of Education, China.
College of Biomedical Engineering and Instrument Science, Zhejiang University, Zheda Road, 310027 Hangzhou, China; Key Laboratory for Biomedical Engineering, Ministry of Education, China; School of Industrial Engineering, Eindhoven University of Technology, Eindhoven, the Netherlands.
Comput Methods Programs Biomed. 2019 Nov;181:104840. doi: 10.1016/j.cmpb.2019.01.012. Epub 2019 Jan 31.
Data quality is crucial for clinical registry studies because it determines their credibility. In most such studies, researchers record data on paper-based case report forms (CRFs) and then transcribe them into registry databases, a step that is vulnerable to errors. Ensuring data quality therefore requires verifying the data in the registry. However, traditional manual data verification methods are time-consuming, labor-intensive and of limited effect. Because paper-based CRFs and electronic medical records (EMRs) are two sources available for verification, we propose an automated data verification approach based on optical character recognition (OCR) and information retrieval to identify data errors in a registry more efficiently.
The automated verification approach consists of three steps. First, we analyze the scanned images of paper-based CRFs with machine-learning-enhanced OCR to recognize checkbox marks and handwriting. Then, we retrieve the related patient information from the EMRs using natural language processing (NLP) techniques. Finally, we compare the information obtained in the previous two steps with the data in the registry and synthesize the results accordingly. The proposed automated method was applied in a Chinese registry study, and the difference between the automated and manual approaches was evaluated.
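The third step, comparing the registry against the two sources, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names, the status labels and the rule for combining the CRF and EMR values are all assumptions made for illustration, and the OCR and NLP steps are taken as already having produced plain string values.

```python
from typing import Dict, Optional


def normalize(value: Optional[str]) -> Optional[str]:
    """Trim and lowercase a value; treat empty strings as missing."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned or None


def verify_field(registry: Optional[str],
                 crf: Optional[str],
                 emr: Optional[str]) -> str:
    """Classify one registry field against the CRF and EMR values.

    Hypothetical rule set (an assumption, not taken from the paper):
    - "incomplete":   registry is empty but a source holds a value
    - "suspect":      at least one available source disagrees
    - "consistent":   every available source agrees with the registry
    - "unverifiable": no source value is available
    """
    reg = normalize(registry)
    sources = [v for v in (normalize(crf), normalize(emr)) if v is not None]
    if not sources:
        return "unverifiable"
    if reg is None:
        return "incomplete"
    if all(v == reg for v in sources):
        return "consistent"
    return "suspect"


def verify_record(registry: Dict[str, str],
                  crf: Dict[str, str],
                  emr: Dict[str, str]) -> Dict[str, str]:
    """Apply verify_field to every field present in the registry record."""
    return {
        field: verify_field(value, crf.get(field), emr.get(field))
        for field, value in registry.items()
    }


# Made-up example record; the field names are hypothetical.
example_registry = {"sex": "Male", "diabetes": "", "smoker": "yes", "height": "170"}
example_crf = {"sex": "male", "diabetes": "yes", "smoker": "no"}
example_emr = {"sex": "male ", "smoker": "yes"}
example_result = verify_record(example_registry, example_crf, example_emr)
```

In this sketch a field is flagged as soon as any source disagrees, which favors recall over precision; a real system would also need to weigh the lower reliability of handwriting recognition before flagging a value.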
The automated approach was implemented in the Chinese Coronary Artery Disease Registry. For CRF data recognition, the recognition accuracies for checkbox marks and handwriting are 0.93 and 0.74, respectively. For EMR data extraction, the accuracy of information retrieval from textual electronic medical records is 0.97. The accuracy, recall and time consumption of the automated approach are 0.93, 0.96 and 0.5 h, respectively, better than the corresponding values of the manual approach (0.92, 0.71 and 7.5 h).
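The abstract does not state how accuracy and recall were computed. Assuming the standard definitions over a confusion matrix of flagged versus truly erroneous data items, they can be sketched as follows. The counts in the example are invented purely to exercise the functions and are not the study's data; only the two time values come from the abstract.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all items classified correctly (errors and non-errors)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of true data errors that the verification flagged."""
    actual_errors = tp + fn
    return tp / actual_errors if actual_errors else 0.0


# Invented counts for illustration only (not from the study):
# 45 errors found, 5 errors missed, 45 correct items passed, 5 false alarms.
demo_accuracy = accuracy(tp=45, tn=45, fp=5, fn=5)
demo_recall = recall(tp=45, fn=5)

# Time values reported in the abstract: 7.5 h manual vs 0.5 h automated.
manual_hours, automated_hours = 7.5, 0.5
speedup = manual_hours / automated_hours
```

Under these definitions, the reported gap in recall (0.96 versus 0.71) means the automated approach missed far fewer true errors, while the similar accuracy values (0.93 versus 0.92) indicate a comparable overall rate of correct decisions. The reported times correspond to a 15-fold reduction.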
Compared with the manual data verification approach, the automated approach improves the recall of identifying data errors, achieves a slightly higher accuracy, and takes far less time. The results show that the automated approach is more effective and efficient at identifying incomplete and incorrect data in a registry. The proposed approach has the potential to improve the quality of registry data.