O'Rourke M K, Fernandez L M, Bittel C N, Sherrill J L, Blackwell T S, Robbins D R
Environmental and Occupational Health Unit of the Arizona Prevention Center, The University of Arizona, Tucson 85721-0468, USA.
J Expo Anal Environ Epidemiol. 1999 Sep-Oct;9(5):471-84. doi: 10.1038/sj.jea.7500043.
Data entry and management are critical components of all large survey projects; data quality objectives must be met and data must be quickly and readily accessible. We developed a comprehensive system for data entry and management using scannable forms with bubble fields and handwriting recognition. This 'Mass Data Massage' (MDM) system had three components: (1) form creation and database definition; (2) programming of data dictionaries for documentation and preliminary logic and range checks; and (3) data entry, management and documentation using the 'Mass Data Cleaning Program' (MDCP). Scannable forms were written in Teleform, where data field definitions, variable names and ranges were specified as the form was created. Completed forms were returned from the field, subjected to final field quality control (QC) checks, and transferred to the data management section. They were batched and coded as necessary. Once a batch of data was scanned and visually verified, the operator called up the menu for the MDCP. The MDCP had 31 program modules with 500-1200 lines of code each. The operator could select and run the appropriate dictionary on each data batch, 'correcting' apparent errors in responses. This process was repeated until the data batch passed all dictionary checks. Proposed 'changes' were forwarded to the data coordinator (DC) for acceptance or rejection. After all errors had been resolved, each data batch was subjected to a 10% quality assurance (QA) check. The original data batch and associated file of applied changes were archived. Time expenditure using the scanning approach varied with the number of questions and the types of responses (handwritten or bubble fields). One-page forms took 42-60% of the time needed for hand entry; forms longer than 10 pages took 35-38% of the time. Use of faster machines will further speed the process. The main advantage of the system was the reduction of systematic errors.
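The dictionary step described above (range and logic checks that flag apparent errors for review rather than silently overwriting them) can be illustrated with a minimal Python sketch. This is not the MDCP code, which the abstract does not show; the names `RangeRule`, `LogicRule`, `Flag` and `run_dictionary`, as well as the questionnaire fields and limits, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class RangeRule:
    """Hypothetical range check: the field must lie within [low, high]."""
    field: str
    low: float
    high: float


@dataclass(frozen=True)
class LogicRule:
    """Hypothetical logic check: `holds` returns True for a consistent record."""
    description: str
    holds: Callable[[dict], bool]


@dataclass(frozen=True)
class Flag:
    """An apparent error, to be reviewed before any change is applied."""
    record_id: str
    field: str
    value: Any
    reason: str


def run_dictionary(batch, range_rules, logic_rules):
    """Return every apparent error in the batch; an empty list means it passes."""
    flags = []
    for record in batch:
        rid = record["id"]
        for rule in range_rules:
            value = record.get(rule.field)
            if value is None:
                flags.append(Flag(rid, rule.field, value, "missing value"))
            elif not rule.low <= value <= rule.high:
                reason = f"outside {rule.low}-{rule.high}"
                flags.append(Flag(rid, rule.field, value, reason))
        for rule in logic_rules:
            if not rule.holds(record):
                flags.append(Flag(rid, "*", None, rule.description))
    return flags


# Assumed example fields and limits; not taken from the NHEXAS forms.
RANGE_RULES = [RangeRule("age", 0, 120), RangeRule("cigarettes_per_day", 0, 100)]
LOGIC_RULES = [
    LogicRule(
        "non-smoker reports cigarettes",
        lambda r: r.get("smoker") != "no" or r.get("cigarettes_per_day") == 0,
    )
]

batch = [
    {"id": "A001", "age": 34, "smoker": "no", "cigarettes_per_day": 0},
    {"id": "A002", "age": 340, "smoker": "yes", "cigarettes_per_day": 10},
    {"id": "A003", "age": 51, "smoker": "no", "cigarettes_per_day": 5},
]
flags = run_dictionary(batch, RANGE_RULES, LOGIC_RULES)
```

In this sketch the operator would resolve each flag, rerun `run_dictionary`, and repeat until it returns an empty list, which mirrors the iterative pass described in the abstract. Keeping the flags as a separate list, instead of editing records in place, leaves a file of proposed changes that a coordinator can accept or reject.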
Scanning alone reduced errors found on 995 NHEXAS Baseline Questionnaires. Overall, the dictionary identified 0.55% errors on the scanned forms. Ten percent QC checks, performed on corrected batches ready for appending to the master database, revealed an overall error rate of 0.02%. Similar checks on a laboratory form scanned from numeric handwriting detected 0.3% errors following dictionary application and 0.2% errors during the 10% QA check. This system was faster, more accurate, and more cost-effective than hand entry of data. A batch of data that took >1 week to process using the hand entry method was processed within 1 day using MDM. Human coding of specific answers and the final verification were the most time-consuming processes.
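As an illustration of how a 10% check might yield an error rate like those reported above, the following Python sketch draws a random 10% sample of records and compares the scanned values with values verified against the paper forms. It rests on assumptions: the abstract does not state how the sample was drawn or whether the percentages are per field or per form, so the per-field rate here, and the names `qa_sample` and `field_error_rate`, are hypothetical.

```python
import math
import random


def qa_sample(record_ids, fraction=0.10, seed=0):
    """Pick a reproducible random sample of record IDs (at least one if any exist)."""
    ids = list(record_ids)
    # round() guards against float noise such as 30 * 0.1 == 3.0000000000000004
    wanted = math.ceil(round(len(ids) * fraction, 9))
    k = min(len(ids), max(1, wanted))
    return sorted(random.Random(seed).sample(ids, k))


def field_error_rate(scanned, verified, sample_ids):
    """Fraction of checked fields whose scanned value differs from the verified one.

    `scanned` and `verified` map record ID -> {field name: value}; `verified`
    holds the values read off the original paper forms (assumed workflow).
    """
    checked = 0
    errors = 0
    for rid in sample_ids:
        for field, true_value in verified[rid].items():
            checked += 1
            if scanned[rid].get(field) != true_value:
                errors += 1
    return errors / checked if checked else 0.0
```

Fixing the seed makes the sample reproducible, so the same records can be re-examined if the check is disputed; a production system might instead record the drawn IDs alongside the archived batch.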