Zuccon Guido, Kotzur Daniel, Nguyen Anthony, Bergheim Anton
The Australian e-Health Research Centre (Commonwealth Scientific and Industrial Research Organisation), Level 5 - UQ Health Sciences Building 901/16, Royal Brisbane and Women's Hospital, Herston, QLD 4029, Australia; School of Information Systems, Queensland University of Technology, Y Block Level 6, Gardens Point Campus, Brisbane, QLD, Australia.
The Australian e-Health Research Centre (Commonwealth Scientific and Industrial Research Organisation), Level 5 - UQ Health Sciences Building 901/16, Royal Brisbane and Women's Hospital, Herston, QLD 4029, Australia.
Artif Intell Med. 2014 Jul;61(3):145-51. doi: 10.1016/j.artmed.2014.03.006. Epub 2014 Apr 3.
To evaluate the effectiveness and robustness of Anonym, a tool for de-identifying free-text health records based on conditional random field classifiers informed by linguistic and lexical features, as well as features extracted by pattern matching techniques. De-identification of personal health information in electronic health records is essential for the sharing and secondary use of clinical data. De-identification tools that adapt to different sources of clinical data are attractive because they would require minimal intervention to maintain high effectiveness.
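As a rough illustration of how lexical and pattern-matching features might be prepared for a conditional random field tagger, the sketch below builds one feature dictionary per token. The feature names, the single date pattern, and the helper functions `token_features` and `sentence_features` are hypothetical; the abstract does not list Anonym's actual feature set, so this should be read as an assumption-laden example rather than the authors' implementation.

```python
import re

# Hypothetical pattern feature: a simple numeric date such as 12/03/2014.
# The abstract does not specify which patterns Anonym actually uses.
DATE_RE = re.compile(r"^\d{1,2}[/-]\d{1,2}[/-]\d{2,4}$")


def token_features(tokens, i):
    """Return a feature dictionary for the token at position i."""
    tok = tokens[i]
    return {
        # Lexical features of the token itself
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "suffix3": tok[-3:].lower(),
        # Pattern-matching feature
        "looks_like_date": bool(DATE_RE.match(tok)),
        # Context features from the neighbouring tokens
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens):
    """Return one feature dictionary per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    example = ["Seen", "by", "Dr", "Smith", "on", "12/03/2014"]
    for tok, feats in zip(example, sentence_features(example)):
        print(tok, feats)
```

A sequence of such dictionaries, paired with per-token labels, is the kind of input that typical CRF toolkits accept for training; which toolkit Anonym uses is not stated in the abstract.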
The effectiveness and robustness of Anonym are evaluated across multiple datasets, including the widely adopted Informatics for Integrating Biology and the Bedside (i2b2) dataset, which was used for evaluation in a de-identification challenge. The datasets used here vary in the type of health records, the source of the data, and data quality, with one dataset containing optical character recognition errors.
Anonym identifies and removes up to 96.6% of personal health identifiers (recall) with a precision of up to 98.2% on the i2b2 dataset, outperforming the best system proposed in the i2b2 challenge. The effectiveness of Anonym across datasets is found to depend on the amount of information available for training.
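For readers less familiar with the two measures reported above, the following minimal sketch computes precision and recall from counts of correctly flagged identifiers (true positives), wrongly flagged tokens (false positives), and missed identifiers (false negatives). The function name `precision_recall` and the counts in the usage example are illustrative assumptions and are not taken from the study.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute (precision, recall) from raw counts.

    precision = TP / (TP + FP): share of flagged items that are real identifiers.
    recall    = TP / (TP + FN): share of real identifiers that were flagged.
    Returns 0.0 for a measure whose denominator is zero.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up counts for illustration only; not results from the paper.
    p, r = precision_recall(true_positives=90, false_positives=10, false_negatives=30)
    print(f"precision={p:.3f} recall={r:.3f}")
```

In de-identification, recall is usually the more safety-critical of the two, since every missed identifier is personal information left in the record.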
The findings show that Anonym performs comparably to the best approach from the 2006 i2b2 shared task. Anonym is easy to retrain on new datasets; once retrained, and provided sufficient training data are available, the system is robust to variations in training size, data type, and data quality.