Nasseh D, Stausberg J
Daniel Nasseh, Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie, Ludwig-Maximilians-Universität München, Marchioninistraße 15, 81377 Munich, Germany, E-mail:
Methods Inf Med. 2016;55(2):136-43. doi: 10.3414/ME14-01-0087. Epub 2015 Apr 20.
The process of merging data from different sources is referred to as record linkage. In medical settings, stricter privacy requirements demand that clear-text attributes such as first name or date of birth be transformed into one-way encrypted pseudonyms. Automated or privacy-preserving record linkage may therefore require a binary classification that decides whether two records refer to the same entity. Classification is the last of the four main phases of the record linkage process: preprocessing, indexing, matching, and classification. How the choice of binary classification technique depends on project specifications, in particular data quality, has not yet been studied extensively.
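To illustrate what a one-way encrypted pseudonym can look like in practice, the following minimal Python sketch derives a keyed hash from a normalized attribute value. The abstract does not state which scheme the authors used, so HMAC-SHA256, the key `SECRET_KEY`, and the helper names `normalize`, `pseudonymize`, and `same_pseudonym` are assumptions made purely for illustration.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical shared secret; in a real project this would be managed
# by a trusted party and never hard-coded.
SECRET_KEY = b"demo-key-not-for-production"


def normalize(value: str) -> str:
    """Trim, lowercase, and strip diacritics so that trivially different
    spellings of the same value yield the same pseudonym."""
    decomposed = unicodedata.normalize("NFKD", value.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed one-way hash (HMAC-SHA256, hex-encoded) of the
    normalized attribute value."""
    message = normalize(value).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def same_pseudonym(a: str, b: str) -> bool:
    """Compare two clear-text values only through their pseudonyms."""
    return hmac.compare_digest(pseudonymize(a), pseudonymize(b))


if __name__ == "__main__":
    print(same_pseudonym("Müller ", "muller"))  # agree after normalization
    print(same_pseudonym("Meier", "Meyer"))     # one differing letter
```

A design consequence worth noting: because the hash is one-way, two values that differ by a single typo produce unrelated pseudonyms, so only exact agreement can be detected on an attribute pseudonymized in this way.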
The aim of this work is to introduce and evaluate an automatable semi-supervised binary classification system for record linkage that can compete with, or even surpass, advanced automated techniques from the domain of unsupervised classification.
This work describes the rationale behind the model, the final implementation of an automatable semi-supervised binary classification system, and a comparison of its classification performance with that of an advanced active learning approach from the domain of unsupervised learning. The performance of both systems was measured on a broad variety of artificial test sets (n = 400), based on real patient data, each with distinct characteristics.
Measured as F-measure, the classification performance of both methods was close on the test sets with the highest defined data quality (0.996 for semi-supervised and 0.993 for unsupervised classification). It diverged incrementally as data quality decreased, dropping to 0.964 for semi-supervised classification and 0.803 for unsupervised classification.
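For readers unfamiliar with the metric, the sketch below shows how an F-measure can be computed from the match decisions on a set of record pairs. It assumes the balanced form (F1, the harmonic mean of precision and recall); the abstract does not specify the weighting. The function names and the example pair identifiers are hypothetical.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Returns 0.0 when there are no true positives, which avoids a
    division by zero in the degenerate case.
    """
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def f_measure_from_pairs(predicted: set, actual: set) -> float:
    """Compute F1 from the set of record pairs classified as matches
    and the set of pairs that truly refer to the same entity."""
    true_pos = len(predicted & actual)
    false_pos = len(predicted - actual)
    false_neg = len(actual - predicted)
    return f_measure(true_pos, false_pos, false_neg)


if __name__ == "__main__":
    # Hypothetical record-pair identifiers, for illustration only.
    actual_matches = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
    predicted_matches = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
    print(round(f_measure_from_pairs(predicted_matches, actual_matches), 3))
```

Because true non-matches do not enter the formula, the F-measure is not inflated by the very large number of record pairs that are trivially non-matching, which is one reason it is commonly preferred over plain accuracy in record linkage evaluations.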
Besides supplying a viable model of semi-supervised classification for automated probabilistic record linkage, the tests conducted on a large number of test sets suggest that semi-supervised techniques may generally be capable of outperforming unsupervised techniques, especially on data of lower quality.