Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Germany.
J Biomed Inform. 2011 Aug;44(4):648-54. doi: 10.1016/j.jbi.2011.02.008. Epub 2011 Feb 23.
Cleansing data of synonym and homonym errors is an important task in fields where high data quality is crucial, for example in disease registries and medical research networks. Record linkage provides methods for minimizing synonym and homonym errors, thereby improving data quality. We focus on homonym errors (hereafter 'false matches'), in which records belonging to different entities are wrongly classified as equal. Synonym errors ('false non-matches') occur when a single entity maps to multiple records in the linkage result; they are not considered in this study because, in our application domain, they are less critical than false matches. False match rates are frequently determined manually through clerical review, i.e., without modelling the distribution of the false match rate a priori. An exception is the work of Belin and Rubin (1995) [4], who propose estimating the false match rate by means of a normal mixture model that requires training data for calibration. In this paper we present a new approach for estimating the false match rate within the Fellegi-Sunter framework using methods from Extreme Value Theory (EVT). This approach needs no training data for determining the match threshold and therefore leads to a significant cost reduction. After giving two different definitions of the false match rate, we present the EVT tools used in this paper: the generalized Pareto distribution and the mean excess plot. Our experiments with real data show that the model works well, with only slightly lower accuracy than a procedure that has access to the true match status and maximizes accuracy.
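The abstract names the two EVT tools used: the generalized Pareto distribution (GPD) and the mean excess plot. The following Python sketch is a minimal illustration of that general technique on synthetic comparison weights; it is not the authors' procedure, and the data, thresholds, and variable names are all illustrative assumptions.

```python
# Minimal sketch: fit a GPD to the upper tail of record-linkage comparison
# weights and use the mean excess function to guide the threshold choice.
# Synthetic data only -- not the paper's implementation.
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
# Synthetic Fellegi-Sunter-style weights: a large mass of non-matches plus
# a small contaminating component that could produce false matches.
weights = np.concatenate([rng.normal(-5, 2, 9500), rng.normal(4, 1.5, 500)])

# Mean excess function e(u) = E[W - u | W > u]; for a GPD tail this is
# approximately linear in u above a suitable threshold.
def mean_excess(w, thresholds):
    return np.array([(w[w > u] - u).mean() for u in thresholds])

us = np.quantile(weights, np.linspace(0.80, 0.995, 40))
me = mean_excess(weights, us)
for u, e in zip(us[::8], me[::8]):  # print a few points of the plot
    print(f"u = {u:6.2f}   mean excess = {e:5.2f}")

# Choose a tail threshold (here simply a fixed quantile for illustration;
# in practice one reads it off the mean excess plot) and fit the GPD to
# the excesses above it.
u0 = np.quantile(weights, 0.90)
excesses = weights[weights > u0] - u0
shape, loc, scale = genpareto.fit(excesses, floc=0.0)

# Tail probability P(W > t) for a candidate match threshold t; such tail
# estimates feed into an estimate of the false match rate.
t = 2.0
p_exceed = (weights > u0).mean() * genpareto.sf(t - u0, shape, loc=0.0, scale=scale)
print(f"Estimated P(weight > {t}) = {p_exceed:.4f}")
```

In practice, the tail threshold u0 would be chosen from the region where the mean excess plot becomes approximately linear, which is the diagnostic role the abstract assigns to that plot.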