Alemi Farrokh, Loaiza Francisco, Vang Jee
College of Nursing and Health Sciences, George Mason University, 4400 University Drive, Fairfax, VA 22030, USA.
Health Care Manag Sci. 2007 Feb;10(1):95-104. doi: 10.1007/s10729-006-9002-7.
We show how Bayesian probability models can be used to integrate two databases, one of which does not have a key for uniquely identifying clients (e.g., social security number or medical record number). The analyst selects a set of imperfect identifiers (last visit diagnosis, first name, etc.). The algorithm assesses the likelihood ratio associated with each identifier from the database of known cases. From these likelihood ratios, it estimates the probability that two records belong to the same client. As it examines the various identifiers, it accounts for inter-dependencies among them, which allows overlapping and redundant identifiers to be used. We tested whether the procedure is effective using data from the Medical Expenditure Panel Survey (MEPS) Population Characteristics data set, a publicly available data set. We randomly selected 1,000 cases as a training data set; these constituted the known cases. We then used the algorithm to determine whether each of 100 cases not in the training data set would be correctly classified as a new case rather than as a case in the training set. With 12 fields as identifiers, all 100 cases were correctly classified as new cases. We also selected 100 known cases from the training set and asked the algorithm to classify these cases. Again, all 100 cases were correctly classified. Results were less accurate when the training data set was too small (e.g., fewer than 100 records) or when too few fields were used as identifiers (e.g., fewer than seven fields). In a test of the algorithm's performance, when the ratio of testing to training data set exceeded 4 to 1, the algorithm classified more than 90% of cases correctly. As the ratio increased, accuracy improved further. These data suggest that our automated, mathematical procedure can accurately merge data from two different data sets without a unique identifier.
The algorithm uses imperfect and overlapping clues to re-identify cases from information not typically considered to be a patient identifier.
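The likelihood-ratio step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it multiplies the likelihood ratios as if the identifiers were independent, whereas the published procedure accounts for inter-dependencies among them. The function names, the field names, the toy records, the agreement probability `m = 0.95`, and the use of a value's frequency among known cases as the chance of accidental agreement are all assumptions made for illustration.

```python
from collections import Counter


def value_frequencies(known, fields):
    """Relative frequency of each value of each field among the known cases."""
    n = len(known)
    return {
        f: {v: c / n for v, c in Counter(r[f] for r in known).items()}
        for f in fields
    }


def match_probability(record, candidate, freqs, fields, prior, m=0.95):
    """Posterior probability that `record` and `candidate` are the same client.

    Assumptions (not from the source): identifiers are treated as independent,
    `m` is the chance that a field agrees when the client is the same, and the
    chance of accidental agreement `u` is the value's frequency in the known data.
    """
    odds = prior / (1 - prior)  # prior odds of a match
    for f in fields:
        u = freqs[f].get(candidate[f], 0.0)
        u = min(max(u, 1e-6), 1 - 1e-6)  # keep the ratios finite
        if record[f] == candidate[f]:
            odds *= m / u  # agreement on a rare value is strong evidence
        else:
            odds *= (1 - m) / (1 - u)  # disagreement counts against a match
    return odds / (1 + odds)  # convert posterior odds to a probability


# Hypothetical toy data standing in for the database of known cases.
fields = ["first_name", "birth_year", "diagnosis"]
known = [
    {"first_name": "Ana", "birth_year": 1970, "diagnosis": "asthma"},
    {"first_name": "Ana", "birth_year": 1982, "diagnosis": "diabetes"},
    {"first_name": "Ben", "birth_year": 1970, "diagnosis": "asthma"},
    {"first_name": "Cy", "birth_year": 1991, "diagnosis": "flu"},
]
freqs = value_frequencies(known, fields)
prior = 1 / len(known)  # assumed prior: any known case is equally likely

incoming = {"first_name": "Ana", "birth_year": 1982, "diagnosis": "diabetes"}
p_same = match_probability(incoming, known[1], freqs, fields, prior)
p_diff = match_probability(incoming, known[3], freqs, fields, prior)
print(f"same client: {p_same:.3f}, different client: {p_diff:.5f}")
```

In this sketch a record would be treated as a new case when no known case reaches a chosen probability threshold; the threshold itself is a design choice that the abstract does not specify.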