Bell R M, Keesey J, Richards T
RAND, Santa Monica, CA 90407-2138.
Med Care. 1994 Oct;32(10):1004-18.
This paper describes a procedure used to link Medicaid claims data to California vital statistics records for very low birthweight infants. The linkage involved about 53,000 infants born from 1980 to 1987 and 1.46 million claims for delivery/birth-related hospital admissions during the same period. Because the two data files did not share a unique identifier, record linkage required combining evidence across several linking variables: delivery hospital, delivery/birth date or hospitalization period, names, mother's age, and ZIP code. To combine the various pieces of evidence, we used record linkage theory to compute scores that measure the likelihood of a match, i.e., that two records correspond to the same delivery. These scores appropriately weight the various pieces of evidence for or against a match. Implementation required dealing with large amounts of missing data in one of the files, errors and variations in reported names, and the need to minimize the number of incorrect links. The approach applies to a wide range of linkage problems. Combining existing datasets into new ones that contain analysis variables from each source enables analyses that would otherwise be impossible or prohibitively expensive.
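The abstract does not give the scoring formula, so the following is only a minimal sketch of how scores that weight evidence for or against a match are commonly computed in record linkage theory, assuming the standard log-likelihood-ratio (Fellegi-Sunter) form. The field names, the m/u probabilities, the exact-match name comparison, and the choice to give missing values zero weight are all hypothetical illustrations, not the authors' actual parameters or implementation.

```python
import math

# Hypothetical agreement probabilities per linking variable (illustrative only):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELD_PROBS = {
    "hospital": (0.95, 0.01),
    "birth_date": (0.90, 0.003),
    "surname": (0.85, 0.002),
    "mother_age": (0.90, 0.05),
    "zip_code": (0.80, 0.01),
}


def normalize(value):
    """Reduce trivial variation (case, surrounding spaces) in string fields."""
    if isinstance(value, str):
        return value.strip().upper()
    return value


def field_weight(field, a, b):
    """Log2 likelihood ratio contributed by one field comparison."""
    # Assumption: a missing value is neutral evidence and contributes 0.
    if a is None or b is None:
        return 0.0
    m, u = FIELD_PROBS[field]
    if normalize(a) == normalize(b):
        return math.log2(m / u)            # agreement: positive weight
    return math.log2((1 - m) / (1 - u))    # disagreement: negative weight


def match_score(rec_a, rec_b):
    """Sum the field weights; higher scores favor a match."""
    return sum(field_weight(f, rec_a.get(f), rec_b.get(f)) for f in FIELD_PROBS)


# Made-up records: a vital statistics record and a claim with a missing surname.
birth = {"hospital": "H001", "birth_date": "1984-03-12", "surname": "Garcia",
         "mother_age": 27, "zip_code": "90401"}
claim = {"hospital": "H001", "birth_date": "1984-03-12", "surname": None,
         "mother_age": 27, "zip_code": "90404"}

score = match_score(birth, claim)
```

In practice, a pair would be accepted as a link only when its score exceeds a threshold chosen to keep the number of incorrect links low. The threshold is left out here because the abstract does not state one.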