Jaro M A
Match Ware Technologies, Inc., Silver Spring, MD 20905, USA.
Stat Med. 1995;14(5-7):491-8. doi: 10.1002/sim.4780140510.
Probabilistic linkage technology makes it feasible and efficient to link large public health databases in a statistically justifiable manner. The methodology addresses the problem of matching two files of individual data under conditions of uncertainty. Each field is subject to error, which is measured by two quantities: the probability that the field agrees given that a record pair matches (called the m probability), and the probability of chance agreement of its value states (called the u probability). Fellegi and Sunter pioneered record linkage theory. Advances in methodology include the use of an EM algorithm for parameter estimation, optimization of matches by means of a linear sum assignment program, and, more recently, a probability model that addresses both m and u probabilities for all value states of a field. This model provides a means of obtaining greater precision from non-uniformly distributed fields, without the theoretical complications arising from frequency-based matching alone. The model includes an iterative parameter estimation procedure that is more robust than pre-match estimation techniques. The methodology was originally developed and tested by the author at the U.S. Census Bureau for census undercount estimation. The more recent advances and a new generalized software system were tested and validated by linking highway crash records to Emergency Medical Service (EMS) reports and to hospital admission records for the National Highway Traffic Safety Administration (NHTSA).
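To make the roles of the m and u probabilities concrete, the following is a minimal sketch of the standard Fellegi-Sunter agreement and disagreement weights, which the abstract refers to only indirectly. It is not the author's software, and it does not implement the value-specific model or the EM estimation step. The function names, the three example fields, their m and u values, and the use of base-2 logarithms are all assumptions made for illustration, and fields are assumed to agree or disagree independently of one another.

```python
import math


def field_weights(m, u):
    """Return (agreement_weight, disagreement_weight) in bits for one field.

    m: probability the field agrees given the pair is a true match.
    u: probability the field agrees by chance (pair is not a match).
    """
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree


def composite_weight(agreements, m_probs, u_probs):
    """Sum the field weights for one record pair.

    agreements: one boolean per field, True if the field agrees.
    Assumes fields agree or disagree independently of each other.
    """
    total = 0.0
    for agrees, m, u in zip(agreements, m_probs, u_probs):
        agree_w, disagree_w = field_weights(m, u)
        total += agree_w if agrees else disagree_w
    return total


# Hypothetical m and u values for three fields: surname, birth year, sex.
M_PROBS = [0.95, 0.90, 0.98]
U_PROBS = [0.01, 0.05, 0.50]

# A pair agreeing on every field scores high; a pair agreeing only on sex
# scores low, because agreement on sex is likely by chance (u = 0.50).
all_agree = composite_weight([True, True, True], M_PROBS, U_PROBS)
sex_only = composite_weight([False, False, True], M_PROBS, U_PROBS)
print(f"all fields agree: {all_agree:.2f} bits")
print(f"only sex agrees:  {sex_only:.2f} bits")
```

In this sketch a field with a small u contributes a large positive weight when it agrees, which is why rare values carry more evidence of a match than common ones. Pairs are then typically classified by comparing the composite weight against thresholds, and the abstract's linear sum assignment step would be used afterwards to pick a one-to-one set of matches.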