Department of Epidemiology and Preventive Medicine, Monash University, Melbourne, Australia.
BMC Med Res Methodol. 2011 Apr 8;11:42. doi: 10.1186/1471-2288-11-42.
Cohort studies can provide valuable evidence of cause and effect relationships but are subject to loss of participants over time, limiting the validity of findings. Computerised record linkage offers a passive and ongoing method of obtaining health outcomes from existing routinely collected data sources. However, the quality of record linkage is reliant upon the availability and accuracy of common identifying variables. We sought to develop and validate a method for linking a cohort study to a state-wide hospital admissions dataset with limited availability of unique identifying variables.
A sample of 2000 participants from a cohort study (n = 41 514) was linked to a state-wide hospitalisations dataset in Victoria, Australia, using the national health insurance (Medicare) number and demographic data as identifying variables. Availability of the health insurance number was limited in both datasets; linkage was therefore undertaken both with and without this number, and agreement between the two algorithms was tested. Sensitivity was calculated for a sub-sample of 101 participants with a hospital admission confirmed by medical record review.
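The two linkage passes described above can be illustrated with a minimal sketch of deterministic (exact-key) matching. It is not the study's algorithm: the field names (`medicare_no`, `sex`, `date_of_birth`, `postcode`), the choice of demographic fields, and the toy records are assumptions made for illustration only.

```python
def make_key(record, use_medicare):
    """Build an exact-match key; return None if any required field is missing."""
    if use_medicare:
        fields = ("medicare_no", "sex")
    else:
        # Hypothetical demographic fields; the abstract does not list them.
        fields = ("sex", "date_of_birth", "postcode")
    values = tuple(record.get(f) for f in fields)
    return None if any(v in (None, "") for v in values) else values


def link(cohort, admissions, use_medicare):
    """Return ids of cohort members sharing a key with at least one admission."""
    index = set()
    for adm in admissions:
        key = make_key(adm, use_medicare)
        if key is not None:
            index.add(key)
    linked = set()
    for person in cohort:
        key = make_key(person, use_medicare)
        if key is not None and key in index:
            linked.add(person["id"])
    return linked


# Toy records (invented for illustration, not study data).
cohort = [
    {"id": 1, "medicare_no": "1111", "sex": "F",
     "date_of_birth": "1950-01-01", "postcode": "3000"},
    {"id": 2, "medicare_no": None, "sex": "M",
     "date_of_birth": "1945-06-15", "postcode": "3121"},
    {"id": 3, "medicare_no": "3333", "sex": "F",
     "date_of_birth": "1960-03-09", "postcode": "3053"},
]
admissions = [
    {"medicare_no": "1111", "sex": "F",
     "date_of_birth": "1950-01-01", "postcode": "3000"},
    {"medicare_no": None, "sex": "M",
     "date_of_birth": "1945-06-15", "postcode": "3121"},
    # A different person who happens to share participant 3's demographics.
    {"medicare_no": "9999", "sex": "F",
     "date_of_birth": "1960-03-09", "postcode": "3053"},
]

with_medicare = link(cohort, admissions, use_medicare=True)
demographics_only = link(cohort, admissions, use_medicare=False)
```

In this toy data the Medicare-based pass misses participant 2, whose number is unavailable, while the demographics-only pass links participant 3 to another person's admission. These are the two failure modes that comparing the passes is meant to expose.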
Of the 2000 study participants, 85% were found to have a record in the hospitalisations dataset when the national health insurance number and sex were used as linkage variables, and 92% when only demographic details were used. When agreement between the two methods was tested, the disagreement fraction was 9%, mainly due to "false positive" links when only demographic details were used. A final algorithm that used multiple combinations of identifying variables resulted in a match proportion of 87%. Sensitivity of this final linkage was 95%.
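The agreement and sensitivity figures can be made concrete with a short sketch. The abstract does not define the disagreement fraction; the code assumes one plausible reading, namely the share of participants whose linked/not-linked status differs between the two methods. Function names and the toy id sets are hypothetical.

```python
def disagreement_fraction(links_a, links_b, all_ids):
    """Share of participants whose link status differs between two methods.

    Assumed definition; the abstract does not spell out its formula.
    """
    disagree = sum((pid in links_a) != (pid in links_b) for pid in all_ids)
    return disagree / len(all_ids)


def sensitivity(linked_ids, confirmed_ids):
    """Proportion of confirmed admissions that the linkage found."""
    true_positives = len(confirmed_ids & linked_ids)
    return true_positives / len(confirmed_ids)


# Toy ids (invented for illustration, not study data).
all_ids = set(range(1, 11))              # ten participants
links_a = {1, 2, 3, 4, 5, 6, 7, 8}       # e.g. insurance number plus sex
links_b = {1, 2, 3, 4, 5, 6, 7, 8, 9}    # e.g. demographic details only
confirmed = {1, 2, 3, 9}                 # admissions confirmed by record review

frac = disagreement_fraction(links_a, links_b, all_ids)
sens = sensitivity(links_a, confirmed)
```

Sensitivity here is measured only against participants with a confirmed admission, mirroring the study's use of a medical-record-reviewed sub-sample as the reference standard.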
High quality record linkage of cohort data with a hospitalisations dataset that has limited identifiers can be achieved using combinations of a national health insurance number and demographic data as identifying variables.