Department of Criminology and Criminal Justice, University of Maryland, College Park, MD, United States of America.
College of Information Studies, University of Maryland, College Park, MD, United States of America.
PLoS One. 2023 Apr 4;18(4):e0283811. doi: 10.1371/journal.pone.0283811. eCollection 2023.
While linking records across large administrative datasets ("big data") has the potential to revolutionize empirical social science research, many administrative data files lack common identifiers and so were not designed to be linked to others. To address this problem, researchers have developed probabilistic record linkage algorithms, which use statistical patterns in identifying characteristics to perform linking tasks. The accuracy of a candidate linking algorithm can be substantially improved when it has access to "ground-truth" examples: matches that can be validated using institutional knowledge or auxiliary data. Unfortunately, the cost of obtaining these examples is typically high, often requiring a researcher to manually review pairs of records in order to make an informed judgement about whether they are a match. When a pool of ground-truth information is unavailable, researchers can use "active learning" algorithms for linking, which ask the user to provide ground-truth information for selected candidate pairs. In this paper, we investigate how much providing ground-truth examples via active learning improves linking performance. We confirm the popular intuition that data linking can be dramatically improved when ground-truth examples are available. Critically, however, in many real-world applications only a relatively small number of tactically selected ground-truth examples are needed to obtain most of the achievable gains. With a modest investment in ground truth, and using a readily available off-the-shelf tool, researchers can approximate the performance of a supervised learning algorithm that has access to a large database of ground-truth examples.
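The active learning loop described above can be illustrated with a minimal sketch. It is not the tool or the algorithm evaluated in the paper: it assumes a single name field per record, uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, and reduces the linking model to one score threshold. The function names (`similarity`, `most_uncertain`, `update_threshold`, `active_link`) and the `oracle` callback, which plays the role of the human reviewer, are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def most_uncertain(scores, labeled, threshold):
    """Index of the unlabeled pair whose score lies closest to the threshold."""
    unlabeled = [i for i in range(len(scores)) if i not in labeled]
    return min(unlabeled, key=lambda i: abs(scores[i] - threshold))


def update_threshold(scores, labeled, default):
    """Midpoint between the highest labeled non-match and lowest labeled match.

    Falls back to `default` until both classes have at least one label.
    """
    matches = [scores[i] for i, is_match in labeled.items() if is_match]
    non_matches = [scores[i] for i, is_match in labeled.items() if not is_match]
    if not matches or not non_matches:
        return default
    return (max(non_matches) + min(matches)) / 2


def active_link(pairs, oracle, budget, threshold=0.5):
    """Spend `budget` reviewer labels on the most ambiguous candidate pairs.

    `oracle(pair)` is a hypothetical callback returning True for a match.
    Returns (predictions, final threshold, dict of reviewer labels).
    """
    scores = [similarity(a, b) for a, b in pairs]
    labeled = {}
    for _ in range(min(budget, len(pairs))):
        i = most_uncertain(scores, labeled, threshold)
        labeled[i] = oracle(pairs[i])  # ask the reviewer about this pair
        threshold = update_threshold(scores, labeled, default=threshold)
    # Reviewer labels take precedence; other pairs are classified by score.
    predictions = [labeled.get(i, s >= threshold) for i, s in enumerate(scores)]
    return predictions, threshold, labeled
```

Calling `active_link(pairs, oracle, budget=3)` with a list of `(name_a, name_b)` tuples requests at most three labels, each for the pair the current threshold is least sure about. Clear-cut pairs are never sent for review, which is the intuition behind a small, well-chosen set of labels capturing most of the gain.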