No ground truth? No problem: Improving administrative data linking using active learning and a little bit of guile.

Author affiliations

Department of Criminology and Criminal Justice, University of Maryland, College Park, MD, United States of America.

College of Information Studies, University of Maryland, College Park, MD, United States of America.

Publication information

PLoS One. 2023 Apr 4;18(4):e0283811. doi: 10.1371/journal.pone.0283811. eCollection 2023.

DOI:10.1371/journal.pone.0283811

PMID:37014897

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10072450/

Abstract

While linking records across large administrative datasets ("big data") has the potential to revolutionize empirical social science research, many administrative data files do not have common identifiers and are thus not designed to be linked to others. To address this problem, researchers have developed probabilistic record linkage algorithms that use statistical patterns in identifying characteristics to perform linking tasks. Naturally, the accuracy of a candidate linking algorithm can be substantially improved when the algorithm has access to "ground-truth" examples, that is, matches that can be validated using institutional knowledge or auxiliary data. Unfortunately, the cost of obtaining these examples is typically high, often requiring a researcher to manually review pairs of records in order to make an informed judgement about whether they are a match. When a pool of ground-truth information is unavailable, researchers can use "active learning" algorithms for linking, which ask the user to provide ground-truth information for select candidate pairs. In this paper, we investigate the value of providing ground-truth examples via active learning for linking performance. We confirm popular intuition that data linking can be dramatically improved with the availability of ground-truth examples. But critically, in many real-world applications, only a relatively small number of tactically selected ground-truth examples are needed to obtain most of the achievable gains. With a modest investment in ground truth, researchers can approximate the performance of a supervised learning algorithm that has access to a large database of ground-truth examples using a readily available off-the-shelf tool.
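
The active-learning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' method and not the off-the-shelf tool they refer to; it is a toy uncertainty-sampling loop written under simplifying assumptions. Every name here (`similarity`, `fit_threshold`, `active_learn`), the equal feature weights, and the single-threshold model are hypothetical choices made for illustration. Each candidate pair gets a similarity score, the reviewer (the `oracle`) is asked only about the pair whose score lies closest to the current cut-off, and the cut-off is refitted after each answer.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Score a candidate pair in [0, 1] (hypothetical features and weights)."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    year_match = 1.0 if a["birth_year"] == b["birth_year"] else 0.0
    return 0.5 * name_sim + 0.5 * year_match


def fit_threshold(labelled, default=0.5):
    """Put the cut-off midway between the highest-scoring labelled non-match
    and the lowest-scoring labelled match; fall back to a default until
    both classes have been seen."""
    matches = [score for score, is_match in labelled if is_match]
    non_matches = [score for score, is_match in labelled if not is_match]
    if not matches or not non_matches:
        return default
    return (min(matches) + max(non_matches)) / 2


def active_learn(scores, oracle, budget):
    """Uncertainty sampling: spend the labelling budget on the pairs whose
    scores are closest to the current threshold."""
    labelled = {}  # pair index -> reviewer's answer (True means match)
    threshold = fit_threshold([])
    for _ in range(budget):
        unlabelled = [i for i in range(len(scores)) if i not in labelled]
        if not unlabelled:
            break
        query = min(unlabelled, key=lambda i: abs(scores[i] - threshold))
        labelled[query] = oracle(query)  # ask the human reviewer
        threshold = fit_threshold([(scores[i], y) for i, y in labelled.items()])
    return threshold, labelled


if __name__ == "__main__":
    # Synthetic scores and a simulated reviewer, both invented for this demo.
    demo_scores = [0.05, 0.20, 0.35, 0.47, 0.55, 0.62, 0.80, 0.95]
    demo_truth = [s > 0.6 for s in demo_scores]
    cutoff, answers = active_learn(demo_scores, lambda i: demo_truth[i], budget=4)
    print("learned cut-off:", cutoff, "pairs reviewed:", sorted(answers))
```

In this toy setting the reviewer labels only the few pairs nearest the decision boundary, which mirrors the abstract's point that a small number of well-chosen ground-truth examples can capture much of the achievable gain. A real linkage task would use richer features and a proper probabilistic model.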

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3d7d/10072450/2c476792bb7b/pone.0283811.g001.jpg
