Haddock Beatrix, Pletcher Alix, Blair-Stahn Nathaniel, Keyes Os, Kappel Matt, Bachmeier Steve, Lutze Syl, Albright James, Bowman Alison, Kinuthia Caroline, Burke-Conte Zeb, Mudambi Rajan, Flaxman Abraham
Institute for Health Metrics and Evaluation, University of Washington, Seattle, Washington, 98195, USA.
Gates Open Res. 2024 Oct 18;8:36. doi: 10.12688/gatesopenres.15418.2. eCollection 2024.
Entity resolution (ER) is the process of identifying and linking records that refer to the same real-world entity. ER is a fundamental challenge in data science, and a common barrier to ER research and development is that the data fields used for this fuzzy matching are personally identifiable information, such as name, address, and date of birth. The necessary restrictions on accessing and sharing these authentic data have slowed the development, testing, and adoption of new methods and software for ER. We recently released pseudopeople, a Python package that allows users to generate simulated datasets with configurable noise, approaching the scale and complexity of the data on which large organizations and federal agencies, such as the US Census Bureau, regularly perform ER. With pseudopeople, researchers can develop new algorithms and software for ER of US population data without needing access to personal and confidential information.
We created the simulated population data available for noising with pseudopeople using our Vivarium simulation platform. Our model simulates individuals and their families, households, and employment dynamics over time, which we observe through simulated censuses, surveys, and administrative data collection systems.
Our simulation process produced over 900 gigabytes of simulated censuses, surveys, and administrative data for pseudopeople, representing hundreds of millions of simulants. A sample simulated population of thousands of simulants is now openly available to all users of the pseudopeople package, and large-scale simulated populations of millions and hundreds of millions of simulants are also available by online request through GitHub. These simulated population data are structured for use by the pseudopeople package, which includes additional affordances to add various kinds of noise to the data to provide realistic, sharable challenges for ER researchers.
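To make the idea of configurable noise concrete, the following is a minimal sketch in plain Python. It does not use the pseudopeople API; the function name `add_noise`, the configuration keys `missing_prob` and `typo_prob`, and the two noise types shown (blanking a field and swapping adjacent characters) are assumptions chosen for illustration only, and all field values are assumed to be strings.

```python
import random


def add_noise(records, config, seed=0):
    """Return noised copies of `records` (a list of dicts of strings).

    `config` maps a field name to hypothetical per-field settings:
      missing_prob: probability that the value is replaced by None
      typo_prob:    probability that two adjacent characters are swapped
    """
    rng = random.Random(seed)  # seeded so the noise is reproducible
    noised = []
    for record in records:
        new_record = dict(record)  # copy; the input is left unchanged
        for field, options in config.items():
            value = new_record.get(field)
            if value is None:
                continue
            # Missing-data noise: drop the value entirely.
            if rng.random() < options.get("missing_prob", 0.0):
                new_record[field] = None
                continue
            # Typographic noise: transpose one pair of adjacent characters.
            if len(value) > 1 and rng.random() < options.get("typo_prob", 0.0):
                i = rng.randrange(len(value) - 1)
                new_record[field] = (
                    value[:i] + value[i + 1] + value[i] + value[i + 2:]
                )
        noised.append(new_record)
    return noised


# Made-up example records, not drawn from any real or simulated dataset.
people = [
    {"first_name": "Alex", "last_name": "Rivera", "zipcode": "98195"},
    {"first_name": "Jordan", "last_name": "Lee", "zipcode": "98105"},
]
noise_config = {
    "first_name": {"typo_prob": 0.5},
    "zipcode": {"missing_prob": 0.2},
}
noisy_people = add_noise(people, noise_config, seed=42)
```

Raising or lowering the per-field probabilities makes the linkage task harder or easier, which is the sense in which noise of this kind is configurable. An ER method can then be scored by how well it links each noised record back to its original.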