Rineer James, Kruskamp Nicholas, Kery Caroline, Jones Kasey, Hilscher Rainer, Bobashev Georgiy
RTI International, 3040 Cornwallis Rd., P.O. Box 12194, Research Triangle Park, NC, 27709, USA.
Sci Data. 2025 Jan 25;12(1):144. doi: 10.1038/s41597-025-04380-7.
Geospatially explicit and statistically accurate person and household data allow researchers to study community- and neighborhood-level effects and to design and test hypotheses that could not be tested without synthetic data. In this article, we demonstrate a workflow for generating spatially explicit household- and individual-level synthetic populations for the United States representing the year 2019. We use publicly available 5-year estimates from the 2015-2019 U.S. Census American Community Survey (ACS). We use Iterative Proportional Fitting (IPF) to create our synthetic population, using the resulting joint counts to sample representative households and people directly from microdata. Our dataset contains records for 120,754,708 households and 303,128,287 individuals across the United States. We spatially allocate households using the Environmental Protection Agency (EPA) Integrated Climate and Land Use Scenarios (ICLUS) project household distribution estimates to create a spatially explicit dataset. Our validation shows strong correlation with the original census variables, with many categories reporting a Pearson's r correlation coefficient greater than 0.99.
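To make the IPF step concrete, the following is a minimal two-dimensional sketch, not the authors' implementation. It assumes a seed cross-tabulation taken from microdata and two sets of marginal totals standing in for ACS estimates. The function names `ipf` and `sample_cell`, the example numbers, and the record identifiers are all hypothetical.

```python
import numpy as np


def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Rescale a 2-D seed table until its margins match the targets.

    Assumes row_targets and col_targets sum to the same total.
    """
    table = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Row step: scale each row so its sum equals the row target.
        row_sums = table.sum(axis=1, keepdims=True)
        table *= np.divide(row_targets[:, None], row_sums,
                           out=np.zeros_like(row_sums), where=row_sums > 0)
        # Column step: scale each column so its sum equals the column target.
        col_sums = table.sum(axis=0, keepdims=True)
        table *= np.divide(col_targets[None, :], col_sums,
                           out=np.zeros_like(col_sums), where=col_sums > 0)
        # Columns are exact after the column step, so check the rows.
        if np.allclose(table.sum(axis=1), row_targets, atol=tol):
            break
    return table


def sample_cell(record_ids, count, rng):
    """Draw round(count) microdata records, with replacement, for one cell."""
    return rng.choice(record_ids, size=int(round(count)), replace=True)


# Hypothetical inputs: a microdata cross-tab (e.g. household size by income
# band) and marginal totals for one small geography.
seed = np.array([[1.0, 2.0],
                 [3.0, 4.0]])
row_targets = np.array([40.0, 60.0])
col_targets = np.array([30.0, 70.0])

fitted = ipf(seed, row_targets, col_targets)

# Sample households for the first cell from hypothetical microdata record IDs.
rng = np.random.default_rng(0)
drawn = sample_cell([101, 102, 103], fitted[0, 0], rng)
```

The fitted table keeps the association structure of the seed while matching both sets of margins. Rounding each cell independently, as `sample_cell` does here, is a simplification; a production workflow would need an integerization step that preserves the marginal totals.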