COVID-19 Response, Centers for Disease Control and Prevention, Atlanta, GA, USA.
Office of the Chief Operations Officer, Office of the Chief Information Officer, Centers for Disease Control and Prevention, Atlanta, GA, USA.
Public Health Rep. 2021 Sep-Oct;136(5):554-561. doi: 10.1177/00333549211026817. Epub 2021 Jun 17.
Federal open-data initiatives that promote increased sharing of federally collected data are important for transparency, data quality, trust, and relationships with the public and state, tribal, local, and territorial partners. These initiatives advance understanding of health conditions and diseases by providing data to researchers, scientists, and policymakers for analysis, collaboration, and use outside the Centers for Disease Control and Prevention (CDC), particularly for emerging conditions such as COVID-19, for which data needs are constantly evolving. Since the beginning of the pandemic, CDC has collected person-level, de-identified data from jurisdictions and currently has more than 8 million records. We describe how CDC designed and produces 2 de-identified public datasets from these collected data.
We included data elements based on usefulness, public request, and privacy implications; we suppressed some field values to reduce the risk of re-identification and exposure of confidential information. We created datasets and verified them for privacy and confidentiality by using data management platform analytic tools and R scripts.
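As an illustration of suppressing field values to lower re-identification risk, here is a minimal Python sketch of small-cell suppression. It is not CDC's actual procedure, which the abstract attributes to data management platform analytic tools and R scripts. The field names, the threshold `k=5`, the `"NA"` placeholder, and the function `suppress_small_cells` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

SUPPRESSED = "NA"  # hypothetical placeholder for a suppressed value


def suppress_small_cells(records, quasi_identifiers, fields_to_suppress, k=5):
    """Return copies of `records` in which `fields_to_suppress` are blanked
    for every record whose combination of quasi-identifier values is shared
    by fewer than `k` records. The threshold k is an assumption."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    released = []
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        row = dict(r)  # copy so the input records are not modified
        if counts[key] < k:
            for field in fields_to_suppress:
                row[field] = SUPPRESSED
        released.append(row)
    return released


# Hypothetical sample: five records share one combination, one record is unique.
sample = [{"state": "GA", "age_group": "0-9"} for _ in range(5)]
sample.append({"state": "GA", "age_group": "80+"})

released = suppress_small_cells(
    sample,
    quasi_identifiers=["state", "age_group"],
    fields_to_suppress=["age_group"],
    k=5,
)
```

In this sketch the five records that share a combination are released unchanged, while the single unique record has its `age_group` replaced by the placeholder. A production pipeline would also need to re-check the suppressed output, because suppression itself can create new rare combinations.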
Unrestricted data are available to the public through Data.CDC.gov, and restricted data, with additional fields, are available with a data-use agreement through a private repository on GitHub.com.
An enriched understanding of the available public data, the methods used to create these data, and the algorithms used to protect the privacy of de-identified people allows for improved data use. Automating data-generation procedures improves the volume and timeliness of data sharing.