Vanderbilt University, Nashville, TN.
Vanderbilt University Medical Center, Nashville, TN.
AMIA Annu Symp Proc. 2022 Feb 21;2021:793-802. eCollection 2021.
Numerous studies have shown that a person's health status is closely related to their socioeconomic status. Incorporating socioeconomic data associated with a patient's geographic area of residence into clinical datasets can therefore support medical research. However, most combinations of socioeconomic variables are unique and are tied to small geographic regions (e.g., census tracts) that often contain fewer than 20,000 people. Thus, sharing such tract-level data can violate the Safe Harbor implementation of de-identification under the Health Insurance Portability and Accountability Act of 1996 (HIPAA). In this paper, we introduce a constraint-based k-means clustering approach to generate census tract-level socioeconomic data that is de-identification compliant. Our experimental analysis with data from the American Community Survey illustrates that the approach generates a protected dataset with high similarity to the unaltered values and achieves substantially better data utility than the HIPAA Safe Harbor recommendation of 3-digit ZIP codes.
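The general idea of clustering under a population constraint can be illustrated with a minimal sketch. It is not the algorithm from the paper, whose details the abstract does not give. It assumes a plain k-means pass followed by a greedy merge of undersized clusters, and then replaces each tract's values with its cluster's population-weighted mean. The function names, the toy tract data, and the choice of merge rule are all hypothetical.

```python
import random

# Assumption: a cluster is "undersized" when it holds fewer than 20,000
# people, the figure cited in the abstract.
MIN_POPULATION = 20_000

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weighted_mean(rows, weights):
    """Population-weighted mean of feature rows (weights must sum to > 0)."""
    total = sum(weights)
    return [sum(r[d] * w for r, w in zip(rows, weights)) / total
            for d in range(len(rows[0]))]

def group_by_label(labels):
    """Map each cluster label to the list of tract indices it contains."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    return clusters

def kmeans(features, k, iters=50, seed=0):
    """Plain (unconstrained) k-means; returns one cluster label per tract."""
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(features, k)]
    labels = [0] * len(features)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(f, centroids[j]))
                  for f in features]
        for j in range(k):
            members = [f for f, lab in zip(features, labels) if lab == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def enforce_min_population(features, populations, labels, min_pop=MIN_POPULATION):
    """Greedily merge the smallest undersized cluster into its nearest neighbor."""
    labels = list(labels)
    while True:
        clusters = group_by_label(labels)
        pops = {lab: sum(populations[i] for i in idx)
                for lab, idx in clusters.items()}
        small = [lab for lab in clusters if pops[lab] < min_pop]
        if not small or len(clusters) == 1:
            return labels
        cents = {lab: weighted_mean([features[i] for i in idx],
                                    [populations[i] for i in idx])
                 for lab, idx in clusters.items()}
        src = min(small, key=lambda lab: pops[lab])
        dst = min((lab for lab in clusters if lab != src),
                  key=lambda lab: sq_dist(cents[src], cents[lab]))
        labels = [dst if lab == src else lab for lab in labels]

def generalize(features, populations, labels):
    """Replace each tract's values with its cluster's weighted mean."""
    protected = [None] * len(features)
    for idx in group_by_label(labels).values():
        mean = weighted_mean([features[i] for i in idx],
                             [populations[i] for i in idx])
        for i in idx:
            protected[i] = mean
    return protected

# Hypothetical toy tracts: [median income in $1000s, poverty rate in %].
features = [[42.0, 18.0], [45.0, 16.0], [44.0, 17.0], [80.0, 5.0],
            [83.0, 4.0], [78.0, 6.0], [60.0, 10.0], [62.0, 9.0]]
populations = [4000, 6000, 5000, 7000, 8000, 9000, 3000, 12000]

labels = enforce_min_population(features, populations, kmeans(features, k=4))
protected = generalize(features, populations, labels)
```

In this sketch, every tract in a cluster shares one set of generalized values, so a record can only be traced back to a group of tracts whose combined population meets the threshold. Real inputs would likely need feature standardization before clustering, since the distance is sensitive to the scale of each variable.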