Silent Spring Institute, Newton, Massachusetts, USA.
MIT Media Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
Environ Health Perspect. 2020 Jan;128(1):17008. doi: 10.1289/EHP4817. Epub 2020 Jan 10.
Sharing research data uses resources effectively; enables large, diverse data sets; and supports rigor and reproducibility. However, sharing such data increases privacy risks for participants who may be re-identified by linking study data to outside data sets. These risks have been investigated for genetic and medical records but rarely for environmental data.
We evaluated how data in environmental health (EH) studies may be vulnerable to linkage and we investigated, in a case study, whether environmental measurements could contribute to inferring latent categories (e.g., geographic location), which increases privacy risks.
We identified 12 prominent EH studies, reviewed the data types collected, and evaluated the availability of outside data sets that overlap with study data. With data from the Household Exposure Study in California and Massachusetts and the Green Housing Study in Boston, Massachusetts, and Cincinnati, Ohio, we used k-means clustering and principal component analysis to investigate whether participants' region of residence could be inferred from measurements of chemicals in household air and dust.
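The clustering step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic, the three "chemicals", the two region names, and the helper functions are hypothetical, and the principal component analysis step is omitted. It shows only how an unsupervised clustering of measurement vectors can be scored against known regions of residence.

```python
import random
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two measurement vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=50):
    """Plain k-means with farthest-first initialization (an assumption
    made here for determinism, not something taken from the study)."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def cluster_accuracy(labels, truth):
    """Name each cluster after its majority region, then score the match."""
    correct = 0
    for c in set(labels):
        regions = [t for lab, t in zip(labels, truth) if lab == c]
        correct += Counter(regions).most_common(1)[0][1]
    return correct / len(truth)


# Synthetic, hypothetical data: log-scaled concentrations of three
# chemicals in 40 homes per region, with different regional means.
rng = random.Random(0)
region_means = {"region_A": (0.0, 0.0, 0.0), "region_B": (3.0, 3.0, 3.0)}
points, truth = [], []
for region, mean in region_means.items():
    for _ in range(40):
        points.append(tuple(rng.gauss(m, 0.5) for m in mean))
        truth.append(region)

labels = kmeans(points, k=2)
accuracy = cluster_accuracy(labels, truth)
print(f"share of homes assigned to the cluster of their region: {accuracy:.2f}")
```

The clusters are formed without region labels; the labels are used only afterward to score how well cluster membership tracks region, which is the sense in which a latent category could be inferred from measurements alone.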
All 12 studies included at least two of five data types that overlap with outside data sets: geographic location (9 studies), medical data (9 studies), occupation (10 studies), housing characteristics (10 studies), and genetic data (7 studies). In our cluster analysis, participants' region of residence could be inferred with 80%-98% accuracy using environmental measurements with original laboratory reporting limits.
EH studies frequently include data that are vulnerable to linkage with voter lists, tax and real estate data, professional licensing lists, and ancestry websites, and exposure measurements may be used to identify subgroup membership, increasing likelihood of linkage. Thus, unsupervised sharing of EH research data potentially raises substantial privacy risks. Empirical research can help characterize risks and evaluate technical solutions. Our findings reinforce the need for legal and policy protections to shield participants from potential harms of re-identification from data sharing. https://doi.org/10.1289/EHP4817.
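The linkage risk can be made concrete with a small sketch, assuming a toy table of hypothetical participants and hypothetical field names (none of these records come from the studies reviewed). It counts how many records share each combination of quasi-identifiers; a record whose combination is unique is the kind that could be matched to a single row in an outside data set that carries the same fields.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """For each record, count how many records share its combination
    of quasi-identifier values (the record itself included)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [counts[key] for key in keys]


# Hypothetical study records; all values are invented for illustration.
study = [
    {"id": "P1", "region": "CA", "occupation": "nurse", "year_built": 1950},
    {"id": "P2", "region": "CA", "occupation": "nurse", "year_built": 1950},
    {"id": "P3", "region": "MA", "occupation": "teacher", "year_built": 1985},
    {"id": "P4", "region": "MA", "occupation": "electrician", "year_built": 1985},
]

# With region alone, every participant shares a group with someone else.
sizes_coarse = equivalence_class_sizes(study, ["region"])

# Adding occupation and housing data splits the groups further.
sizes_fine = equivalence_class_sizes(study, ["region", "occupation", "year_built"])

# Records in a group of size 1 are unique on these fields.
unique_ids = [r["id"] for r, size in zip(study, sizes_fine) if size == 1]
print(sizes_coarse, sizes_fine, unique_ids)
```

Each additional overlapping data type, or an inferred category such as region, can only keep these groups the same size or shrink them, which is why combining data types raises the likelihood of linkage.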