Lin Wen-Yang, Yang Duen-Chuan, Wang Jie-Teng
Department of Computer Science and Information Engineering, National University of Kaohsiung, Nanzih District, Kaohsiung, 811, Taiwan, R.O.C.
BMC Med Inform Decis Mak. 2016 Jul 18;16 Suppl 1(Suppl 1):58. doi: 10.1186/s12911-016-0293-4.
To facilitate long-term safety surveillance of marketed drugs, many spontaneous reporting systems (SRSs) for ADR events have been established worldwide. Because the data collected by SRSs contain sensitive personal health information that must be protected to prevent the identification of individuals, they raise the issue of privacy-preserving data publishing (PPDP), that is, how to sanitize (anonymize) raw data before publishing. Although much work has been done on PPDP, very few studies have focused on protecting the privacy of SRS data, and none of the existing anonymization methods is well suited to SRS datasets, which exhibit characteristics such as rare events, multiple records per individual, and multi-valued sensitive attributes.
We propose a new privacy model called MS(k, θ*)-bounding for protecting published spontaneous ADE reporting data from privacy attacks. Our model offers the flexibility of varying privacy thresholds, i.e., θ*, for different sensitive values and takes the characteristics of SRS data into consideration. We also propose an anonymization algorithm for sanitizing the raw data to meet the requirements specified through the proposed model. Our algorithm adopts a greedy clustering strategy to group records into clusters, guided by an innovative anonymization metric that aims to minimize the privacy risk while maintaining the data utility for ADR detection. An empirical study was conducted using the FAERS dataset from 2004Q1 to 2011Q4. We compared our model with four prevailing methods, namely k-anonymity, (X, Y)-anonymity, Multi-sensitive l-diversity, and (α, k)-anonymity, evaluated via two measures, Danger Ratio (DR) and Information Loss (IL), and considered three different scenarios of threshold setting for θ*: uniform setting, level-wise setting, and frequency-based setting. We also conducted experiments to inspect the impact of the anonymized data on the strengths of discovered ADR signals.
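The abstract does not spell out the formal definition of MS(k, θ*)-bounding; the sketch below is only one plausible reading of the per-cluster check such a model implies, where a cluster (equivalence class) must contain at least k records and no sensitive value may occur in more than a θ fraction of them. The data layout, the per-value threshold dictionary theta, and the function name are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def satisfies_ms_k_theta(cluster, k, theta, default_theta=1.0):
    """Illustrative check (an assumption, not the paper's exact definition):
    a cluster of anonymized records is acceptable if it holds at least k
    records and, for every sensitive value s occurring in it, the fraction
    of records carrying s does not exceed the per-value threshold theta[s]."""
    if len(cluster) < k:
        return False
    counts = Counter()
    for record in cluster:
        # Each record may carry several sensitive values (e.g., multiple ADRs),
        # reflecting the multi-valued sensitive attributes mentioned above.
        for s in record["sensitive_values"]:
            counts[s] += 1
    return all(
        count / len(cluster) <= theta.get(s, default_theta)
        for s, count in counts.items()
    )
```

Under this reading, a greedy clustering algorithm would keep adding the "closest" record (by some generalization-cost metric) to a growing cluster until the check above passes, then start a new cluster.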
Under all three threshold settings for sensitive values, our method can successfully prevent the disclosure of sensitive values (nearly all observed DRs are zero) without sacrificing much data utility. With a non-uniform threshold setting, level-wise or frequency-based, our MS(k, θ*)-bounding exhibits the best data utility and the least privacy risk among all the models. The experiments conducted on selected ADR signals from MedWatch show that only very small differences in signal strength (PRR or ROR) were observed. These results show that our method can effectively prevent the disclosure of sensitive patient information without sacrificing data utility for ADR signal detection.
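For reference, PRR and ROR are the standard disproportionality statistics computed from a 2x2 drug-event contingency table; the short function below shows the usual textbook formulas. The cell names a, b, c, d follow the common convention and are not taken from the paper.

```python
def prr_and_ror(a, b, c, d):
    """Standard disproportionality measures from a 2x2 table:
    a = reports with the drug and the event, b = drug without the event,
    c = other drugs with the event,        d = other drugs without the event.
    Assumes all denominators are non-zero (a sketch, not production code)."""
    prr = (a / (a + b)) / (c / (c + d))   # proportional reporting ratio
    ror = (a * d) / (b * c)               # reporting odds ratio
    return prr, ror
```

Comparing these values on the raw versus the anonymized FAERS counts is one way to quantify the "very small differences in signal strength" reported above.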
We propose a new privacy model for protecting SRS data, which possess characteristics overlooked by contemporary models, together with an anonymization algorithm that sanitizes SRS data in accordance with the proposed model. Empirical evaluation on a real SRS dataset, FAERS, shows that our method can effectively address the privacy problem in SRS data without affecting ADR signal strength.