Song Xing, Waitman Lemuel R, Hu Yong, Luo Bo, Li Fengjun, Liu Mei
University of Kansas Medical Center, Department of Internal Medicine, Division of Medical Informatics, Kansas City, KS, USA.
Jinan University, Big Data Decision Institute, Guangzhou, PRC.
AMIA Jt Summits Transl Sci Proc. 2020 May 30;2020:617-625. eCollection 2020.
Artificial intelligence-enabled analysis of medical big data has the potential to revolutionize medical practice, from the diagnosis and prediction of complex diseases to evidence-based recommendations and resource allocation decisions. However, big data comes with big disclosure risks. To preserve privacy, excessive data anonymization is often necessary, leading to significant loss of data utility. In this paper, we develop a systematic data scrubbing procedure for large datasets in which the key variables for re-identification risk assessment are uncertain. We then assess the trade-off between anonymizing electronic health record data for sharing in support of open science and the performance of machine learning models that use the data for early prediction of acute kidney injury risk. Results demonstrate that the proposed data scrubbing procedure can maintain good feature diversity and moderate data utility, but it raises concerns regarding its impact on knowledge discovery capability.