Verification of De-Identification Techniques for Personal Information Using Tree-Based Methods with Shapley Values.

Authors

Lee Junhak, Jeong Jinwoo, Jung Sungji, Moon Jihoon, Rho Seungmin

Affiliation

Department of Industrial Security, Chung-Ang University, Seoul 06974, Korea.

Publication

J Pers Med. 2022 Jan 31;12(2):190. doi: 10.3390/jpm12020190.

Abstract

With the development of big data and cloud computing technologies, the importance of pseudonym information has grown. However, tools for verifying whether a de-identification methodology has been applied correctly, so that both data confidentiality and usability are ensured, remain insufficient. This paper proposes a method for verifying de-identification techniques for personal healthcare information that considers both data confidentiality and usability. To represent the de-identification techniques as a numeric dataset, data are generated and preprocessed with reference to actual statistical data, personal information datasets, and de-identification datasets based on medical data. Five tree-based regression models (decision tree, random forest, gradient boosting machine, extreme gradient boosting, and light gradient boosting machine) are built on the de-identification dataset to capture nonlinear relationships between the dependent and independent variables. The most effective model is then selected using personal information data in which pseudonym processing is essential for data utilization. Finally, the Shapley additive explanation (SHAP), an explainable artificial intelligence technique, is applied to the most effective model to support the establishment of pseudonym processing policies and to present a machine-learning process that selects an appropriate de-identification methodology.
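To make the Shapley attribution step more concrete, the following is a minimal sketch rather than the authors' pipeline. It computes exact Shapley values by brute-force enumeration of feature coalitions for a tiny hand-written regression tree. The model `toy_tree_model`, the three feature names (masking, rounding, suppression), the thresholds, and the all-zero baseline are hypothetical and chosen only for illustration. Practical SHAP implementations for tree ensembles use much faster algorithms than this enumeration.

```python
from itertools import combinations
from math import factorial


def toy_tree_model(x):
    """Hypothetical regression tree over three numeric features.

    The features (masking, rounding, suppression) and the leaf values
    are illustrative assumptions, not taken from the paper.
    """
    masking, rounding, suppression = x
    if masking > 0.5:
        return 0.9 if rounding > 0.5 else 0.6
    return 0.4 if suppression > 0.5 else 0.1


def shapley_values(model, x, baseline):
    """Exact Shapley values for one input x relative to a baseline.

    Features outside a coalition are replaced by their baseline value.
    The cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


x = [1.0, 1.0, 0.0]          # hypothetical record to explain
baseline = [0.0, 0.0, 0.0]   # hypothetical reference record
phi = shapley_values(toy_tree_model, x, baseline)
print("attributions:", phi)
print("sum of attributions:", sum(phi))
print("prediction gap:", toy_tree_model(x) - toy_tree_model(baseline))
```

In this toy case the attributions come out to roughly 0.65, 0.15, and 0.0, and they sum to the gap between the prediction for `x` and the prediction for the baseline. That additivity is the property that lets per-feature contributions be read as a decomposition of a single prediction.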

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b65/8877642/1ba80ef58b0e/jpm-12-00190-g001.jpg
