Zuo Zheming, Watson Matthew, Budgen David, Hall Robert, Kennelly Chris, Al Moubayed Noura
Department of Computer Science, Durham University, Durham, United Kingdom.
Cievert Ltd, Newcastle upon Tyne, United Kingdom.
JMIR Med Inform. 2021 Oct 15;9(10):e29871. doi: 10.2196/29871.
Data science offers an unparalleled opportunity to identify new insights into many aspects of human life with recent advances in health care. Using data science in digital health raises significant challenges regarding data privacy, transparency, and trustworthiness. Recent regulations enforce the need for a clear legal basis for collecting, processing, and sharing data, for example, the European Union's General Data Protection Regulation (2016) and the United Kingdom's Data Protection Act (2018). For health care providers, legal use of the electronic health record (EHR) is permitted only in clinical care cases. Any other use of the data requires careful consideration of the legal context and direct patient consent. Identifiable personal and sensitive information must be sufficiently anonymized. Raw data are commonly anonymized for research purposes, with an assessment of reidentification risk and utility. Although health care organizations have internal policies defined for information governance, there is a significant lack of practical tools and intuitive guidance about the use of data for research and modeling. Off-the-shelf data anonymization tools are developed frequently, but their privacy-related functionalities are often not comparable across problem domains. In addition, tools exist to support weighing the reidentification risk of anonymized data against the usefulness of the data, but their efficacy remains in question.
In this systematic literature mapping study, we aim to alleviate the aforementioned issues by reviewing the landscape of data anonymization for digital health care.
We used Google Scholar, Web of Science, Elsevier Scopus, and PubMed to retrieve academic studies published in English up to June 2020. Noteworthy gray literature was also used to initialize the search. We focused on review questions covering 5 bottom-up aspects: basic anonymization operations, privacy models, reidentification risk and usability metrics, off-the-shelf anonymization tools, and the lawful basis for EHR data anonymization.
We identified 239 eligible studies, of which 60 were chosen for general background information; 16 were selected for 7 basic anonymization operations; 104 covered 72 conventional and machine learning-based privacy models; 4 and 19 papers included 7 and 15 metrics, respectively, for measuring the reidentification risk and degree of usability; and 36 explored 20 data anonymization software tools. We also evaluated the practical feasibility of performing anonymization on EHR data with reference to their usability in medical decision-making. Furthermore, we summarized the lawful basis for delivering guidance on practical EHR data anonymization.
This systematic literature mapping study indicates that anonymization of EHR data is theoretically achievable; yet, it requires more research efforts in practical implementations to balance privacy preservation and usability to ensure more reliable health care applications.