Man vs the machine in the struggle for effective text anonymisation in the age of large language models.

Affiliations

Department of Informatics, University of Piraeus, 80 Karaoli & Dimitriou str, 18534, Piraeus, Greece.

Management Systems Institute of Athena Research Centre, Marousi, Greece.

Publication information

Sci Rep. 2023 Sep 25;13(1):16026. doi: 10.1038/s41598-023-42977-3.

Abstract

The collection and use of personal data are becoming more common in today's data-driven culture. While this brings many advantages, including better decision-making and service delivery, it also raises significant ethical issues around confidentiality and privacy. To alleviate privacy concerns, text anonymisation tries to prune and/or mask identifiable information in a text while keeping the remaining content intact. It is especially important in sectors such as healthcare, law, and research, where sensitive personal information is collected, processed, and exchanged under high legal and ethical standards. Although text anonymisation is widely adopted in practice, it continues to face considerable challenges. The most significant is striking a balance between removing information to protect individuals' privacy and maintaining the text's usability for future purposes. The question is whether these anonymisation methods sufficiently reduce the risk of re-identification, in which an individual can be identified from the information remaining in the text. In this work, we challenge the effectiveness of these methods and how we perceive identifiers. We assess their efficacy against the elephant in the room: the use of AI over big data. While most research focuses on identifying and removing personal information, there is limited discussion of whether the remaining information is sufficient to deanonymise individuals and, more precisely, who can do it. To this end, we conduct an experiment using GPT over anonymised texts of famous people to determine whether such trained networks can deanonymise them. This experiment allows us to revise these methods and introduce a novel methodology that employs Large Language Models to improve the anonymity of texts.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/43bc/10519977/4497accbacfd/41598_2023_42977_Fig1_HTML.jpg
