The OpenDeID corpus for patient de-identification.

Affiliations

School of Population Health, UNSW Sydney, Sydney, Australia.

School of Computer Science and Engineering, UNSW Sydney, Sydney, Australia.

Publication information

Sci Rep. 2021 Oct 7;11(1):19973. doi: 10.1038/s41598-021-99554-9.

DOI:10.1038/s41598-021-99554-9

PMID:34620985

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8497517/

Abstract

For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist the development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4,548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings: serial annotations, parallel annotations, and pre-annotations. The quality of the annotations was evaluated for each setting. Our results suggest that the pre-annotations approach is not reliable in terms of quality when compared to the serial annotations but can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter-annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers.
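
The abstract reports an inter-annotator agreement score but does not say how it is computed. As an illustration only, the sketch below assumes agreement is measured as an entity-level F1 over exact span-and-label matches between two annotators, which is a common choice for de-identification corpora. The `Entity` type, the `pairwise_f1` function, and the sample annotations are all hypothetical and are not taken from the paper.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """One annotated PHI span: character offsets plus a category label."""
    start: int
    end: int
    label: str


def pairwise_f1(annotator_a: set[Entity], annotator_b: set[Entity]) -> float:
    """Entity-level F1 between two annotators (assumed metric).

    Only exact matches on start, end, and label count as agreement.
    """
    if not annotator_a and not annotator_b:
        return 1.0  # both annotators agree that there is nothing to mark
    matches = len(annotator_a & annotator_b)
    if matches == 0:
        return 0.0
    precision = matches / len(annotator_b)  # treat annotator A as the reference
    recall = matches / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one report (offsets and labels are made up).
a = {Entity(0, 10, "NAME"), Entity(25, 35, "DATE"), Entity(50, 58, "ID")}
b = {Entity(0, 10, "NAME"), Entity(25, 35, "DATE"), Entity(60, 70, "LOCATION")}

print(round(pairwise_f1(a, b), 4))  # 2 of 3 entities match on each side: 0.6667
```

Because the measure is symmetric in precision and recall, swapping which annotator is treated as the reference gives the same score. A relaxed variant that accepts overlapping spans would report higher agreement than this exact-match version.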

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/07d5/8497517/74dca411fb49/41598_2021_99554_Fig1_HTML.jpg

Similar articles

The OpenDeID corpus for patient de-identification.

Sci Rep. 2021 Oct 7;11(1):19973. doi: 10.1038/s41598-021-99554-9.

OpenDeID Pipeline for Unstructured Electronic Health Record Text Notes Based on Rules and Transformers: Deidentification Algorithm Development and Validation Study.

J Med Internet Res. 2023 Dec 6;25:e48145. doi: 10.2196/48145.

Generation of Surrogates for De-Identification of Electronic Health Records.

Stud Health Technol Inform. 2019 Aug 21;264:70-73. doi: 10.3233/SHTI190185.

Preliminary Evaluation of Fine-Tuning the OpenDeLD Deidentification Pipeline Across Multi-Center Corpora.

Stud Health Technol Inform. 2024 Aug 22;316:719-723. doi: 10.3233/SHTI240515.

De-identification of clinical notes in French: towards a protocol for reference corpus development.

J Biomed Inform. 2014 Aug;50:151-61. doi: 10.1016/j.jbi.2013.12.014. Epub 2013 Dec 29.

Evaluating the effects of machine pre-annotation and an interactive annotation interface on manual de-identification of clinical text.

J Biomed Inform. 2014 Aug;50:162-72. doi: 10.1016/j.jbi.2014.05.002. Epub 2014 May 20.

De-Identifying GRASCCO - A Pilot Study for the De-Identification of the German Medical Text Project (GeMTeX) Corpus.

Stud Health Technol Inform. 2024 Aug 30;317:171-179. doi: 10.3233/SHTI240853.

Automatic de-identification of French electronic health records: a cost-effective approach exploiting distant supervision and deep learning models.

BMC Med Inform Decis Mak. 2024 Feb 16;24(1):54. doi: 10.1186/s12911-024-02422-5.

Annotating German Clinical Documents for De-Identification.

Stud Health Technol Inform. 2019 Aug 21;264:203-207. doi: 10.3233/SHTI190212.

A study of deep learning methods for de-identification of clinical notes in cross-institute settings.

BMC Med Inform Decis Mak. 2019 Dec 5;19(Suppl 5):232. doi: 10.1186/s12911-019-0935-4.

Cited by

Leveraging large language models for the deidentification and temporal normalization of sensitive health information in electronic health records.

NPJ Digit Med. 2025 Aug 13;8(1):517. doi: 10.1038/s41746-025-01921-7.

Privacy preserving strategies for electronic health records in the era of large language models.

NPJ Digit Med. 2025 Jan 16;8(1):34. doi: 10.1038/s41746-025-01429-0.

Automated redaction of names in adverse event reports using transformer-based neural networks.

BMC Med Inform Decis Mak. 2024 Dec 23;24(1):401. doi: 10.1186/s12911-024-02785-9.

Unlocking the Secrets Behind Advanced Artificial Intelligence Language Models in Deidentifying Chinese-English Mixed Clinical Text: Development and Validation Study.

J Med Internet Res. 2024 Jan 25;26:e48443. doi: 10.2196/48443.

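OpenDeID Pipeline for Unstructured Electronic Health Record Text Notes Based on Rules and Transformers: Deidentification Algorithm Development and Validation Study.

J Med Internet Res. 2023 Dec 6;25:e48145. doi: 10.2196/48145.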

Development of an Open-Source Annotated Glaucoma Medication Dataset From Clinical Notes in the Electronic Health Record.

Transl Vis Sci Technol. 2022 Nov 1;11(11):20. doi: 10.1167/tvst.11.11.20.
