The OpenDeID corpus for patient de-identification.

Affiliations

School of Population Health, UNSW Sydney, Sydney, Australia.

School of Computer Science and Engineering, UNSW Sydney, Sydney, Australia.

Publication information

Sci Rep. 2021 Oct 7;11(1):19973. doi: 10.1038/s41598-021-99554-9.

Abstract

For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist the development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4,548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings: serial annotations, parallel annotations, and pre-annotations. The quality of the annotations was evaluated for each setting. Our results suggest that the pre-annotations approach is not reliable in terms of quality when compared to the serial annotations but can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients, with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter-annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/07d5/8497517/74dca411fb49/41598_2021_99554_Fig1_HTML.jpg
