De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields.

Authors

Dalianis Hercules, Velupillai Sumithra

Affiliation

Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, 164 40 Kista, Sweden.

Publication information

J Biomed Semantics. 2010 Apr 12;1(1):6. doi: 10.1186/2041-1480-1-6.

Abstract

BACKGROUND

In order to perform research on the information contained in Electronic Patient Records (EPRs), access to the data itself is needed. This is often very difficult due to confidentiality regulations. The data sets need to be fully de-identified before they can be distributed to researchers. De-identification is a difficult task where the definitions of annotation classes are not self-evident.

RESULTS

We present work on the creation of two refined variants of a manually annotated Gold standard for de-identification: one created automatically, and one created through discussions among the annotators. The data is a subset of the Stockholm EPR Corpus, a data set available within our research group. Both variants are used to train and evaluate an automatic system based on the Conditional Random Fields algorithm. Evaluating with four-fold cross-validation on sets of around 4,000 to 6,000 annotation instances, we obtained very promising results for both Gold standards: an F-score of around 0.80 in a number of experiments, with higher results for certain annotation classes. Moreover, the system found 49 instances that were initially counted as false positives but were verified to be true positives that the annotators had missed.
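As a rough illustration of the evaluation terms used above, the following minimal Python sketch shows one way to build four-fold cross-validation splits and to compute an F-score from true-positive, false-positive, and false-negative counts. It is not the authors' code. The function names and all counts are hypothetical, and it assumes that "F-score" means the balanced F1 measure, which the abstract does not state.

```python
from typing import List, Sequence, Tuple


def four_fold_splits(items: Sequence, k: int = 4) -> List[Tuple[list, list]]:
    """Return k (train, test) pairs; every item is in a test set exactly once."""
    folds = [list(items[i::k]) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train, test))
    return splits


def f_score(tp: int, fp: int, fn: int) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, NOT taken from the paper.
before = f_score(tp=800, fp=200, fn=200)

# If 49 apparent false positives are verified as real instances that the
# annotators missed, they move from fp to tp once the gold standard is fixed.
after = f_score(tp=849, fp=151, fn=200)

print(f"F-score before correction: {before:.3f}")
print(f"F-score after correction:  {after:.3f}")
```

With these made-up counts the score rises after the correction, which shows why annotation errors in a gold standard can make a system look worse than it is.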

CONCLUSIONS

Our intention is to make this Gold standard, the Stockholm EPR PHI Corpus, available to other research groups in the future. Although it is slightly more time-consuming to create, we believe the manual consensus Gold standard is the most valuable for further research. We also propose a set of annotation classes to be used for similar de-identification tasks.
