De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields.

Authors

Dalianis Hercules, Velupillai Sumithra

Affiliation

Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, 164 40 Kista, Sweden.

Publication information

J Biomed Semantics. 2010 Apr 12;1(1):6. doi: 10.1186/2041-1480-1-6.

DOI:10.1186/2041-1480-1-6

PMID:20618985

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2895734/

Abstract

BACKGROUND

In order to perform research on the information contained in Electronic Patient Records (EPRs), access to the data itself is needed. This is often very difficult due to confidentiality regulations. The data sets need to be fully de-identified before they can be distributed to researchers. De-identification is a difficult task where the definitions of annotation classes are not self-evident.

RESULTS

We present work on the creation of two refined variants of a manually annotated gold standard for de-identification: one created automatically, and one created through discussions among the annotators. The data is a subset of the Stockholm EPR Corpus, a data set available within our research group. The two gold standards are used for the training and evaluation of an automatic system based on the Conditional Random Fields algorithm. Evaluating with four-fold cross-validation on sets of around 4,000-6,000 annotation instances, we obtained very promising results for both gold standards: an F-score of around 0.80 for a number of experiments, with higher results for certain annotation classes. Moreover, 49 of the system's apparent false positives were verified to be true positives that the annotators had missed.
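The evaluation described above (four-fold cross-validation scored by F-score) can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes exact matching on span and class, a simple round-robin fold assignment, and invented toy annotations; the class labels and the helper names `precision_recall_f1` and `four_fold_splits` are hypothetical.

```python
from typing import List, Set, Tuple

# An annotation is (document id, start offset, end offset, class label).
# This representation is an assumption made for the sketch.
Annotation = Tuple[str, int, int, str]


def precision_recall_f1(
    gold: Set[Annotation], predicted: Set[Annotation]
) -> Tuple[float, float, float]:
    """Exact-match scores: a prediction counts only if span and class both match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def four_fold_splits(
    doc_ids: List[str], k: int = 4
) -> List[Tuple[List[str], List[str]]]:
    """Return k (train, test) splits; each document is in a test set exactly once."""
    folds = [doc_ids[i::k] for i in range(k)]  # round-robin assignment
    splits = []
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        splits.append((train, test))
    return splits


# Toy data, invented for illustration only (labels are hypothetical).
gold = {
    ("d1", 0, 4, "First_Name"),
    ("d1", 10, 18, "Location"),
    ("d2", 3, 9, "Date_Part"),
}
predicted = {
    ("d1", 0, 4, "First_Name"),
    ("d2", 3, 9, "Date_Part"),
    ("d2", 20, 25, "Location"),  # counted as a false positive against this gold set
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
splits = four_fold_splits(["d1", "d2", "d3", "d4", "d5"])
```

Under this scoring, a system annotation that is correct but absent from the gold standard lowers precision until the gold standard is corrected, which is the situation the 49 verified instances describe.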

CONCLUSIONS

Our intention is to make this gold standard, the Stockholm EPR PHI Corpus, available to other research groups in the future. Although it is slightly more time-consuming to create, we believe the manual consensus gold standard is the most valuable for further research. We also propose a set of annotation classes to be used for similar de-identification tasks.

