Dalianis Hercules, Velupillai Sumithra
Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, 164 40 Kista, Sweden.
J Biomed Semantics. 2010 Apr 12;1(1):6. doi: 10.1186/2041-1480-1-6.
In order to perform research on the information contained in Electronic Patient Records (EPRs), access to the data itself is needed. This is often very difficult due to confidentiality regulations. The data sets need to be fully de-identified before they can be distributed to researchers. De-identification is a difficult task where the definitions of annotation classes are not self-evident.
We present work on the creation of two refined variants of a manually annotated gold standard for de-identification: one created automatically, and one created through discussions among the annotators. The data is a subset of the Stockholm EPR Corpus, a data set available within our research group. The two gold standards are used for the training and evaluation of an automatic system based on the Conditional Random Fields algorithm. Evaluating with four-fold cross-validation on sets of around 4,000 to 6,000 annotation instances, we obtained very promising results for both gold standards: an F-score of around 0.80 in a number of experiments, with higher results for certain annotation classes. Moreover, 49 instances that were initially counted as false positives proved on verification to be true positives, which the system had found but the annotators had missed.
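As a rough illustration of the evaluation terms used above, the following Python sketch shows how an F-score is derived from true positive, false positive, and false negative counts, and how data can be partitioned for four-fold cross-validation. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names, the interleaved splitting scheme, and the counts in the usage lines are all hypothetical.

```python
from typing import List, Sequence, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def k_fold_splits(items: Sequence, k: int = 4) -> List[Tuple[list, list]]:
    """Partition items into k folds and return (train, test) pairs.

    Assumption: folds are built by simple interleaving; the source does not
    say how its folds were drawn.
    """
    folds = [list(items[i::k]) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
    print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")

    # Four folds over eight hypothetical document identifiers.
    for train, test in k_fold_splits(list(range(8)), k=4):
        print("train:", train, "test:", test)
```

Note that when a system prediction counted as a false positive is later verified as correct, the gold standard gains an annotation and that prediction moves from the false positive count to the true positive count, which raises precision computed against the corrected gold standard.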
Our intention is to make this gold standard, the Stockholm EPR PHI Corpus, available to other research groups in the future. Although it is slightly more time-consuming to produce, we believe the manual consensus gold standard is the most valuable for further research. We also propose a set of annotation classes to be used for similar de-identification tasks.