De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields.

Authors

Dalianis Hercules, Velupillai Sumithra

Affiliation

Department of Computer and Systems Sciences (DSV), Stockholm University, Forum 100, 164 40 Kista, Sweden.

Publication

J Biomed Semantics. 2010 Apr 12;1(1):6. doi: 10.1186/2041-1480-1-6.

DOI: 10.1186/2041-1480-1-6
PMID: 20618985
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2895734/
Abstract

BACKGROUND

Research on the information contained in Electronic Patient Records (EPRs) requires access to the data itself, which is often very difficult to obtain because of confidentiality regulations. Data sets must be fully de-identified before they can be distributed to researchers. De-identification is a difficult task in which the definitions of the annotation classes are not self-evident.

RESULTS

We present work on the creation of two refined variants of a manually annotated gold standard for de-identification: one created automatically, and one created through discussions among the annotators. The data is a subset of the Stockholm EPR Corpus, a data set available within our research group. The two gold standards are used for training and evaluating an automatic system based on the Conditional Random Fields algorithm. Evaluating with four-fold cross-validation on sets of around 4,000 to 6,000 annotation instances, we obtained very promising results for both gold standards: an F-score of around 0.80 in a number of experiments, with higher results for certain annotation classes. Moreover, 49 instances that were counted as false positives were verified to be true positives: the system had found them, but the annotators had missed them.
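The evaluation set-up described above (four-fold cross-validation scored by F-score, with system-only hits re-checked by hand) can be illustrated with a minimal sketch. It is not the authors' code: the exact-match criterion, the tuple layout of an annotation, the function names, and the class labels `NAME`, `LOCATION`, and `DATE` are all assumptions made for the example, and the CRF training step itself is left out.

```python
from typing import Iterator, List, Set, Tuple

# Hypothetical annotation layout: (document_id, start_offset, end_offset, class_label).
Annotation = Tuple[str, int, int, str]


def precision_recall_f1(
    gold: Set[Annotation], predicted: Set[Annotation]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall and balanced F-score."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def k_fold_splits(items: List[str], k: int = 4) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train, test) pairs; every item lands in the test part exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


def candidates_for_recheck(
    gold: Set[Annotation], predicted: Set[Annotation]
) -> Set[Annotation]:
    """System annotations absent from the gold standard.

    They count as false positives, but some may be real identifiers
    that the human annotators missed, so they are worth a manual look.
    """
    return predicted - gold


# Toy data, invented for illustration only.
gold = {("doc1", 0, 5, "NAME"), ("doc1", 10, 20, "LOCATION"), ("doc2", 3, 9, "DATE")}
predicted = {("doc1", 0, 5, "NAME"), ("doc2", 3, 9, "DATE"), ("doc2", 15, 22, "LOCATION")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
print(candidates_for_recheck(gold, predicted))

for train_docs, test_docs in k_fold_splits([f"doc{n}" for n in range(1, 9)]):
    print(len(train_docs), len(test_docs))
```

In a real run, a model would be trained on each `train` part and scored on the matching `test` part, and the per-fold scores would then be averaged.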

CONCLUSIONS

We intend to make this gold standard, the Stockholm EPR PHI Corpus, available to other research groups in the future. Although it is slightly more time-consuming to create, we believe the manual consensus gold standard is the most valuable for further research. We also propose a set of annotation classes to be used for similar de-identification tasks.
