Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus.

Authors

Stubbs Amber, Uzuner Özlem

Affiliations

School of Library and Information Science, Simmons College, Boston, MA, USA.

Department of Information Studies, State University of New York at Albany, Albany, NY, USA.

Publication information

J Biomed Inform. 2015 Dec;58 Suppl(Suppl):S20-S29. doi: 10.1016/j.jbi.2015.07.020. Epub 2015 Aug 28.

Abstract

The 2014 i2b2/UTHealth natural language processing shared task featured a track focused on the de-identification of longitudinal medical records. For this track, we de-identified a set of 1304 longitudinal medical records describing 296 patients. This corpus was de-identified under a broad interpretation of the HIPAA guidelines using double-annotation followed by arbitration, rounds of sanity checking, and proofreading. The average token-based F1 measure for the annotators compared to the gold standard was 0.927. The resulting annotations were used both to de-identify the data and to set the gold standard for the de-identification track of the 2014 i2b2/UTHealth shared task. All annotated private health information was automatically replaced with realistic surrogates, which were then read over and corrected manually. The resulting corpus is the first of its kind made available for de-identification research. This corpus was first used for the 2014 i2b2/UTHealth shared task, during which the systems achieved a mean F-measure of 0.872 and a maximum F-measure of 0.964 using entity-based micro-averaged evaluations.
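
The abstract reports two kinds of scores: a token-based F1 for the annotators and an entity-based micro-averaged F-measure for the systems. The sketch below illustrates how the two can differ on the same annotations. It is not the official i2b2/UTHealth evaluation script: the span tuple layout, the whitespace tokenization, the function names, and the sample sentence are all assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, Set, Tuple

# Hypothetical span layout (an assumption, not the corpus format):
# (record_id, start_offset, end_offset, phi_type)
Span = Tuple[str, int, int, str]


def micro_prf(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1.

    Counts are pooled over all records and PHI types before the
    ratios are computed, so frequent types weigh more than rare ones.
    """
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def to_token_spans(spans: Iterable[Span], text_by_record: Dict[str, str]) -> Set[Span]:
    """Split each entity span into whitespace-delimited token spans."""
    tokens: Set[Span] = set()
    for record_id, start, end, phi_type in spans:
        segment = text_by_record[record_id][start:end]
        for match in re.finditer(r"\S+", segment):
            tokens.add((record_id, start + match.start(), start + match.end(), phi_type))
    return tokens


# Invented sample text and annotations, for illustration only.
texts = {"rec1": "Seen by Dr. John Smith on 2014-03-02 at Mercy Hospital."}
gold = [
    ("rec1", 12, 22, "DOCTOR"),    # "John Smith"
    ("rec1", 26, 36, "DATE"),      # "2014-03-02"
    ("rec1", 40, 54, "HOSPITAL"),  # "Mercy Hospital"
]
predicted = [
    ("rec1", 12, 16, "DOCTOR"),    # only "John": a partial match
    ("rec1", 26, 36, "DATE"),
    ("rec1", 40, 54, "HOSPITAL"),
]

# Entity-based: a span counts only if its offsets and type match exactly.
entity_scores = micro_prf(gold, predicted)

# Token-based: partial matches earn credit for each token they cover.
token_scores = micro_prf(to_token_spans(gold, texts), to_token_spans(predicted, texts))

print("entity-based F1:", round(entity_scores[2], 3))  # 0.667
print("token-based F1:", round(token_scores[2], 3))    # 0.889
```

In this sample, the partially matched name counts as both a miss and a spurious span under entity-based scoring, while token-based scoring credits the one token that was found. This is one reason the two figures in the abstract are not directly comparable.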
