Annotating patient clinical records with syntactic chunks and named entities: the Harvey Corpus.

Authors

Savkov Aleksandar, Carroll John, Koeling Rob, Cassell Jackie

Affiliations

Department of Informatics, University of Sussex, Brighton, BN1 9QJ UK.

Division of Primary Care and Public Health, Brighton and Sussex Medical School, Brighton, BN1 9PH UK.

Publication information

Lang Resour Eval. 2016;50:523-548. doi: 10.1007/s10579-015-9330-7. Epub 2016 Jan 11.

DOI:10.1007/s10579-015-9330-7

PMID:27570501

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4983282/

Abstract

The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult to process by existing natural language analysis tools since they are highly telegraphic (omitting many words), and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning.
