Using machine learning for concept extraction on clinical documents from multiple data sources.

Affiliation

Lab of Text Intelligence in Biomedicine, Georgetown University Medical Center, Washington, DC 20007, USA.

Publication information

J Am Med Inform Assoc. 2011 Sep-Oct;18(5):580-7. doi: 10.1136/amiajnl-2011-000155. Epub 2011 Jun 27.

DOI: 10.1136/amiajnl-2011-000155

PMID: 21709161

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3168314/

Abstract

OBJECTIVE

Concept extraction is the process of identifying phrases that refer to concepts of interest in unstructured text. It is a critical component of automated text processing. We investigate the performance of machine learning taggers for clinical concept extraction, particularly the portability of taggers across documents from multiple data sources.
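To make the task concrete, the following is a minimal sketch of how concept extraction is often framed as token-level sequence labeling. The BIO tag scheme, the helper names `spans_to_bio` and `bio_to_spans`, and the example sentence are assumptions made for illustration; the abstract does not state which encoding BioTagger-GM uses. The concept types shown (problem, treatment) are among those annotated in the 2010 i2b2/VA Challenge.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, concept type)
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode annotated concept spans as one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode a BIO tag sequence back into concept spans."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # A "B-" tag always opens a span; a stray "I-" tag is treated leniently as an opener.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


# Hypothetical example sentence, not taken from the challenge corpus.
tokens = "The patient denies chest pain after aspirin".split()
gold_spans = [(3, 5, "problem"), (6, 7, "treatment")]
tags = spans_to_bio(len(tokens), gold_spans)
decoded = bio_to_spans(tags)
```

A tagger trained under this framing predicts one tag per token, and the decoded spans are what get compared against the gold annotations.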

METHODS

We trained machine learning taggers using BioTagger-GM, a system we originally developed to detect gene/protein names in the biology domain. The trained taggers were evaluated on the annotated clinical documents made available for the 2010 i2b2/VA Challenge workshop, which come from four data sources.

RESULTS

As expected, the performance of a tagger trained on one data source degraded when it was evaluated on another source, but the degree of degradation varied with the data sources involved. A tagger trained on multiple data sources was robust, achieving an F score as high as 0.890 on one data source. The results also suggest that the performance of machine learning taggers is likely to improve if more annotated documents are available for training.
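For readers unfamiliar with the metric, here is a minimal sketch of how an F score can be computed for extracted concept spans. It assumes the balanced F1 measure with exact span matching; the abstract does not say which matching criterion the reported 0.890 uses. The function name `f_score` and the example spans are hypothetical.

```python
from typing import Set, Tuple

# (document id, start token, end token exclusive, concept type)
Span = Tuple[str, int, int, str]


def f_score(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), counting a prediction as correct
    only if its boundaries and type exactly match a gold span."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one document.
gold = {
    ("doc1", 3, 5, "problem"),
    ("doc1", 6, 7, "treatment"),
    ("doc1", 10, 12, "test"),
}
predicted = {
    ("doc1", 3, 5, "problem"),    # exact match
    ("doc1", 6, 7, "treatment"),  # exact match
    ("doc1", 10, 11, "test"),     # boundary error, counted as wrong
}
precision, recall, f1 = f_score(gold, predicted)
```

Measuring portability in this setup amounts to computing the same score for a tagger whose training documents and evaluation documents come from different sources.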

CONCLUSION

Our study shows how the performance of machine learning taggers is degraded when they are ported across clinical documents from different sources. The portability of taggers can be enhanced by training on datasets from multiple sources. The study also shows that BioTagger-GM can be easily extended to detect clinical concept mentions with good performance.
