Using machine learning for concept extraction on clinical documents from multiple data sources.

Affiliation

Lab of Text Intelligence in Biomedicine, Georgetown University Medical Center, Washington, DC 20007, USA.

Publication information

J Am Med Inform Assoc. 2011 Sep-Oct;18(5):580-7. doi: 10.1136/amiajnl-2011-000155. Epub 2011 Jun 27.

Abstract

OBJECTIVE

Concept extraction is the process of identifying phrases that refer to concepts of interest in unstructured text. It is a critical component of automated text processing. We investigate the performance of machine learning taggers for clinical concept extraction, particularly the portability of taggers across documents from multiple data sources.

METHODS

We trained machine learning taggers with BioTagger-GM, a system we originally developed for detecting gene/protein names in the biology domain. The trained taggers were evaluated on the annotated clinical documents made available in the 2010 i2b2/VA Challenge workshop, which come from four data sources.
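The cross-source design described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the source names, the toy sentences, and the memorizing `train` function, which is a stand-in and not BioTagger-GM. It only shows the shape of the experiment: train on one source, test on every source, then repeat with a tagger trained on the pooled data.

```python
from itertools import product

# Hypothetical corpus: two invented sources, each with a train and a test split.
# Labels use a simple B/I/O scheme; "O" marks tokens outside any concept.
corpus = {
    "source_A": {
        "train": [[("chest", "B"), ("pain", "I"), ("noted", "O")]],
        "test": [[("pain", "B"), ("resolved", "O")]],
    },
    "source_B": {
        "train": [[("denies", "O"), ("fever", "B")]],
        "test": [[("fever", "B"), ("persists", "O")]],
    },
}


def train(sentences):
    """Stand-in tagger (NOT BioTagger-GM): memorize tokens seen inside a concept."""
    return {tok for sent in sentences for tok, label in sent if label != "O"}


def evaluate(model, sentences):
    """Token-level F score of the memorizing tagger on labeled sentences."""
    tp = fp = fn = 0
    for sent in sentences:
        for tok, label in sent:
            predicted = tok in model
            actual = label != "O"
            tp += predicted and actual
            fp += predicted and not actual
            fn += actual and not predicted
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


results = {}
# Single-source taggers: every (training source, test source) pair.
for train_src, test_src in product(corpus, repeat=2):
    model = train(corpus[train_src]["train"])
    results[(train_src, test_src)] = evaluate(model, corpus[test_src]["test"])

# Pooled tagger: trained on the training splits of all sources.
pooled = train([s for src in corpus.values() for s in src["train"]])
for test_src in corpus:
    results[("pooled", test_src)] = evaluate(pooled, corpus[test_src]["test"])

for (train_src, test_src), f in sorted(results.items()):
    print(f"{train_src:>8} -> {test_src}: F = {f:.2f}")
```

The toy data is deliberately extreme, so the off-diagonal cells collapse entirely; real taggers degrade by degrees rather than to zero.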

RESULTS

As expected, the performance of a tagger trained on one data source degraded when it was evaluated on another source, but the extent of the degradation varied by data source. A tagger trained on multiple data sources was robust, achieving an F score as high as 0.890 on one data source. The results also suggest that the performance of machine learning taggers is likely to improve if more annotated documents are available for training.
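For readers unfamiliar with the metric, the F score reported above combines precision and recall. The sketch below assumes exact matching on span boundaries and concept type, which the abstract does not specify; the spans and the type labels in the example are invented for illustration.

```python
def f_score(gold, predicted, beta=1.0):
    """Return (precision, recall, F) over sets of (start, end, type) spans.

    A predicted span counts as correct only if its boundaries and type
    both match a gold span exactly (an assumption of this sketch).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Invented example: three gold concepts, one predicted with the wrong type.
gold = [(0, 2, "problem"), (5, 6, "test"), (9, 11, "treatment")]
predicted = [(0, 2, "problem"), (5, 6, "treatment"), (9, 11, "treatment")]

precision, recall, f = f_score(gold, predicted)
print(f"P = {precision:.3f}, R = {recall:.3f}, F = {f:.3f}")
```

With `beta=1.0` the function returns the balanced F1 score, the harmonic mean of precision and recall.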

CONCLUSION

Our study shows how the performance of machine learning taggers degrades when they are ported across clinical documents from different sources. Training on datasets from multiple sources can enhance the portability of taggers. The study also shows that BioTagger-GM can be easily extended to detect clinical concept mentions with good performance.
