Human Language Technology Research Institute, University of Texas at Dallas, Richardson, Texas 75080-0688, USA.
J Am Med Inform Assoc. 2011 Sep-Oct;18(5):568-73. doi: 10.1136/amiajnl-2011-000152. Epub 2011 Jul 1.
This paper describes natural-language-processing techniques for two tasks: identification of medical concepts in clinical text, and classification of assertions, which indicate the existence, absence, or uncertainty of a medical problem. Because so many resources are available for processing clinical texts, there is interest in developing a framework in which features derived from these resources can be optimally selected for the two tasks of interest.
The authors used two machine-learning (ML) classifiers: support vector machines (SVMs) and conditional random fields (CRFs). Because SVMs and CRFs can operate on a large set of features extracted from both clinical texts and external resources, the authors address the following research question: Which features need to be selected for obtaining optimal results? To this end, the authors devise feature-selection techniques which greatly reduce the amount of manual experimentation and improve performance.
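The abstract does not say which selection algorithm the authors used. As an illustration of automated feature selection in general, the following is a minimal sketch of greedy forward selection, which is one common technique and is assumed here. The function names, the feature names, and the toy scoring function are all hypothetical; in practice `evaluate` would train an SVM or CRF with the given feature set and return a held-out or cross-validated F-measure.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def greedy_forward_selection(
    candidates: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Add one feature at a time, keeping the one that improves the score most.

    Stops when no remaining feature improves the score by more than min_gain.
    """
    selected: List[str] = []
    remaining = list(candidates)
    best_score = evaluate(frozenset())  # score with no features selected

    while remaining:
        # Score every candidate when added to the current selection.
        scored = [(evaluate(frozenset(selected + [f])), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score - best_score <= min_gain:
            break  # no candidate helps enough; stop searching
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score

    return selected, best_score


def toy_score(features: FrozenSet[str]) -> float:
    """Hypothetical stand-in for training a classifier and measuring F-measure."""
    gains = {"word": 0.50, "pos_tag": 0.10, "umls_type": 0.20, "noise": -0.05}
    return sum(gains[f] for f in features)


chosen, score = greedy_forward_selection(
    ["word", "pos_tag", "umls_type", "noise"], toy_score
)
print(chosen, round(score, 2))
```

With the toy scores, the harmful `noise` feature is never selected, because adding it lowers the score. The search evaluates far fewer feature subsets than an exhaustive search would, which is the kind of saving in manual experimentation the paragraph above refers to.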
The authors evaluated their approaches on the 2010 i2b2/VA challenge data. Concept extraction achieved a micro-averaged F-measure of 79.59, and assertion classification achieved a micro-averaged F-measure of 93.94.
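For readers unfamiliar with the metric, a micro-averaged F-measure pools true positives, false positives, and false negatives across all classes before computing precision and recall. The sketch below shows that calculation. The per-class counts are invented for illustration and are not the paper's results, and the function name is hypothetical.

```python
from typing import Dict, Tuple


def micro_f_measure(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Compute the micro-averaged F-measure.

    counts maps each class to (true positives, false positives, false negatives).
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())

    # Guard against empty denominators when nothing was predicted or annotated.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for the three 2010 i2b2/VA concept types.
example_counts = {
    "problem": (80, 20, 10),
    "treatment": (50, 10, 20),
    "test": (30, 10, 10),
}
f_score = micro_f_measure(example_counts)
print(f"{100 * f_score:.2f}")
```

Because the counts are pooled, frequent classes influence the micro-averaged score more than rare ones, unlike a macro average, which weights every class equally.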
ML-based approaches to medical concept extraction and assertion classification have the advantage of adapting easily to new data sets and new medical informatics tasks. However, ML-based techniques perform best when optimal features are selected. By devising promising feature-selection techniques, the authors obtain results that outperform the current state of the art.
This paper presents two ML-based approaches for processing language in the clinical texts evaluated in the 2010 i2b2/VA challenge. Their use of novel feature-selection methods makes the techniques presented in this paper unique among the i2b2 participants.