Department of Radiation Oncology, Peter MacCallum Cancer Centre, Melbourne, Victoria, Australia.
J Am Med Inform Assoc. 2014 Jan-Feb;21(1):27-30. doi: 10.1136/amiajnl-2013-002090. Epub 2013 Aug 6.
This study aimed to reduce reliance on large training datasets in support vector machine (SVM)-based clinical text analysis by grouping keyword features into categories. An enhanced Mayo smoking status detection pipeline was deployed. We used a corpus of 709 annotated patient narratives. The pipeline was optimized for local data entry practice and lexicon. The SVM classifier was retrained using a grouped keyword approach for better efficiency. Accuracy, precision, and F-measure of the unaltered and optimized pipelines were evaluated using k-fold cross-validation. Initial accuracy of the clinical Text Analysis and Knowledge Extraction System (cTAKES) package was 0.69. Localization and keyword grouping improved system accuracy to 0.90 and 0.92, respectively. F-measures for the current and past smoker classes improved from 0.43 to 0.81 and from 0.71 to 0.91, respectively. Non-smoker and unknown-class F-measures were 0.96 and 0.98, respectively. Keyword grouping had no negative effect on performance and decreased training time. Grouping keywords is a practical method to reduce training corpus size.
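The keyword grouping idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation or the cTAKES feature extractor: the group names, the keyword lists, and the helper functions `tokenize`, `ungrouped_features`, and `grouped_features` are all hypothetical, since the abstract does not list the actual keywords or categories. The sketch only shows how mapping many keywords onto a few group features shrinks the feature vector that a classifier such as an SVM would be trained on.

```python
import re
from collections import Counter

# Hypothetical keyword groups (assumption: the real lexicon and
# categories used in the study are not given in the abstract).
KEYWORD_GROUPS = {
    "ACTIVE_USE": ["smokes", "smoker", "current", "continues"],
    "CESSATION": ["quit", "ceased", "stopped", "ex-smoker", "former"],
    "NEGATION": ["never", "denies", "non-smoker", "nonsmoker"],
}

# Reverse lookup: keyword -> group name.
KEYWORD_TO_GROUP = {
    kw: group for group, kws in KEYWORD_GROUPS.items() for kw in kws
}
GROUP_ORDER = sorted(KEYWORD_GROUPS)   # fixed order of grouped features
VOCABULARY = sorted(KEYWORD_TO_GROUP)  # fixed order of ungrouped features


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens, including hyphenated ones."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def ungrouped_features(text):
    """One count per individual keyword (one dimension per keyword)."""
    counts = Counter(tokenize(text))
    return [counts[kw] for kw in VOCABULARY]


def grouped_features(text):
    """One count per keyword group (one dimension per group)."""
    counts = Counter(
        KEYWORD_TO_GROUP[tok] for tok in tokenize(text) if tok in KEYWORD_TO_GROUP
    )
    return [counts[group] for group in GROUP_ORDER]


if __name__ == "__main__":
    note_a = "Patient quit smoking in 2005."
    note_b = "Ex-smoker."
    print(len(ungrouped_features(note_a)), len(grouped_features(note_a)))
    print(grouped_features(note_a), grouped_features(note_b))
```

In this sketch, two notes that use different wording from the same group ("quit" and "ex-smoker") map to the same grouped vector, whereas their ungrouped vectors differ. That is one plausible reading of why a classifier trained on grouped features might need fewer annotated examples: it does not have to see every individual keyword during training.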