Real-time Outbreak and Disease Surveillance Laboratory (RODS), Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA; Intelligent Systems Program, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.
Real-time Outbreak and Disease Surveillance Laboratory (RODS), Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.
J Am Med Inform Assoc. 2014 Sep-Oct;21(5):815-23. doi: 10.1136/amiajnl-2013-001934. Epub 2014 Jan 9.
To evaluate factors affecting performance of influenza detection, including accuracy of natural language processing (NLP), discriminative ability of Bayesian network (BN) classifiers, and feature selection.
We derived a testing dataset of 124 influenza patients and 87 non-influenza (shigellosis) patients. To assess NLP finding-extraction performance, we measured the overall accuracy, recall, and precision of Topaz and MedLEE parsers for 31 influenza-related findings against a reference standard established by three physician reviewers. To elucidate the relative contribution of NLP and BN classifier to classification performance, we compared the discriminative ability of nine combinations of finding-extraction methods (expert, Topaz, and MedLEE) and classifiers (one human-parameterized BN and two machine-parameterized BNs). To assess the effects of feature selection, we conducted secondary analyses of discriminative ability using the most influential findings defined by their likelihood ratios.
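The finding-level measures and the likelihood-ratio ranking described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the boolean encoding of findings (present/absent), the choice of the positive likelihood ratio, and the additive smoothing constant are all assumptions made for illustration.

```python
from typing import Dict, List


def extraction_metrics(predicted: List[bool], reference: List[bool]) -> Dict[str, float]:
    """Accuracy, recall, and precision of extracted findings.

    `predicted` holds a parser's present/absent calls and `reference`
    the reference-standard calls for the same finding instances.
    (Hypothetical encoding; the paper does not specify one.)
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)          # true positives
    fp = sum(1 for p, r in pairs if p and not r)      # false positives
    fn = sum(1 for p, r in pairs if not p and r)      # false negatives
    tn = sum(1 for p, r in pairs if not p and not r)  # true negatives
    total = tp + fp + fn + tn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "recall": tp / (tp + fn) if (tp + fn) else nan,
        "precision": tp / (tp + fp) if (tp + fp) else nan,
    }


def positive_likelihood_ratio(present_in_cases: int, n_cases: int,
                              present_in_controls: int, n_controls: int,
                              smoothing: float = 0.5) -> float:
    """P(finding present | case) / P(finding present | control).

    The additive `smoothing` term is an assumption added here to avoid
    division by zero; set it to 0 for the unsmoothed ratio.
    """
    p_case = (present_in_cases + smoothing) / (n_cases + 2 * smoothing)
    p_ctrl = (present_in_controls + smoothing) / (n_controls + 2 * smoothing)
    return p_case / p_ctrl
```

Under these assumptions, findings whose ratio lies far from 1 in either direction would be candidates for the "most influential" subset; the threshold used in the study is not stated in the abstract.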
The overall accuracy of Topaz was significantly better than that of MedLEE (with post-processing) (0.78 vs 0.71, p<0.0001). Classifiers using human-annotated findings were superior to classifiers using Topaz/MedLEE-extracted findings (average area under the receiver operating characteristic curve (AUROC): 0.75 vs 0.68, p=0.0113), and machine-parameterized classifiers were superior to the human-parameterized classifier (average AUROC: 0.73 vs 0.66, p=0.0059). The classifiers using the 17 'most influential' findings were more accurate than classifiers using all 31 subject-matter expert-identified findings (average AUROC: 0.76 vs 0.70, p<0.05).
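For readers unfamiliar with AUROC, the following Python sketch shows one common way to estimate it: the rank-based (Mann-Whitney) estimate, which is the fraction of case/control pairs in which the case receives the higher score. The abstract does not say how the authors computed AUROC, so the estimator choice, the function name, and the 1/0 label encoding are assumptions.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUROC estimate.

    `scores` are classifier outputs (for example, posterior probabilities
    of influenza) and `labels` are 1 for cases and 0 for controls.
    Ties between a case and a control count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of patients, which is acceptable for a test set of a few hundred records; a sort-based version would be preferable for larger data.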
Using a three-component evaluation method, we demonstrated how the relative contributions of individual components can be elucidated within an integrated framework. To improve classification performance, this study encourages researchers to improve NLP accuracy, use a machine-parameterized classifier, and apply feature selection methods.