Toward better public health reporting using existing off the shelf approaches: The value of medical dictionaries in automated cancer detection using plaintext medical data.

Author information

Kasthurirathne Suranga N, Dixon Brian E, Gichoya Judy, Xu Huiping, Xia Yuni, Mamlin Burke, Grannis Shaun J

Affiliations

Indiana University School of Informatics and Computing, Indianapolis, IN, USA.

Regenstrief Institute, Indianapolis, IN, USA; Indiana University Richard M. Fairbanks School of Public Health, Indianapolis, IN, USA.

Publication information

J Biomed Inform. 2017 May;69:160-176. doi: 10.1016/j.jbi.2017.04.008. Epub 2017 Apr 12.

Abstract

OBJECTIVES

Existing approaches to derive decision models from plaintext clinical data frequently depend on medical dictionaries as the sources of potential features. Prior research suggests that decision models developed using non-dictionary based feature sourcing approaches and "off the shelf" tools could predict cancer with performance metrics between 80% and 90%. We sought to compare non-dictionary based models to models built using features derived from medical dictionaries.

MATERIALS AND METHODS

We evaluated the detection of cancer cases from free text pathology reports using decision models built with combinations of dictionary or non-dictionary based feature sourcing approaches, 4 feature subset sizes, and 5 classification algorithms. Each decision model was evaluated using the following performance metrics: sensitivity, specificity, accuracy, positive predictive value, and area under the receiver operating characteristics (ROC) curve.
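As a rough illustration of the five metrics named above, the sketch below computes them for a binary classifier from toy labels and scores. The function names, the 0.5 decision threshold, and the data are assumptions made for this example and are not taken from the study; the AUC is computed with the standard pairwise-ranking (Mann-Whitney) formulation.

```python
from typing import Dict, Sequence


def safe_div(num: float, den: float) -> float:
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, accuracy and PPV from binary labels (1 = cancer)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": safe_div(tp, tp + fn),  # true positive rate
        "specificity": safe_div(tn, tn + fp),  # true negative rate
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "ppv": safe_div(tp, tp + fp),          # positive predictive value
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive case gets the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return safe_div(wins, len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold
    print(classification_metrics(y_true, y_pred))
    print(roc_auc(y_true, scores))
```

Sensitivity, specificity, accuracy, and PPV depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is one reason the two kinds of metric are typically reported together.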

RESULTS

Decision models parameterized using dictionary and non-dictionary feature sourcing approaches produced performance metrics between 70% and 90%. The source of features and feature subset size had no impact on the performance of a decision model.

CONCLUSION

Our study suggests there is little value in leveraging medical dictionaries for extracting features for decision model building. Decision models built using features extracted from the plaintext reports themselves achieve comparable results to those built using medical dictionaries. Overall, this suggests that existing "off the shelf" approaches can be leveraged to perform accurate cancer detection using less complex Named Entity Recognition (NER) based feature extraction, automated feature selection and modeling approaches.
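The contrast between the two feature sourcing approaches can be sketched on toy data. This is a minimal illustration and not the study's pipeline: the example reports, the three-term "dictionary", the simple tokenizer, and the ranking score (absolute difference in per-class document frequency) are all assumptions made for this example, and no NER step is included.

```python
import re
from collections import Counter
from typing import List, Sequence, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def to_matrix(reports: Sequence[str], vocab: Sequence[str]) -> List[List[int]]:
    """Binary presence/absence matrix: one row per report, one column per term."""
    token_sets = [set(tokenize(r)) for r in reports]
    return [[int(term in tokens) for term in vocab] for tokens in token_sets]


def dictionary_features(reports: Sequence[str], terms: Set[str]) -> Tuple[List[str], List[List[int]]]:
    """Dictionary-based sourcing: candidate features are a fixed term list."""
    vocab = sorted(terms)
    return vocab, to_matrix(reports, vocab)


def corpus_features(reports: Sequence[str], labels: Sequence[int], k: int) -> Tuple[List[str], List[List[int]]]:
    """Non-dictionary sourcing: candidates are all tokens found in the reports;
    keep the k tokens whose document frequency differs most between classes."""
    pos_df: Counter = Counter()
    neg_df: Counter = Counter()
    for text, label in zip(reports, labels):
        (pos_df if label == 1 else neg_df).update(set(tokenize(text)))
    n_pos = max(sum(labels), 1)
    n_neg = max(len(labels) - sum(labels), 1)
    score = {t: abs(pos_df[t] / n_pos - neg_df[t] / n_neg) for t in set(pos_df) | set(neg_df)}
    vocab = sorted(score, key=lambda t: (-score[t], t))[:k]  # ties broken alphabetically
    return vocab, to_matrix(reports, vocab)


if __name__ == "__main__":
    # Toy reports and labels (1 = cancer), invented for illustration only.
    reports = [
        "invasive ductal carcinoma identified",
        "malignant cells present consistent with adenocarcinoma",
        "benign tissue no malignancy identified",
        "no evidence of tumor benign polyp",
    ]
    labels = [1, 1, 0, 0]
    print(dictionary_features(reports, {"carcinoma", "adenocarcinoma", "tumor"}))
    print(corpus_features(reports, labels, k=2))
```

On this toy data the corpus-derived vocabulary picks up terms such as "benign" and "no" that a disease-name dictionary would not contain, which hints at how features drawn from the reports themselves can stand in for dictionary terms. Either feature matrix could then be passed to any standard classifier.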
