Toward better public health reporting using existing off the shelf approaches: The value of medical dictionaries in automated cancer detection using plaintext medical data.

Author information

Kasthurirathne Suranga N, Dixon Brian E, Gichoya Judy, Xu Huiping, Xia Yuni, Mamlin Burke, Grannis Shaun J

Affiliations

Indiana University School of Informatics and Computing, Indianapolis, IN, USA.

Regenstrief Institute, Indianapolis, IN, USA; Indiana University Richard M. Fairbanks School of Public Health, Indianapolis, IN, USA.

Publication information

J Biomed Inform. 2017 May;69:160-176. doi: 10.1016/j.jbi.2017.04.008. Epub 2017 Apr 12.

DOI: 10.1016/j.jbi.2017.04.008

PMID: 28410983

Abstract

OBJECTIVES

Existing approaches to derive decision models from plaintext clinical data frequently depend on medical dictionaries as the sources of potential features. Prior research suggests that decision models developed using non-dictionary based feature sourcing approaches and "off the shelf" tools could predict cancer with performance metrics between 80% and 90%. We sought to compare non-dictionary based models to models built using features derived from medical dictionaries.

MATERIALS AND METHODS

We evaluated the detection of cancer cases from free-text pathology reports using decision models built with combinations of dictionary or non-dictionary based feature sourcing approaches, 4 feature subset sizes, and 5 classification algorithms. Each decision model was evaluated using the following performance metrics: sensitivity, specificity, accuracy, positive predictive value, and area under the receiver operating characteristic (ROC) curve.
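As an illustration of the five metrics named above, the following is a minimal sketch, not the authors' implementation. It assumes binary labels (1 = cancer, 0 = no cancer), a model that emits a score per report, and a hypothetical decision threshold of 0.5; the function names are invented for this example. It also assumes both classes are present and that at least one report is predicted positive, since otherwise a denominator would be zero.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels where 1 = cancer, 0 = no cancer."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive scores higher than a randomly chosen negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> Dict[str, float]:
    """Compute the five metrics for one decision model.
    The 0.5 threshold is an assumption of this sketch, not a value from the study."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),        # true positive rate
        "specificity": tn / (tn + fp),        # true negative rate
        "accuracy": (tp + tn) / len(y_true),
        "ppv": tp / (tp + fp),                # positive predictive value
        "auc": roc_auc(y_true, scores),
    }


# Toy data invented for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
model_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
metrics = evaluate(labels, model_scores)
```

The pairwise AUC is quadratic in the number of reports, which is fine for a small illustration; a library routine would be the usual choice for a real evaluation.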

RESULTS

Decision models parameterized using dictionary and non-dictionary feature sourcing approaches produced performance metrics between 70% and 90%. The source of features and feature subset size had no impact on the performance of a decision model.

CONCLUSION

Our study suggests there is little value in leveraging medical dictionaries for extracting features for decision model building. Decision models built using features extracted from the plaintext reports themselves achieve comparable results to those built using medical dictionaries. Overall, this suggests that existing "off the shelf" approaches can be leveraged to perform accurate cancer detection using less complex Named Entity Recognition (NER) based feature extraction, automated feature selection and modeling approaches.

Similar articles

Toward better public health reporting using existing off the shelf approaches: A comparison of alternative cancer detection approaches using plaintext medical data and non-dictionary based feature selection.

J Biomed Inform. 2016 Apr;60:145-52. doi: 10.1016/j.jbi.2016.01.008. Epub 2016 Jan 28.

Automatic ICD-10 multi-class classification of cause of death from plaintext autopsy reports through expert-driven feature selection.

PLoS One. 2017 Feb 6;12(2):e0170242. doi: 10.1371/journal.pone.0170242. eCollection 2017.

Evaluating Methods for Identifying Cancer in Free-Text Pathology Reports Using Various Machine Learning and Data Preprocessing Approaches.

Stud Health Technol Inform. 2015;216:1070.

Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources.

J Am Med Inform Assoc. 2015 Sep;22(5):993-1000. doi: 10.1093/jamia/ocv034. Epub 2015 Apr 29.

Alzheimer's disease detection via automatic 3D caudate nucleus segmentation using coupled dictionary learning with level set formulation.

Comput Methods Programs Biomed. 2016 Dec;137:329-339. doi: 10.1016/j.cmpb.2016.09.007. Epub 2016 Sep 28.

Prediction of cause of death from forensic autopsy reports using text classification techniques: A comparative study.

J Forensic Leg Med. 2018 Jul;57:41-50. doi: 10.1016/j.jflm.2017.07.001. Epub 2017 Jul 4.

How to make the most of NE dictionaries in statistical NER.

BMC Bioinformatics. 2008 Nov 19;9 Suppl 11(Suppl 11):S5. doi: 10.1186/1471-2105-9-S11-S5.

Validation of an Improved Computer-Assisted Technique for Mining Free-Text Electronic Medical Records.

JMIR Med Inform. 2017 Jun 29;5(2):e17. doi: 10.2196/medinform.7123.

Building a controlled health vocabulary in Japanese.

Methods Inf Med. 2001;40(4):307-14.

Cited by

Secondary use of health records for prediction, detection, and treatment planning in the clinical decision support system: a systematic review.

BMC Med Inform Decis Mak. 2025 May 16;25(1):190. doi: 10.1186/s12911-025-03021-8.

Generative Adversarial Networks for Creating Synthetic Free-Text Medical Data: A Proposal for Collaborative Research and Re-use of Machine Learning Models.

AMIA Jt Summits Transl Sci Proc. 2021 May 17;2021:335-344. eCollection 2021.

Different approaches to improve cohort identification using electronic health records: X-linked hypophosphatemia as an example.

Intractable Rare Dis Res. 2021 Feb;10(1):17-22. doi: 10.5582/irdr.2020.03123.

Clinical concept extraction: A methodology review.

J Biomed Inform. 2020 Sep;109:103526. doi: 10.1016/j.jbi.2020.103526. Epub 2020 Aug 6.

Generalization of Machine Learning Approaches to Identify Notifiable Conditions from a Statewide Health Information Exchange.

AMIA Jt Summits Transl Sci Proc. 2020 May 30;2020:152-161. eCollection 2020.

Machine Learning Approaches to Identify Nicknames from A Statewide Health Information Exchange.

AMIA Jt Summits Transl Sci Proc. 2019 May 6;2019:639-647. eCollection 2019.

Natural Language Processing of Clinical Notes on Chronic Diseases: Systematic Review.

JMIR Med Inform. 2019 Apr 27;7(2):e12239. doi: 10.2196/12239.
