Identification of patients' smoking status using an explainable AI approach: a Danish electronic health records case study.

Affiliations

SDU Health Informatics and Technology, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Odense, 5230, Denmark.

Department of Oncology, Lillebaelt Hospital, University Hospital of Southern Denmark, Vejle, 7100, Denmark.

Publication information

BMC Med Res Methodol. 2024 May 17;24(1):114. doi: 10.1186/s12874-024-02231-4.

DOI:10.1186/s12874-024-02231-4
PMID:38760718
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11100078/
Abstract

BACKGROUND

Smoking is a critical risk factor responsible for over eight million deaths worldwide each year. Information on smoking habits is essential for advancing research and implementing preventive measures such as screening of high-risk individuals. In most countries, including Denmark, smoking habits are not systematically recorded and are at best documented in unstructured free-text segments of electronic health records (EHRs). Researchers and clinicians would therefore have to search manually through extensive amounts of unstructured data, which is one of the main reasons that smoking habits are rarely integrated into larger studies. Our aim is to develop machine learning models that classify patients' smoking status from their EHRs.

METHODS

This study proposes an efficient natural language processing (NLP) pipeline that classifies patients' smoking status and explains its decisions. The pipeline comprises four components: (1) preprocessing to handle abbreviations, punctuation, and other textual irregularities; (2) four feature extraction techniques (Embedding, BERT, Word2Vec, and Count Vectorizer) to extract the optimal features; (3) a Stacking-based Ensemble (SE) model and a Convolutional Long Short-Term Memory Neural Network (CNN-LSTM) to identify smoking status; and (4) local interpretable model-agnostic explanations (LIME) to explain the decisions of the detection models. The EHRs of 23,132 patients with suspected lung cancer were collected from the Region of Southern Denmark for the period 1 January 2009 to 31 December 2018. A medical professional annotated the data as 'Smoker' or 'Non-Smoker', with further classification as 'Active-Smoker', 'Former-Smoker', or 'Never-Smoker'. The annotated dataset was then used to develop binary and multiclass classification models, and detection performance was compared extensively across model architectures.
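As a rough illustration of components (1) and (2) only, the following sketch shows text preprocessing followed by count-based feature extraction, written in plain Python. It is not the authors' implementation: the abbreviation map, the example notes, and the function names `preprocess`, `build_vocabulary`, and `count_vector` are all hypothetical, and the English toy notes stand in for the Danish clinical text used in the study.

```python
import re
from collections import Counter

# Hypothetical abbreviation map (assumption); the abstract does not list
# the abbreviations handled in the study.
ABBREVIATIONS = {"pt": "patient", "cig": "cigarettes", "yrs": "years"}


def preprocess(note):
    """Lowercase a note, strip punctuation, and expand known abbreviations."""
    text = note.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return [ABBREVIATIONS.get(tok, tok) for tok in text.split()]


def build_vocabulary(notes):
    """Map every distinct token in the corpus to a column index."""
    tokens = sorted({tok for note in notes for tok in preprocess(note)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(note, vocab):
    """Turn one note into a vector of token counts over the vocabulary."""
    counts = Counter(preprocess(note))
    vector = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:  # tokens unseen during vocabulary building are ignored
            vector[vocab[tok]] = n
    return vector


# Toy notes, invented for illustration only.
corpus = ["Pt. smokes 10 cig/day.", "Pt never smoked."]
vocab = build_vocabulary(corpus)
features = [count_vector(note, vocab) for note in corpus]
```

Vectors of this kind could then be passed to a classifier such as the stacking ensemble or CNN-LSTM named above; those models and the explanation step are outside the scope of this sketch.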

RESULTS

The results of the experimental validation confirm consistency among the models. For binary classification, however, the BERT method with the CNN-LSTM architecture outperformed the other models, achieving precision, recall, and F1-scores between 97% and 99% for both Never-Smokers and Active-Smokers. In multiclass classification, the Embedding technique with the CNN-LSTM architecture yielded the most favorable results in class-specific evaluations, with equal performance measures of 97% for Never-Smoker and measures in the range of 86-89% for Active-Smoker and 91-92% for Never-Smoker.
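The class-specific precision, recall, and F1-scores reported above follow the standard definitions, which can be sketched in a few lines of plain Python. The function name `per_class_metrics` and the toy labels below are hypothetical and invented for illustration; they are not data or results from the study.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 separately for each class label."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy annotations and predictions (made up for this example).
y_true = ["Active-Smoker", "Active-Smoker", "Former-Smoker",
          "Never-Smoker", "Never-Smoker", "Former-Smoker"]
y_pred = ["Active-Smoker", "Former-Smoker", "Former-Smoker",
          "Never-Smoker", "Never-Smoker", "Former-Smoker"]

report = per_class_metrics(y_true, y_pred)
for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

Reporting the three measures per class, rather than a single overall accuracy, makes it visible when one class (here the toy 'Active-Smoker' class) is recognized less reliably than the others.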

CONCLUSION

Our proposed NLP pipeline achieved a high level of classification performance. In addition, we presented the explanation of the decision made by the best performing detection model. Future work will expand the model's capabilities to analyze longer notes and a broader range of categories to maximize its utility in further research and screening applications.
