A clinical text classification paradigm using weak supervision and deep representation.

Author affiliations

Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st ST SW, Rochester, MN, 55905, USA.

Division of Rheumatology, Department of Medicine, Mayo Clinic, 200 1st ST SW, Rochester, MN, 55905, USA.

Publication information

BMC Med Inform Decis Mak. 2019 Jan 7;19(1):1. doi: 10.1186/s12911-018-0723-6.


DOI:10.1186/s12911-018-0723-6
PMID:30616584
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6322223/
Abstract

BACKGROUND: Automatic clinical text classification is a natural language processing (NLP) technology that unlocks information embedded in clinical narratives. Machine learning approaches have been shown to be effective for clinical text classification tasks. However, a successful machine learning model usually requires extensive human effort to create labeled training data and conduct feature engineering. In this study, we propose a clinical text classification paradigm using weak supervision and deep representation to reduce this human effort.

METHODS: We develop a rule-based NLP algorithm to automatically generate labels for the training data, and then use pre-trained word embeddings as deep representation features for training machine learning models. Since the machine learning models are trained on labels generated by the automatic NLP algorithm, this training process is called weak supervision. We evaluate the paradigm's effectiveness on two institutional case studies at Mayo Clinic, smoking status classification and proximal femur (hip) fracture classification, and on one case study using a public dataset, the i2b2 2006 smoking status classification shared task. We test four widely used machine learning models using this paradigm: Support Vector Machine (SVM), Random Forest (RF), Multilayer Perceptron Neural Networks (MLPNN), and Convolutional Neural Networks (CNN). Precision, recall, and F1 score are used as metrics to evaluate performance.

RESULTS: CNN achieves the best performance in both institutional tasks (F1 score: 0.92 for Mayo Clinic smoking status classification and 0.97 for fracture classification). We show that word embeddings significantly outperform tf-idf and topic modeling features in the paradigm, and that CNN captures additional patterns from the weak supervision compared to the rule-based NLP algorithms. We also observe two drawbacks of the proposed paradigm: CNN is more sensitive to the size of the training data, and the paradigm might not be effective for complex multiclass classification tasks.

CONCLUSION: By leveraging weak supervision and deep representation, the proposed clinical text classification paradigm could reduce the human effort of creating labeled training data and engineering features when applying machine learning to clinical text classification. Experiments on two institutional and one shared clinical text classification tasks have validated the effectiveness of the paradigm.
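
The workflow described in METHODS can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the keyword rules, the two-dimensional embedding values, and the example notes are all invented for illustration, and a nearest-centroid classifier stands in for the SVM, RF, MLPNN, and CNN models tested in the paper. It shows the three steps the abstract names: rules generate weak labels, averaged word embeddings serve as features, and precision, recall, and F1 score are computed against a handful of hypothetical gold labels.

```python
import re

# Toy 2-d "pre-trained" embeddings. Values are invented for illustration only.
EMBEDDINGS = {
    "smokes": (1.0, 0.1), "smoker": (0.9, 0.2), "cigarettes": (0.8, 0.0),
    "pack": (0.7, 0.1), "tobacco": (0.5, 0.3), "quit": (-0.2, 0.6),
    "denies": (-0.9, 0.1), "never": (-1.0, 0.0),
}

# Hypothetical keyword rules standing in for the rule-based NLP algorithm.
POSITIVE = re.compile(r"\b(smokes|smoker|cigarettes)\b", re.I)
NEGATIVE = re.compile(r"\b(denies|never|non-smoker)\b", re.I)


def weak_label(note):
    """Return 1 (smoker), 0 (non-smoker), or None when no rule fires."""
    if NEGATIVE.search(note):  # checked first: "non-smoker" contains "smoker"
        return 0
    if POSITIVE.search(note):
        return 1
    return None


def embed(note):
    """Average the embeddings of known tokens; zero vector if none are known."""
    tokens = re.findall(r"[a-z]+", note.lower())
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def fit_centroids(notes):
    """Train on weak labels only: one mean feature vector per class."""
    sums = {}
    for note in notes:
        label = weak_label(note)
        if label is None:  # notes without a weak label are skipped
            continue
        total, count = sums.get(label, ((0.0, 0.0), 0))
        vec = embed(note)
        sums[label] = (tuple(a + b for a, b in zip(total, vec)), count + 1)
    return {lab: tuple(x / n for x in tot) for lab, (tot, n) in sums.items()}


def predict(centroids, note):
    """Assign the class whose centroid is closest in squared distance."""
    vec = embed(note)
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])),
    )


def precision_recall_f1(gold, pred, positive=1):
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


unlabeled_notes = [
    "Patient smokes one pack of cigarettes daily.",
    "Current smoker, tobacco use for 10 years.",
    "Patient denies tobacco use.",
    "Never smoked cigarettes.",
    "Follow-up for hip pain.",
]
centroids = fit_centroids(unlabeled_notes)

# Hypothetical hand-labeled test notes (gold: 1 = smoker, 0 = non-smoker).
test_notes = [
    "Uses tobacco, one pack daily",
    "Denies cigarettes.",
    "Quit smoking, never relapsed",
    "Non-smoker.",
]
gold = [1, 0, 0, 0]
preds = [predict(centroids, note) for note in test_notes]
precision, recall, f1 = precision_recall_f1(gold, preds)
print(preds, precision, recall, f1)
```

In this toy setup the first test note matches none of the keyword rules, yet the classifier still assigns it to the smoker class because its averaged embedding lies near the smoker centroid. The last note is misclassified because averaging embeddings discards the "non-" prefix, which is one reason a real system would need richer features or a sequence model.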

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b112/6322223/4cd4039566ab/12911_2018_723_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b112/6322223/6b172461318d/12911_2018_723_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b112/6322223/f1ad7545e592/12911_2018_723_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b112/6322223/35ef7a5b000d/12911_2018_723_Fig4_HTML.jpg
