Machine learning in medicine: a practical introduction to natural language processing.

Affiliations

Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK.

MD Anderson Center for INSPiRED Cancer Care, Department of Symptom Research, University of Texas MD Anderson Cancer Center, Houston, TX, USA.

Publication information

BMC Med Res Methodol. 2021 Jul 31;21(1):158. doi: 10.1186/s12874-021-01347-1.

DOI:10.1186/s12874-021-01347-1

PMID:34332525

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8325804/

Abstract

BACKGROUND

Unstructured text, including medical records, patient feedback, and social media comments, can be a rich source of data for clinical research. Natural language processing (NLP) describes a set of techniques used to convert passages of written text into interpretable datasets that can be analysed by statistical and machine learning (ML) models. The purpose of this paper is to provide a practical introduction to contemporary techniques for the analysis of text data, using freely available software.
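
To make "converting written text into an interpretable dataset" concrete, the following is a minimal sketch of a bag-of-words document-term matrix, written in Python with the standard library only. It is not the authors' code; the function names and the two example reviews are hypothetical and were invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a passage and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical reviews, invented for illustration only.
reviews = [
    "This drug worked well, no side effects.",
    "Side effects were bad and the drug did not help.",
]
vocab, dtm = document_term_matrix(reviews)
for term, column in zip(vocab, zip(*dtm)):
    print(term, column)
```

Each row of the matrix describes one review and each column one word, which is the kind of numeric table that statistical and ML models can take as input.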

METHODS

We performed three NLP experiments using publicly available data obtained from medicine review websites. First, we conducted lexicon-based sentiment analysis on open-text patient reviews of four drugs: Levothyroxine, Viagra, Oseltamivir and Apixaban. Next, we used unsupervised ML (latent Dirichlet allocation, LDA) to identify similar drugs in the dataset, based solely on their reviews. Finally, we developed three supervised ML algorithms to predict whether a drug review was associated with a positive or negative rating. These algorithms were: a regularised logistic regression, a support vector machine (SVM), and an artificial neural network (ANN). We compared the performance of these algorithms in terms of classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity and specificity.
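
As an illustration of the first experiment, lexicon-based sentiment analysis can be sketched as counting how many words in a review appear in a list of positive words and how many in a list of negative words. The sketch below is in Python and is an assumption-laden toy: the tiny lexicon, the function names, and the example sentences are hypothetical, and the abstract does not say which lexicon the authors used.

```python
import re

# Toy lexicon, invented for illustration; real sentiment lexicons
# contain thousands of scored words.
POSITIVE = {"good", "great", "effective", "better", "helped"}
NEGATIVE = {"bad", "worse", "pain", "nausea", "useless"}


def sentiment_counts(text):
    """Count positive and negative lexicon words in one review."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos, neg


def proportion_positive(reviews):
    """Share of matched sentiment words that are positive; None if nothing matched."""
    pos = neg = 0
    for review in reviews:
        p, n = sentiment_counts(review)
        pos += p
        neg += n
    total = pos + neg
    return pos / total if total else None


# Hypothetical usage: compare the proportion across groups of reviews.
example = ["Great drug, it helped but caused nausea."]
print(sentiment_counts(example[0]), proportion_positive(example))
```

Computing this proportion separately for the reviews of each drug gives a per-drug summary that can be compared across drugs. A lexicon approach of this kind ignores negation and context, which is one reason the supervised models are evaluated separately.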

RESULTS

Levothyroxine and Viagra were reviewed with a higher proportion of positive sentiments than Oseltamivir and Apixaban. One of the three LDA clusters clearly represented drugs used to treat mental health problems. A common theme suggested by this cluster was drugs taking weeks or months to work. Another cluster clearly represented drugs used as contraceptives. Supervised machine learning algorithms predicted positive or negative drug ratings with classification accuracies ranging from 0.664, 95% CI [0.608, 0.716] for the regularised regression to 0.720, 95% CI [0.664, 0.776] for the SVM.
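
The performance measures named above can be sketched from a 2x2 confusion matrix. The Python example below computes accuracy, sensitivity and specificity, plus a normal-approximation (Wald) 95% confidence interval for a proportion. The abstract does not state how its intervals were obtained, so the Wald interval is an assumption made for illustration, and the counts are hypothetical rather than taken from the study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    n = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def wald_interval(p, n, z=1.96):
    """Normal-approximation confidence interval for a proportion p from n cases."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts, invented for illustration only.
metrics = classification_metrics(tp=40, fp=20, tn=30, fn=10)
low, high = wald_interval(metrics["accuracy"], n=100)
print(metrics, round(low, 3), round(high, 3))
```

Other interval methods, such as a bootstrap or an exact binomial interval, can give asymmetric bounds and may differ noticeably from the Wald interval when the test set is small.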

CONCLUSIONS

In this paper, we present a conceptual overview of common techniques used to analyse large volumes of text, and provide reproducible code that can be readily applied to other research studies using open-source software.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e44/8325804/b7c24c28402b/12874_2021_1347_Fig1_HTML.jpg

Similar articles

Machine learning in medicine: a practical introduction to natural language processing.

BMC Med Res Methodol. 2021 Jul 31;21(1):158. doi: 10.1186/s12874-021-01347-1.

Machine learning in medicine: a practical introduction.

BMC Med Res Methodol. 2019 Mar 19;19(1):64. doi: 10.1186/s12874-019-0681-4.

Digital Epidemiology of Prescription Drug References on X (Formerly Twitter): Neural Network Topic Modeling and Sentiment Analysis.

J Med Internet Res. 2024 Aug 23;26:e57885. doi: 10.2196/57885.

Risk prediction using natural language processing of electronic mental health records in an inpatient forensic psychiatry setting.

J Biomed Inform. 2018 Oct;86:49-58. doi: 10.1016/j.jbi.2018.08.007. Epub 2018 Aug 14.

Applying natural language processing and machine learning techniques to patient experience feedback: a systematic review.

BMJ Health Care Inform. 2021 Mar;28(1). doi: 10.1136/bmjhci-2020-100262.

Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach.

BMC Med Inform Decis Mak. 2017 Dec 1;17(1):155. doi: 10.1186/s12911-017-0556-8.

A clinical text classification paradigm using weak supervision and deep representation.

BMC Med Inform Decis Mak. 2019 Jan 7;19(1):1. doi: 10.1186/s12911-018-0723-6.

Comparison of an Ensemble of Machine Learning Models and the BERT Language Model for Analysis of Text Descriptions of Brain CT Reports to Determine the Presence of Intracranial Hemorrhage.

Sovrem Tekhnologii Med. 2024;16(1):27-34. doi: 10.17691/stm2024.16.1.03. Epub 2024 Feb 28.

Artificial Intelligence Learning Semantics via External Resources for Classifying Diagnosis Codes in Discharge Notes.

J Med Internet Res. 2017 Nov 6;19(11):e380. doi: 10.2196/jmir.8344.

Automated Classification of Free-Text Radiology Reports: Using Different Feature Extraction Methods to Identify Fractures of the Distal Fibula.

Rofo. 2023 Aug;195(8):713-719. doi: 10.1055/a-2061-6562. Epub 2023 May 9.

Cited by

Machine Learning Framework for Ovarian Cancer Diagnostics Using Plasma Lipidomics and Metabolomics.

Int J Mol Sci. 2025 Jul 10;26(14):6630. doi: 10.3390/ijms26146630.

Closed circuit artificial intelligence model named morgaf for childhood onset systemic lupus erythematosus diagnosis.

Sci Rep. 2025 Jul 1;15(1):20868. doi: 10.1038/s41598-025-92964-z.

AI in Medical Questionnaires: Innovations, Diagnosis, and Implications.

J Med Internet Res. 2025 Jun 23;27:e72398. doi: 10.2196/72398.

Predicting the Higher Energy Need for Effective Defibrillation Using Machine Learning Based on an Animal Model.

J Clin Med. 2025 May 30;14(11):3879. doi: 10.3390/jcm14113879.

Establishing a Validation Framework of Treatment Discontinuation in Claims Data Using Natural Language Processing and Electronic Health Records.

Clin Pharmacol Ther. 2025 Apr 8. doi: 10.1002/cpt.3650.

A case study on generative artificial intelligence to extract the fundamental sleep parameters from polysomnography notes.

J Clin Sleep Med. 2025 Jun 1;21(6):1123-1127. doi: 10.5664/jcsm.11594.

Externally validated and clinically useful machine learning algorithms to support patient-related decision-making in oncology: a scoping review.

BMC Med Res Methodol. 2025 Feb 21;25(1):45. doi: 10.1186/s12874-025-02463-y.

Machine learning tools match physician accuracy in multilingual text annotation.

Sci Rep. 2025 Feb 14;15(1):5487. doi: 10.1038/s41598-025-89754-y.

Identifying abdominal aortic aneurysm size and presence using Natural Language Processing of radiology reports: a systematic review and meta-analysis.

Abdom Radiol (NY). 2025 Jan 30. doi: 10.1007/s00261-025-04810-5.

Artificial Intelligence for Clinical Management of Male Infertility, a Scoping Review.

Curr Urol Rep. 2024 Nov 9;26(1):17. doi: 10.1007/s11934-024-01239-z.

References

COVID-19 prediction models should adhere to methodological and reporting standards.

Eur Respir J. 2020 Sep 10;56(3). doi: 10.1183/13993003.02643-2020. Print 2020 Sep.

Logistic regression has similar performance to optimised machine learning algorithms in a clinical setting: application to the discrimination between type 1 and type 2 diabetes in young adults.

Diagn Progn Res. 2020 Jun 4;4:6. doi: 10.1186/s41512-020-00075-2. eCollection 2020.

Machine learning algorithms performed no better than regression models for prognostication in traumatic brain injury.

J Clin Epidemiol. 2020 Jun;122:95-107. doi: 10.1016/j.jclinepi.2020.03.005. Epub 2020 Mar 20.

Logistic regression was as good as machine learning for predicting major chronic diseases.

J Clin Epidemiol. 2020 Jun;122:56-69. doi: 10.1016/j.jclinepi.2020.03.002. Epub 2020 Mar 10.

Weakly supervised natural language processing for assessing patient-centered outcome following prostate cancer treatment.

JAMIA Open. 2019 Apr;2(1):150-159. doi: 10.1093/jamiaopen/ooy057. Epub 2019 Jan 4.

Reporting of artificial intelligence prediction models.

Lancet. 2019 Apr 20;393(10181):1577-1579. doi: 10.1016/S0140-6736(19)30037-6.

Machine learning in medicine: a practical introduction.

BMC Med Res Methodol. 2019 Mar 19;19(1):64. doi: 10.1186/s12874-019-0681-4.

A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models.

J Clin Epidemiol. 2019 Jun;110:12-22. doi: 10.1016/j.jclinepi.2019.02.004. Epub 2019 Feb 11.

PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies.

Ann Intern Med. 2019 Jan 1;170(1):51-58. doi: 10.7326/M18-1376.

Diabetes on Twitter: A Sentiment Analysis.

J Diabetes Sci Technol. 2019 May;13(3):439-444. doi: 10.1177/1932296818811679. Epub 2018 Nov 19.
