Machine learning in medicine: a practical introduction to natural language processing.

Affiliations

Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK.

MD Anderson Center for INSPiRED Cancer Care, Department of Symptom Research, University of Texas MD Anderson Cancer Center, Houston, TX, USA.

Publication information

BMC Med Res Methodol. 2021 Jul 31;21(1):158. doi: 10.1186/s12874-021-01347-1.

Abstract

BACKGROUND

Unstructured text, including medical records, patient feedback, and social media comments, can be a rich source of data for clinical research. Natural language processing (NLP) describes a set of techniques used to convert passages of written text into interpretable datasets that can be analysed by statistical and machine learning (ML) models. The purpose of this paper is to provide a practical introduction to contemporary techniques for the analysis of text data, using freely available software.

METHODS

We performed three NLP experiments using publicly available data obtained from medicine review websites. First, we conducted lexicon-based sentiment analysis on open-text patient reviews of four drugs: Levothyroxine, Viagra, Oseltamivir and Apixaban. Next, we used unsupervised ML (latent Dirichlet allocation, LDA) to identify similar drugs in the dataset, based solely on their reviews. Finally, we developed three supervised ML algorithms to predict whether a drug review was associated with a positive or negative rating: a regularised logistic regression, a support vector machine (SVM), and an artificial neural network (ANN). We compared the performance of these algorithms in terms of classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity and specificity.
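The lexicon-based step can be illustrated with a minimal Python sketch. It is not the authors' code: the tiny `LEXICON`, the function names, and the example reviews are all hypothetical, and a real analysis would use a published sentiment lexicon. The idea is simply to look each word up in a dictionary of positive and negative terms and report the share of matched words that are positive.

```python
import re

# Hypothetical toy lexicon (assumption): +1 marks a positive word, -1 a negative one.
LEXICON = {
    "good": 1, "great": 1, "effective": 1, "relief": 1, "better": 1,
    "bad": -1, "worse": -1, "pain": -1, "nausea": -1, "terrible": -1,
}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text):
    """Sum the lexicon values of all tokens; words not in the lexicon count as 0."""
    return sum(LEXICON.get(token, 0) for token in tokenize(text))

def positive_proportion(reviews):
    """Share of lexicon-matched words across all reviews that are positive."""
    values = [LEXICON[t] for r in reviews for t in tokenize(r) if t in LEXICON]
    if not values:
        return 0.0  # no sentiment-bearing words found
    return sum(v > 0 for v in values) / len(values)

# Made-up example reviews, for illustration only.
reviews = ["Great relief, I feel better", "Terrible nausea and pain"]
print(sentiment_score(reviews[0]))
print(positive_proportion(reviews))
```

Comparing `positive_proportion` across groups of reviews, one group per drug, gives the kind of between-drug comparison described above, under the assumption that word-level matching is an adequate proxy for sentiment.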

RESULTS

Levothyroxine and Viagra were reviewed with a higher proportion of positive sentiments than Oseltamivir and Apixaban. One of the three LDA clusters clearly represented drugs used to treat mental health problems. A common theme suggested by this cluster was drugs taking weeks or months to work. Another cluster clearly represented drugs used as contraceptives. Supervised machine learning algorithms predicted positive or negative drug ratings with classification accuracies ranging from 0.664, 95% CI [0.608, 0.716] for the regularised regression to 0.720, 95% CI [0.664, 0.776] for the SVM.
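For readers unfamiliar with these performance measures, the sketch below shows how accuracy, sensitivity, and specificity follow from a binary confusion matrix. The counts are invented for illustration and are not taken from the study. The abstract does not say how its confidence intervals were computed, so the normal-approximation (Wald) interval used here is an assumption.

```python
import math

def classification_metrics(tp, fn, tn, fp, z=1.96):
    """Compute accuracy, sensitivity, specificity and a Wald interval for accuracy.

    tp/fn/tn/fp are confusion-matrix counts with "positive rating" as the
    positive class. z=1.96 gives an approximate 95% interval (assumption:
    normal approximation to the binomial).
    """
    n = tp + fn + tn + fp
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ci_low": accuracy - half_width,
        "ci_high": accuracy + half_width,
    }

# Hypothetical counts for a test set of 100 reviews.
m = classification_metrics(tp=40, fn=10, tn=35, fp=15)
print(m)
```

The width of the interval shrinks with the square root of the test-set size, which is why accuracies estimated on a few hundred reviews carry intervals several percentage points wide.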

CONCLUSIONS

In this paper, we present a conceptual overview of common techniques used to analyse large volumes of text, and provide reproducible code that can be readily applied to other research studies using open-source software.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e44/8325804/b7c24c28402b/12874_2021_1347_Fig1_HTML.jpg
