

Identifying the Perceived Severity of Patient-Generated Telemedical Queries Regarding COVID: Developing and Evaluating a Transfer Learning-Based Solution.

Author Information

Gatto Joseph, Seegmiller Parker, Johnston Garrett, Preum Sarah Masud

Affiliations

Department of Computer Science, Dartmouth College, Hanover, NH, United States.

Publication Information

JMIR Med Inform. 2022 Sep 2;10(9):e37770. doi: 10.2196/37770.

Abstract

BACKGROUND

Triage of textual telemedical queries is a safety-critical task for medical service providers with limited remote health resources. Prioritizing patient queries that describe medically severe conditions is necessary to optimize resource usage and provide care to those with time-sensitive needs.

OBJECTIVE

We aim to evaluate the effectiveness of transfer learning solutions on the task of telemedical triage and provide a thorough error analysis, identifying telemedical queries that challenge state-of-the-art natural language processing (NLP) systems. Additionally, we aim to provide a publicly available telemedical query data set labeled for severity classification to support telemedical triage of respiratory issues.

METHODS

We annotated 573 medical queries from 3 online health platforms: HealthTap, HealthcareMagic, and iCliniq. We then evaluated 6 transfer learning solutions utilizing various text-embedding strategies. Specifically, we first established a baseline using a lexical classification model with term frequency-inverse document frequency (TF-IDF) features. Next, we investigated the effectiveness of global vectors for word representation (GloVe), a pretrained word-embedding method. We evaluated the performance of GloVe embeddings in the context of support vector machines (SVMs), bidirectional long short-term memory (bi-LSTM) networks, and hierarchical attention networks (HANs). Finally, we evaluated the performance of contextual text embeddings using transformer-based architectures. Specifically, we evaluated bidirectional encoder representations from transformers (BERT), Bio+Clinical-BERT, and Sentence-BERT (SBERT) on the telemedical triage task.
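To make the modeling setup concrete, the following is a minimal sketch of two of the evaluated embedding strategies (a TF-IDF lexical baseline and frozen SBERT sentence embeddings, each paired with an SVM), assuming scikit-learn and sentence-transformers are available. The example queries, labels, and the all-mpnet-base-v2 checkpoint are illustrative placeholders rather than the paper's exact configuration.

```python
# Minimal sketch (not the paper's exact configuration): a TF-IDF lexical
# baseline and a frozen-SBERT pipeline, each feeding an SVM classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sentence_transformers import SentenceTransformer

# Hypothetical telemedical queries with binary severity labels (1 = severe).
queries = [
    "I have had a high fever and shortness of breath for three days.",
    "Can I take my usual allergy medication if I tested positive last week?",
]
labels = [1, 0]

# 1) Lexical baseline: TF-IDF features + linear SVM.
lexical_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams (assumed)
    SVC(kernel="linear"),
)
lexical_model.fit(queries, labels)

# 2) Contextual embeddings: frozen SBERT sentence vectors + linear SVM.
sbert = SentenceTransformer("all-mpnet-base-v2")  # assumed checkpoint
embeddings = sbert.encode(queries)                # one dense vector per query
sbert_model = SVC(kernel="linear").fit(embeddings, labels)

# New queries are triaged by embedding them the same way and predicting severity.
print(sbert_model.predict(sbert.encode(["Severe chest pain and trouble breathing."])))
```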

RESULTS

We found that a simple lexical model achieved a mean F1 score of 0.865 (SD 0.048) on the telemedical triage task. GloVe-based models using SVMs, HANs, and bi-LSTMs achieved a 0.8-, 1.5-, and 2.1-point increase in the F1 score, respectively. Transformer-based models, such as BERT, Bio+Clinical-BERT, and SBERT, achieved mean F1 scores of 0.914 (SD 0.034), 0.904 (SD 0.041), and 0.917 (SD 0.037), respectively. The highest-performing model, SBERT, provided a statistically significant improvement over all GloVe-based and lexical baselines. However, no statistically significant differences were found among the transformer-based models. Furthermore, our error analysis revealed highly challenging query types, including those involving complex negations, temporal relationships, and patient intents.
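The mean F1 scores and standard deviations above come from repeated evaluation runs, and the model comparisons were tested for statistical significance. The abstract does not name the test used, so the sketch below uses a paired t-test over hypothetical fold-level F1 scores purely to illustrate that kind of comparison.

```python
# Illustrative only: compare two models' fold-level F1 scores with a paired
# t-test. The scores and the choice of test are assumptions, not the paper's.
from scipy import stats

sbert_f1 = [0.93, 0.91, 0.95, 0.89, 0.90]    # hypothetical SBERT scores per fold
lexical_f1 = [0.87, 0.84, 0.90, 0.83, 0.86]  # hypothetical TF-IDF baseline scores

t_stat, p_value = stats.ttest_rel(sbert_f1, lexical_f1)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")
```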

CONCLUSIONS

We showed that state-of-the-art transfer learning techniques work well on the telemedical triage task, providing a significant performance increase over lexical models. Additionally, we released a public telemedical triage data set using labeled questions from online medical question-and-answer (Q&A) platforms. Our analysis highlights various avenues for future work that explicitly models such query challenges.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d295/9446665/d19cc43461ef/medinform_v10i9e37770_fig1.jpg
