Vanderbilt University, Nashville, TN, 37240, USA.
Brown University, Providence, RI, 02903, USA.
Sci Rep. 2024 Jul 12;14(1):16117. doi: 10.1038/s41598-024-66319-z.
Patient portal messages often relate to specific clinical phenomena (e.g., patients undergoing treatment for breast cancer) and, as a result, have received increasing attention in biomedical research. These messages require natural language processing and, while word embedding models, such as word2vec, have the potential to extract meaningful signals from text, they are not readily applicable to patient portal messages. This is because embedding models typically require millions of training samples to sufficiently represent semantics, while the volume of patient portal messages associated with a particular clinical phenomenon is often relatively small. We introduce a novel adaptation of the word2vec model, PK-word2vec (where PK stands for prior knowledge), for small-scale messages. PK-word2vec incorporates the most similar terms for medical words (including problems, treatments, and tests) and non-medical words from two pre-trained embedding models as prior knowledge to improve the training process. We applied PK-word2vec in a case study of patient portal messages in the Vanderbilt University Medical Center electronic health record system sent by patients diagnosed with breast cancer from December 2004 to November 2017. We evaluated the model through a set of 1000 tasks, each of which compared the relevance of a given word to a group of the five most similar words generated by PK-word2vec and a group of the five most similar words generated by the standard word2vec model. We recruited 200 Amazon Mechanical Turk (AMT) workers and 7 medical students to perform the tasks. The dataset was composed of 1389 patient records and included 137,554 messages with 10,683 unique words. Prior knowledge was available for 7981 non-medical and 1116 medical words.
In over 90% of the tasks, both reviewer groups indicated that PK-word2vec generated more similar words than standard word2vec (p = 0.01). The difference between the evaluations of AMT workers and medical students was negligible across all comparisons of task choices between the two groups of reviewers (under a paired t-test). PK-word2vec can effectively learn word representations from a small message corpus, marking a significant advancement in processing patient portal messages.
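The abstract does not specify the exact PK-word2vec training objective, but the core idea it describes — injecting each word's most similar terms from pre-trained embeddings as prior knowledge into a word2vec-style training loop — can be illustrated with a minimal sketch. In the version below, a standard skip-gram model with negative sampling is trained on corpus co-occurrence pairs, and each prior-knowledge neighbor is treated as an extra positive context for its word. The function name `train_pk_skipgram` and the strategy of adding neighbor pairs directly to the training set are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def train_pk_skipgram(sentences, prior_knowledge, dim=16, window=2,
                      neg=3, epochs=20, lr=0.05, seed=0):
    """Minimal skip-gram with negative sampling, augmented with
    prior-knowledge neighbor pairs (an illustrative sketch of the PK
    idea, not the paper's exact objective)."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    W_in = rng.normal(0.0, 0.1, (V, dim))   # input (word) vectors
    W_out = rng.normal(0.0, 0.1, (V, dim))  # output (context) vectors

    # Corpus-derived (center, context) positive pairs.
    pairs = []
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if i != j:
                    pairs.append((idx[w], idx[s[j]]))

    # Prior-knowledge pairs: each neighbor suggested by the pre-trained
    # models becomes an additional positive context for the word.
    for w, neighbors in prior_knowledge.items():
        for n in neighbors:
            if w in idx and n in idx:
                pairs.append((idx[w], idx[n]))

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for _ in range(epochs):
        rng.shuffle(pairs)
        for c, o in pairs:
            # One positive target plus `neg` uniformly sampled negatives.
            targets = [(o, 1.0)] + [(int(n), 0.0)
                                    for n in rng.integers(0, V, size=neg)]
            for t, label in targets:
                score = sigmoid(W_in[c] @ W_out[t])
                g = lr * (label - score)
                grad_in = g * W_out[t]      # compute before updating W_out
                W_out[t] += g * W_in[c]
                W_in[c] += grad_in
    return vocab, idx, W_in
```

With a small message corpus, the extra prior-knowledge pairs give rare medical terms additional training signal that the corpus alone cannot supply, which is the intuition the abstract attributes to PK-word2vec.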