Double-target self-supervised clustering with multi-feature fusion for medical question texts.

Author information

Shen Xifeng, Sun Yuanyuan, Zhang Chunxia, Yang Cheng, Qin Yi, Zhang Weining, Nan Jiale, Che Meiling, Gao Dongping

Affiliations

Institute of Medical Information, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China.

National Center for Healthcare Quality Management in Rare Diseases, Virtual Human Platform, National Infrastructures for Translational Medicine, Institute of Clinical Medicine & Chinese Academy of Medical Sciences and Peking Union Medical College Hospital, Beijing, China.

Publication information

PeerJ Comput Sci. 2024 Jun 28;10:e2075. doi: 10.7717/peerj-cs.2075. eCollection 2024.

DOI:10.7717/peerj-cs.2075

PMID:39669457

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11637006/

Abstract

BACKGROUND

To make question texts carry more information and to build an end-to-end text clustering model, we propose double-target self-supervised clustering with multi-feature fusion (MF-DSC) for texts that describe medical questions. Because medical question-and-answer data are unstructured, short, and irregular in language use, the features extracted by a single model cannot fully characterize the text content.

METHODS

First, word weights were obtained from term frequency, and word vectors were generated from lexical semantic information. We then fused term frequency and lexical semantics to obtain weighted word vectors, which were used as input to the deep learning model. In addition, a self-attention mechanism was introduced to calculate the weight of each word in the question text, that is, the interactions between words. To learn and fuse cross-document topic features and build an end-to-end text clustering model, two objective functions, L_cluster and L_topic, were constructed and integrated into a unified clustering framework, which also helped to learn a representation that facilitates text clustering. Finally, we ran comparison experiments against five other models to verify the effectiveness of MF-DSC.
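The feature-fusion and self-attention steps can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact weighting formula, so plain relative term frequency is assumed here, the attention uses no learned projection matrices, and the embeddings are random placeholders. The names `tf_weights`, `weighted_word_vectors`, and `self_attention` are hypothetical.

```python
import math
from collections import Counter

import numpy as np


def tf_weights(tokens):
    """Relative term frequency of each token within one question (assumed weighting)."""
    counts = Counter(tokens)
    total = len(tokens)
    return np.array([counts[t] / total for t in tokens])


def weighted_word_vectors(tokens, embeddings):
    """Fuse term frequency and lexical semantics by scaling each word vector."""
    vectors = np.stack([embeddings[t] for t in tokens])
    return vectors * tf_weights(tokens)[:, None]


def self_attention(x):
    """Scaled dot-product self-attention without learned projections."""
    scores = x @ x.T / math.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights


rng = np.random.default_rng(0)
tokens = ["chest", "pain", "after", "chest", "surgery"]
# Placeholder 8-dimensional embeddings; a real system would load trained vectors.
embeddings = {t: rng.normal(size=8) for t in sorted(set(tokens))}

x = weighted_word_vectors(tokens, embeddings)
attended, attn = self_attention(x)
print(attended.shape)     # one attended vector per token
print(attn.sum(axis=1))   # each row of attention weights sums to 1
```

Each row of `attn` holds the weights one word assigns to every word in the question, which is one way to read "the interactions between words" in the description above.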

RESULTS

MF-DSC outperformed the other models on normalized mutual information (NMI), adjusted Rand index (ARI), average clustering accuracy (ACC), and F1, scoring 0.4346, 0.4934, 0.8649, and 0.5737, respectively.
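For readers unfamiliar with these metrics, the sketch below shows how NMI, ARI, and clustering accuracy are commonly computed from true labels and predicted cluster ids. It is an illustration under assumptions, not the paper's evaluation code: NMI uses arithmetic-mean normalization, accuracy is found by brute force over one-to-one cluster-to-label mappings (practical only for a handful of clusters), F1 is omitted, and the toy labels are invented.

```python
import math
from collections import Counter
from itertools import permutations


def _entropy(counts, n):
    return -sum(c / n * math.log(c / n) for c in counts.values())


def nmi(y_true, y_pred):
    """Normalized mutual information (arithmetic-mean normalization)."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    pt, pp = Counter(y_true), Counter(y_pred)
    mi = sum(c / n * math.log(c * n / (pt[t] * pp[p]))
             for (t, p), c in joint.items())
    denom = (_entropy(pt, n) + _entropy(pp, n)) / 2
    return mi / denom if denom > 0 else 1.0


def ari(y_true, y_pred):
    """Adjusted Rand index from pair counts of the contingency table."""
    def pairs(m):
        return m * (m - 1) / 2

    n = len(y_true)
    index = sum(pairs(c) for c in Counter(zip(y_true, y_pred)).values())
    a = sum(pairs(c) for c in Counter(y_true).values())
    b = sum(pairs(c) for c in Counter(y_pred).values())
    expected = a * b / pairs(n)
    max_index = (a + b) / 2
    if max_index == expected:
        return 1.0
    return (index - expected) / (max_index - expected)


def clustering_accuracy(y_true, y_pred):
    """Best accuracy over one-to-one mappings of cluster ids to labels."""
    clusters = sorted(set(y_pred))
    labels = sorted(set(y_true))
    # Pad with None so surplus clusters can stay unmatched.
    labels = labels + [None] * max(0, len(clusters) - len(labels))
    best = 0
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(y_true, y_pred))
        best = max(best, hits)
    return best / len(y_true)


# Invented toy labels: two topics, one question placed in the wrong cluster.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 0, 0, 0, 0]
print(nmi(y_true, y_pred), ari(y_true, y_pred), clustering_accuracy(y_true, y_pred))
```

Cluster ids are arbitrary, so all three metrics are invariant to relabeling the clusters; accuracy handles this explicitly through the mapping search, which larger problems usually replace with the Hungarian algorithm.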

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3daf/11637006/b16012bee44c/peerj-cs-10-2075-g001.jpg
