Shen Xifeng, Sun Yuanyuan, Zhang Chunxia, Yang Cheng, Qin Yi, Zhang Weining, Nan Jiale, Che Meiling, Gao Dongping
Institute of Medical Information, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China.
National Center for Healthcare Quality Management in Rare Diseases, Virtual Human Platform, National Infrastructures for Translational Medicine, Institute of Clinical Medicine & Chinese Academy of Medical Sciences and Peking Union Medical College Hospital, Beijing, China.
PeerJ Comput Sci. 2024 Jun 28;10:e2075. doi: 10.7717/peerj-cs.2075. eCollection 2024.
To enrich the representation of question text and build an end-to-end text clustering model, we propose a double-target self-supervised clustering model with multi-feature fusion (MF-DSC) for texts that describe medical questions. Because medical question-and-answer data are unstructured, short, and irregular in language use, the features extracted by a single model cannot fully characterize the text content.
First, word weights were obtained from term frequency, and word vectors were generated from lexical semantic information. We then fused term frequency and lexical semantics into weighted word vectors, which served as input to the deep learning model. A self-attention mechanism was also introduced to calculate the weight of each word in the question text, that is, the interactions between words. To learn fused cross-document topic features and build an end-to-end text clustering model, two objective functions, L_cluster and L_topic, were constructed and integrated into a unified clustering framework, which also helped the model learn a representation well suited to text clustering. Finally, we ran comparison experiments against five other models to verify the effectiveness of MF-DSC.
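As an illustration of the fusion and attention steps described above, the following is a minimal sketch rather than the authors' implementation. It assumes plain relative term frequency as the word weight and scaled dot-product self-attention with identity projections (Q = K = V); the abstract does not specify either choice. The function names and the toy embeddings are hypothetical.

```python
import math
from collections import Counter


def tf_weights(tokens):
    """Relative term frequency of each distinct token in one question."""
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: c / total for tok, c in counts.items()}


def weighted_vectors(tokens, embeddings):
    """Scale each token's embedding by its term-frequency weight.

    `embeddings` is a hypothetical lookup table (token -> list of floats);
    in practice it would come from a pretrained word-vector model.
    """
    w = tf_weights(tokens)
    return [[w[t] * x for x in embeddings[t]] for t in tokens]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Assumption: no learned Q/K/V matrices, so each output row is an
    attention-weighted average of the input rows.
    """
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        attn = softmax(scores)
        out.append([sum(a * v[i] for a, v in zip(attn, vectors)) for i in range(d)])
    return out


if __name__ == "__main__":
    tokens = ["fever", "cough", "fever"]
    toy_embeddings = {"fever": [1.0, 0.0], "cough": [0.0, 1.0]}  # toy values
    fused = weighted_vectors(tokens, toy_embeddings)
    print(self_attention(fused))
```

In this sketch the term-frequency weight rescales each word vector before attention, so frequent words contribute larger vectors to the attention scores; a trained model would add learned projections on top.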
MF-DSC outperformed the other models in normalized mutual information (NMI), adjusted Rand index (ARI), average clustering accuracy (ACC), and F1, scoring 0.4346, 0.4934, 0.8649, and 0.5737, respectively.
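For readers unfamiliar with these metrics, here is a minimal sketch of how two of them, ACC and NMI, are commonly computed. It is not the authors' evaluation code. It assumes ACC is the best accuracy over one-to-one mappings from clusters to labels (found by brute force here, so it only suits a small number of clusters) and that NMI is normalized by the arithmetic mean of the two entropies; the abstract does not state which variants were used.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(y_true, y_pred):
    """Best accuracy over all one-to-one cluster-to-label mappings.

    Brute force over permutations, so only practical for small k.
    """
    labels = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    # Pad with None so that extra clusters map to "no label" (never correct).
    labels = labels + [None] * max(0, len(clusters) - len(labels))
    best = 0
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(y_true, y_pred))
        best = max(best, hits)
    return best / len(y_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(y_true, y_pred):
    """Mutual information divided by the mean of the two entropies."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    ct, cp = Counter(y_true), Counter(y_pred)
    mi = sum(
        (c / n) * math.log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items()
    )
    denom = (entropy(y_true) + entropy(y_pred)) / 2
    return mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    pred = [1, 1, 0, 0, 2, 2]  # same partition, different cluster ids
    print(clustering_accuracy(truth, pred), nmi(truth, pred))
```

Both metrics ignore the arbitrary numbering of clusters, which is why a relabeled but otherwise identical partition scores as a perfect match.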