School of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea.
School of Computer Science, Northeast Electric Power University, Jilin 132013, China.
Int J Environ Res Public Health. 2022 May 12;19(10):5893. doi: 10.3390/ijerph19105893.
The rapid growth of biomedical literature has increased the volume of natural language text available to current applications. Over the last decade, there has also been great interest in extracting useful information from meaningful, coherent groupings of text documents. However, because document clustering is unsupervised, it is challenging to discover informative representations and identify relevant articles in the rapidly growing biomedical literature. Moreover, empirical investigations have demonstrated that traditional text clustering methods produce unsatisfactory results when they rely on non-contextualized vector space representations, because such representations neglect the semantic relationships between biomedical texts. Recently, pre-trained language models have proven successful in a wide range of natural language processing applications. In this paper, we propose an efficient clustering framework based on the Gaussian Mixture Model that incorporates domain-specific language representations from the extensively pre-trained BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) model to enhance clustering accuracy. Our proposed framework consists of three main phases. First, classic text pre-processing techniques are applied to biomedical document data crawled from the PubMed repository. Second, representative vectors are extracted from a pre-trained BioBERT language model for biomedical text mining. Third, we employ the Gaussian Mixture Model as the clustering algorithm, which allows us to assign a label to each biomedical document. To demonstrate the efficiency of our proposed model, we conducted a comprehensive experimental analysis using several clustering algorithms combined with diverse embedding techniques. 
The experimental results show that the proposed model outperforms the benchmark models, reaching a Fowlkes-Mallows score, silhouette coefficient, adjusted Rand index, and Davies-Bouldin score of 0.7817, 0.3765, 0.4478, and 1.6849, respectively. We expect that the outcomes of this study will assist domain specialists in comprehending thematically cohesive documents in the healthcare field.
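
To make the two label-based measures above concrete, the following is a minimal sketch of how the Fowlkes-Mallows score and the adjusted Rand index can be computed from pair counts, given reference labels and predicted cluster labels. It is not the authors' evaluation code: the function names (`pair_counts`, `fowlkes_mallows`, `adjusted_rand`) and the toy labels are hypothetical and chosen only for illustration. The silhouette coefficient and the Davies-Bouldin score are not covered here because they depend on the embedding vectors rather than on labels alone.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(labels_true, labels_pred):
    """Count document pairs by agreement between two labelings.

    Returns (tp, fp, fn, total), where tp is the number of pairs grouped
    together in both labelings, fp the pairs grouped together only in the
    prediction, fn the pairs grouped together only in the reference, and
    total the number of unordered pairs.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    # Pairs that share both the reference label and the predicted label.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    return same_both, same_pred - same_both, same_true - same_both, comb(n, 2)


def fowlkes_mallows(labels_true, labels_pred):
    """Geometric mean of pairwise precision and pairwise recall."""
    tp, fp, fn, _ = pair_counts(labels_true, labels_pred)
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


def adjusted_rand(labels_true, labels_pred):
    """Pair agreement corrected for the agreement expected by chance."""
    tp, fp, fn, total = pair_counts(labels_true, labels_pred)
    if total == 0:
        return 1.0  # fewer than two documents: trivially identical
    same_pred = tp + fp
    same_true = tp + fn
    expected = same_true * same_pred / total
    max_index = (same_true + same_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (tp - expected) / (max_index - expected)


if __name__ == "__main__":
    # Hypothetical toy labels: six documents, two reference topics,
    # three predicted clusters.
    reference = [0, 0, 0, 1, 1, 1]
    predicted = [0, 0, 1, 1, 2, 2]
    print("Fowlkes-Mallows:", round(fowlkes_mallows(reference, predicted), 4))
    print("Adjusted Rand:  ", round(adjusted_rand(reference, predicted), 4))
```

Both measures are invariant to how cluster identifiers are numbered, which is why they suit the output of a mixture model whose component indices carry no inherent meaning.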