Discovering Thematically Coherent Biomedical Documents Using Contextualized Bidirectional Encoder Representations from Transformers-Based Clustering.

Affiliations

School of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea.

School of Computer Science, Northeast Electric Power University, Jilin 132013, China.

Publication information

Int J Environ Res Public Health. 2022 May 12;19(10):5893. doi: 10.3390/ijerph19105893.

DOI:10.3390/ijerph19105893
PMID:35627429
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9141535/
Abstract

The rapid growth of biomedical literature has increased the volume of natural language text available to current applications. Over the last decade, there has also been great interest in extracting useful information from meaningful, coherent groupings of documents. However, because document clustering is unsupervised, it is challenging to discover informative representations and identify relevant articles in the rapidly growing biomedical literature. Moreover, empirical investigations have demonstrated that traditional text clustering methods produce unsatisfactory results because their non-contextualized vector space representations neglect the semantic relationships between biomedical texts. Recently, pre-trained language models have proven successful in a wide range of natural language processing applications. In this paper, we propose an efficient clustering framework based on the Gaussian Mixture Model that incorporates domain-specific language representations from the pre-trained BioBERT model (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) to enhance clustering accuracy. The proposed framework consists of three main phases. First, classic text pre-processing techniques are applied to biomedical document data crawled from the PubMed repository. Second, representative vectors are extracted from a pre-trained BioBERT language model for biomedical text mining. Third, we employ the Gaussian Mixture Model as the clustering algorithm, which allows us to assign a label to each biomedical document. To demonstrate the efficiency of the proposed model, we conducted a comprehensive experimental analysis using several clustering algorithms combined with diverse embedding techniques. The experimental results show that the proposed model outperforms the benchmark models, reaching a Fowlkes-Mallows score, silhouette coefficient, adjusted Rand index, and Davies-Bouldin score of 0.7817, 0.3765, 0.4478, and 1.6849, respectively. We expect the outcomes of this study to assist domain specialists in comprehending thematically cohesive documents in the healthcare field.
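Two of the four reported measures, the Fowlkes-Mallows score and the adjusted Rand index, compare predicted cluster labels against reference labels by counting document pairs. The sketch below shows one way to compute them from scratch using only the Python standard library. It is an illustration, not the authors' evaluation code: the function names and the toy label lists are assumptions introduced here. The silhouette coefficient and Davies-Bouldin score are not covered, since they also require the embedding vectors.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(true_labels, pred_labels):
    """Count document pairs by agreement between two labelings.

    Returns (tp, fp, fn):
      tp: pairs in the same cluster and the same reference class
      fp: pairs in the same cluster but different reference classes
      fn: pairs in the same reference class but different clusters
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    joint = Counter(zip(true_labels, pred_labels))
    same_both = sum(comb(n, 2) for n in joint.values())
    same_true = sum(comb(n, 2) for n in Counter(true_labels).values())
    same_pred = sum(comb(n, 2) for n in Counter(pred_labels).values())
    return same_both, same_pred - same_both, same_true - same_both


def fowlkes_mallows(true_labels, pred_labels):
    """Geometric mean of pairwise precision and pairwise recall."""
    tp, fp, fn = pair_counts(true_labels, pred_labels)
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


def adjusted_rand_index(true_labels, pred_labels):
    """Pair agreement corrected for the agreement expected by chance."""
    tp, fp, fn = pair_counts(true_labels, pred_labels)
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two documents: labelings trivially agree
    expected = (tp + fp) * (tp + fn) / total_pairs
    max_index = ((tp + fp) + (tp + fn)) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every document in one group
    return (tp - expected) / (max_index - expected)


# Hypothetical toy data: six documents, two reference topics,
# and a clustering that misplaces one document.
reference = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]

print("Fowlkes-Mallows:", round(fowlkes_mallows(reference, predicted), 4))
print("Adjusted Rand index:", round(adjusted_rand_index(reference, predicted), 4))
```

Both measures depend only on which documents are grouped together, so renaming the cluster labels leaves the scores unchanged. For both, higher values indicate closer agreement with the reference labels.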

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1607/9141535/e389802ef580/ijerph-19-05893-g007a.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1607/9141535/7a2ede6b147c/ijerph-19-05893-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1607/9141535/241eddb10fc0/ijerph-19-05893-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1607/9141535/56230676439a/ijerph-19-05893-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1607/9141535/8981dce09953/ijerph-19-05893-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1607/9141535/71b897bf04c4/ijerph-19-05893-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1607/9141535/32887769cc2b/ijerph-19-05893-g006.jpg
