WEClustering: word embeddings based text clustering technique for large datasets.

Authors

Mehta Vivek, Bawa Seema, Singh Jasmeet

Affiliation

Computer Science and Engineering Department, Thapar Institute of Engineering and Technology, Patiala, Punjab 147001 India.

Publication information

Complex Intell Systems. 2021;7(6):3211-3224. doi: 10.1007/s40747-021-00512-9. Epub 2021 Sep 7.

DOI: 10.1007/s40747-021-00512-9
PMID: 34777978
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8421191/
Abstract

A massive amount of textual data now exists in digital repositories in the form of research articles, news articles, reviews, Wikipedia articles, books, and similar sources. Text clustering is a fundamental data mining technique for categorization, topic extraction, and information retrieval. Textual datasets, especially those that contain a large number of documents, are sparse and high-dimensional. Hence, traditional clustering techniques such as K-means, agglomerative clustering, and DBSCAN do not perform well on them. This paper proposes a clustering technique, named WEClustering, that is especially suited to large text datasets and overcomes these limitations. It is based on word embeddings derived from a recent deep learning model, Bidirectional Encoder Representations from Transformers (BERT). The technique deals effectively with the problem of high dimensionality, and hence forms more accurate clusters. It is validated on several datasets of varying sizes, and its performance is compared with other widely used and state-of-the-art clustering techniques. The experimental comparison shows that the proposed technique gives a significant improvement over the other techniques as measured by metrics such as Purity and Adjusted Rand Index.
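The two evaluation metrics named in the abstract, Purity and Adjusted Rand Index, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy labels, and the handling of degenerate inputs are assumptions made for illustration, and the formulas are the standard textbook definitions of the two metrics.

```python
from collections import Counter
from math import comb


def purity(true_labels, cluster_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        clusters.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(true_labels)


def adjusted_rand_index(true_labels, cluster_labels):
    """Pair-counting agreement between two partitions, corrected for chance."""
    n = len(true_labels)
    if n < 2:  # assumption: treat trivially small inputs as perfect agreement
        return 1.0
    # Pairs of documents that share both a class and a cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, cluster_labels)).values())
    # Pairs that share a class, and pairs that share a cluster.
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(cluster_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate partitions, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Hypothetical toy data: six documents, two true topics, two predicted clusters.
    truth = ["sport", "sport", "sport", "tech", "tech", "tech"]
    predicted = [0, 0, 1, 1, 1, 1]
    print("purity:", round(purity(truth, predicted), 3))
    print("ARI:", round(adjusted_rand_index(truth, predicted), 3))
```

On the toy labels, one sport document is placed in the tech-dominated cluster, so purity is 5/6 and the Adjusted Rand Index is about 0.32. Purity rewards clusters dominated by one class but can be inflated by producing many small clusters, while the Adjusted Rand Index corrects for chance agreement, which is presumably why the two are reported together.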


Figures

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e660/8421191/fbaec2dfae30/40747_2021_512_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e660/8421191/02192170ecaa/40747_2021_512_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e660/8421191/1df80f5ce31a/40747_2021_512_Fig3_HTML.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e660/8421191/95f39007dacc/40747_2021_512_Fig4_HTML.jpg
Fig. 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e660/8421191/04bd78e4ad97/40747_2021_512_Fig5_HTML.jpg
Fig. 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e660/8421191/015556f3b9b1/40747_2021_512_Fig6_HTML.jpg
Fig. 7: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e660/8421191/4f72041efe85/40747_2021_512_Fig7_HTML.jpg
Fig. 8: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e660/8421191/080fabbfc034/40747_2021_512_Fig8_HTML.jpg
Fig. 9: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e660/8421191/a373c98cf70b/40747_2021_512_Fig9_HTML.jpg

Similar articles

1. WEClustering: word embeddings based text clustering technique for large datasets.
   Complex Intell Systems. 2021;7(6):3211-3224. doi: 10.1007/s40747-021-00512-9. Epub 2021 Sep 7.
2. An integrated clustering and BERT framework for improved topic modeling.
   Int J Inf Technol. 2023;15(4):2187-2195. doi: 10.1007/s41870-023-01268-w. Epub 2023 May 6.
3. The performance of BERT as data representation of text clustering.
   J Big Data. 2022;9(1):15. doi: 10.1186/s40537-022-00564-9. Epub 2022 Feb 8.
4. Discovering Thematically Coherent Biomedical Documents Using Contextualized Bidirectional Encoder Representations from Transformers-Based Clustering.
   Int J Environ Res Public Health. 2022 May 12;19(10):5893. doi: 10.3390/ijerph19105893.
5. Self-Taught convolutional neural networks for short text clustering.
   Neural Netw. 2017 Apr;88:22-31. doi: 10.1016/j.neunet.2016.12.008. Epub 2017 Jan 12.
6. A comparison of word embeddings for the biomedical natural language processing.
   J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
7. BERT-based Ranking for Biomedical Entity Normalization.
   AMIA Jt Summits Transl Sci Proc. 2020 May 30;2020:269-277. eCollection 2020.
8. Textual emotion detection utilizing a transfer learning approach.
   J Supercomput. 2023 Mar 22:1-15. doi: 10.1007/s11227-023-05168-5.
9. Deep contextualized embeddings for quantifying the informative content in biomedical text summarization.
   Comput Methods Programs Biomed. 2020 Feb;184:105117. doi: 10.1016/j.cmpb.2019.105117. Epub 2019 Oct 4.
10. An Improved B-hill Climbing Optimization Technique for Solving the Text Documents Clustering Problem.
    Curr Med Imaging. 2020;16(4):296-306. doi: 10.2174/1573405614666180903112541.

Cited by

1. Benchmarking Transformer Embedding Models for Biomedical Terminology Standardization.
   Mach Learn Appl. 2025 Sep;21. doi: 10.1016/j.mlwa.2025.100683. Epub 2025 Jun 5.