A multi-view representation technique based on principal component analysis for enhanced short text clustering.

Affiliations

CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia.

Ministry of Higher Education and Scientific Research, Baghdad, Iraq.

Publication information

PLoS One. 2024 Aug 23;19(8):e0309206. doi: 10.1371/journal.pone.0309206. eCollection 2024.

DOI: 10.1371/journal.pone.0309206
PMID: 39178180
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11343383/
Abstract

Text clustering is an essential task in data mining and information retrieval. Its aim is to group unlabeled texts into meaningful clusters that make it easier to extract and understand useful information from large volumes of textual data. Short text clustering (STC) is difficult, however, because short texts are typically sparse, ambiguous, noisy, and lacking in information. One of the challenges for STC is finding a proper representation for short text documents so that the resulting clusters are cohesive. STC typically relies on a single-view representation, which is inefficient because it cannot capture different aspects of the target text. In this paper, we propose the most suitable multi-view representation (MVR), found by searching for the best combination of different single-view representations, to enhance STC. Our work explores different types of MVR built from different sets of single-view representations. The single-view representations are combined by fixed-length concatenation via the principal component analysis (PCA) technique. Three standard datasets (Twitter, Google News, and StackOverflow) are used to evaluate the performance of the various MVRs on STC. In the experiments, the most effective combination of single-view representations for STC was the 5-view MVR (a combination of BERT, GPT, TF-IDF, FastText, and GloVe). We conclude that MVR improves the performance of STC; however, designing an MVR requires careful selection of the single-view representations.
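The abstract does not spell out how the "fixed length concatenation via PCA" is carried out. The sketch below shows one plausible reading, stated here as an assumption rather than the authors' method: each single-view matrix is reduced to the same number of principal components, and the reduced views are then concatenated column-wise. The function names, the random stand-in views, and the choice of 4 components are all hypothetical; in practice the views would be real embeddings such as TF-IDF or BERT vectors.

```python
import numpy as np


def pca_reduce(view: np.ndarray, n_components: int) -> np.ndarray:
    """Project one view (documents x features) onto its top principal components."""
    centered = view - view.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions,
    # ordered from largest to smallest explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def multi_view_representation(views, n_components: int) -> np.ndarray:
    """Assumed scheme: reduce every view to a fixed length, then concatenate."""
    reduced = [pca_reduce(v, n_components) for v in views]
    return np.hstack(reduced)


# Hypothetical stand-ins for three single-view representations of 6 documents,
# each with a different native dimensionality.
rng = np.random.default_rng(0)
n_docs = 6
views = [
    rng.normal(size=(n_docs, 50)),
    rng.normal(size=(n_docs, 20)),
    rng.normal(size=(n_docs, 8)),
]

mvr = multi_view_representation(views, n_components=4)
print(mvr.shape)  # (6, 12): 3 views x 4 components each
```

Reducing each view before concatenation keeps any one high-dimensional view from dominating the combined vector. Concatenating first and applying PCA once to the joint matrix is an equally plausible reading of the abstract. Either way, the resulting matrix can be passed to any standard clustering algorithm.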

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5565/11343383/78ca609c12b4/pone.0309206.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5565/11343383/4dad62beadc8/pone.0309206.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5565/11343383/cc44e8f2c7fb/pone.0309206.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5565/11343383/b8a1e2a1f5da/pone.0309206.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5565/11343383/23271c06c017/pone.0309206.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5565/11343383/b8259a7b9baf/pone.0309206.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5565/11343383/556d0d54a71f/pone.0309206.g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5565/11343383/c0cdfd259b23/pone.0309206.g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5565/11343383/56a079c539f7/pone.0309206.g009.jpg

Similar articles

1
A multi-view representation technique based on principal component analysis for enhanced short text clustering.
PLoS One. 2024 Aug 23;19(8):e0309206. doi: 10.1371/journal.pone.0309206. eCollection 2024.
2
Discovering Thematically Coherent Biomedical Documents Using Contextualized Bidirectional Encoder Representations from Transformers-Based Clustering.
Int J Environ Res Public Health. 2022 May 12;19(10):5893. doi: 10.3390/ijerph19105893.
3
Self-Taught convolutional neural networks for short text clustering.
Neural Netw. 2017 Apr;88:22-31. doi: 10.1016/j.neunet.2016.12.008. Epub 2017 Jan 12.
4
Evaluation of clustering and topic modeling methods over health-related tweets and emails.
Artif Intell Med. 2021 Jul;117:102096. doi: 10.1016/j.artmed.2021.102096. Epub 2021 May 7.
5
The performance of BERT as data representation of text clustering.
J Big Data. 2022;9(1):15. doi: 10.1186/s40537-022-00564-9. Epub 2022 Feb 8.
6
Text classification algorithm of tourist attractions subcategories with modified TF-IDF and Word2Vec.
PLoS One. 2024 Oct 18;19(10):e0305095. doi: 10.1371/journal.pone.0305095. eCollection 2024.
7
WEClustering: word embeddings based text clustering technique for large datasets.
Complex Intell Systems. 2021;7(6):3211-3224. doi: 10.1007/s40747-021-00512-9. Epub 2021 Sep 7.
8
Solving text clustering problem using a memetic differential evolution algorithm.
PLoS One. 2020 Jun 11;15(6):e0232816. doi: 10.1371/journal.pone.0232816. eCollection 2020.
9
Enhancing web search result clustering model based on multiview multirepresentation consensus cluster ensemble (mmcc) approach.
PLoS One. 2021 Jan 15;16(1):e0245264. doi: 10.1371/journal.pone.0245264. eCollection 2021.
10
Beyond Low-Rank Representations: Orthogonal clustering basis reconstruction with optimized graph structure for multi-view spectral clustering.
Neural Netw. 2018 Jul;103:1-8. doi: 10.1016/j.neunet.2018.03.006. Epub 2018 Mar 20.
