Human-interpretable clustering of short text using large language models.

Author information

Miller Justin K, Alexander Tristram J

Affiliation

School of Physics, The University of Sydney, Sydney, Australia.

Publication information

R Soc Open Sci. 2025 Jan 22;12(1):241692. doi: 10.1098/rsos.241692. eCollection 2025 Jan.

DOI:10.1098/rsos.241692

PMID:39845717

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11750404/

Abstract

Clustering short text is a difficult problem, owing to the low word co-occurrence between short text documents. This work shows that large language models (LLMs) can overcome the limitations of traditional clustering approaches by generating embeddings that capture the semantic nuances of short text. In this study, clusters are found in the embedding space using Gaussian mixture modelling. The resulting clusters are found to be more distinctive and more human-interpretable than clusters produced using the popular methods of doc2vec and latent Dirichlet allocation. The success of the clustering approach is quantified using human reviewers and through the use of a generative LLM. The generative LLM shows good agreement with the human reviewers and is suggested as a means to bridge the 'validation gap' which often exists between cluster production and cluster interpretation. The comparison between LLM coding and human coding reveals intrinsic biases in each, challenging the conventional reliance on human coding as the definitive standard for cluster validation.
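
The pipeline described in the abstract (embed each text, then fit a Gaussian mixture in the embedding space) can be illustrated with a small sketch. This is not the authors' implementation: the 2-D vectors below are synthetic stand-ins for LLM embeddings, and the diagonal covariance, the choice of two components, the farthest-first initialisation, and all function names are assumptions made for illustration. In practice a library routine such as scikit-learn's `GaussianMixture` would normally replace the hand-written EM loop.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic initial means: start at points[0], then repeatedly
    add the point farthest from all means chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers


def fit_diag_gmm(points, k, n_iter=50):
    """Fit a k-component, diagonal-covariance Gaussian mixture with EM.
    Returns hard cluster labels and the fitted means."""
    n, dim = len(points), len(points[0])
    means = farthest_first(points, k)
    variances = [[1.0] * dim for _ in range(k)]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point,
        # computed in log space to avoid underflow.
        resp = []
        for x in points:
            log_p = []
            for j in range(k):
                lp = math.log(weights[j])
                for d in range(dim):
                    v = variances[j][d]
                    lp -= 0.5 * (math.log(2 * math.pi * v)
                                 + (x[d] - means[j][d]) ** 2 / v)
                log_p.append(lp)
            top = max(log_p)
            exps = [math.exp(lp - top) for lp in log_p]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)  # guard empty component
            weights[j] = nj / n
            for d in range(dim):
                mu = sum(r[j] * x[d] for r, x in zip(resp, points)) / nj
                var = sum(r[j] * (x[d] - mu) ** 2
                          for r, x in zip(resp, points)) / nj
                means[j][d] = mu
                variances[j][d] = max(var, 1e-6)  # variance floor
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return labels, means


# Synthetic stand-in "embeddings": two well-separated groups of 20 vectors.
rng = random.Random(0)
group_a = [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(20)]
group_b = [[rng.gauss(5.0, 0.5), rng.gauss(5.0, 0.5)] for _ in range(20)]
points = group_a + group_b

labels, means = fit_diag_gmm(points, k=2)
print(labels)
```

Each vector is assigned to the component with the highest responsibility. The soft responsibilities could be kept instead if graded cluster membership is of interest.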
