LDA filter: A Latent Dirichlet Allocation preprocess method for Weka.

Affiliations

Computer Science Dept., Univ. of Vigo, Escuela Superior de Ingeniería Informática, Ourense, Spain.

CINBIO - Biomedical Research Centre, Univ. of Vigo, Vigo, Spain.

Publication information

PLoS One. 2020 Nov 9;15(11):e0241701. doi: 10.1371/journal.pone.0241701. eCollection 2020.

Abstract

This work presents an alternative method for representing documents based on LDA (Latent Dirichlet Allocation) and examines how it affects classification algorithms in comparison with common text representations. LDA assumes that each document deals with a set of predefined topics, each of which is a distribution over the entire vocabulary. Our main objective is to use the probability of a document belonging to each topic to implement a new text representation model. The proposed technique is deployed as a new filter that extends the Weka software. To demonstrate its performance, the filter is tested with different classifiers, such as a Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Naive Bayes, on different document corpora (OHSUMED, Reuters-21578, 20Newsgroup, Yahoo! Answers, YELP Polarity, and TREC Genomics 2015). It is then compared with the Bag of Words (BoW) representation technique. Results suggest that applying the proposed filter achieves accuracy similar to BoW while greatly reducing classification processing times.
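The representation described in the abstract, in which each document becomes a vector of topic probabilities, can be illustrated with a minimal sketch. This is not the authors' Weka filter: the function name `lda_topic_features`, the hyperparameters `alpha` and `beta`, and the use of collapsed Gibbs sampling for inference are all assumptions made for illustration, and the abstract does not state which inference method the filter uses.

```python
import random


def lda_topic_features(docs, num_topics, alpha=0.1, beta=0.01,
                       iterations=200, seed=0):
    """Hypothetical helper: estimate one topic-probability vector per document.

    docs is a list of token lists. Inference uses collapsed Gibbs sampling,
    which is an assumption of this sketch, not a detail from the paper.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens of word w in topic k
    topic_total = [0] * num_topics                              # all tokens in topic k
    assignments = []                                            # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed document-topic proportions serve as the feature vectors.
    features = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        features.append([(doc_topic[d][t] + alpha) / denom
                         for t in range(num_topics)])
    return features


if __name__ == "__main__":
    toy_docs = [
        "gene protein cell gene".split(),
        "market stock price market".split(),
        "protein cell stock price".split(),
    ]
    for vector in lda_topic_features(toy_docs, num_topics=2):
        print([round(p, 3) for p in vector])
```

Each returned vector has `num_topics` entries that sum to 1, regardless of vocabulary size, whereas a BoW vector has one entry per vocabulary term. These vectors could then be passed to any classifier in place of BoW features.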
