LDA filter: A Latent Dirichlet Allocation preprocess method for Weka.

Affiliations

Computer Science Dept., Univ. of Vigo, Escuela Superior de Ingeniería Informática, Ourense, Spain.

CINBIO - Biomedical Research Centre, Univ. of Vigo, Vigo, Spain.

Publication information

PLoS One. 2020 Nov 9;15(11):e0241701. doi: 10.1371/journal.pone.0241701. eCollection 2020.

DOI:10.1371/journal.pone.0241701

PMID:33166342

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7652301/

Abstract

This work presents an alternative method for representing documents based on LDA (Latent Dirichlet Allocation) and examines how it affects classification algorithms compared with common text representations. LDA assumes that each document deals with a set of predefined topics, which are distributions over an entire vocabulary. Our main objective is to use the probability of a document belonging to each topic to implement a new text representation model. The proposed technique is deployed as an extension of the Weka software, in the form of a new filter. To demonstrate its performance, the filter is tested with several classifiers, including a Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Naive Bayes, on several document corpora (OHSUMED, Reuters-21578, 20Newsgroup, Yahoo! Answers, YELP Polarity, and TREC Genomics 2015). It is then compared with the Bag of Words (BoW) representation technique. Results suggest that the proposed filter achieves accuracy similar to BoW while greatly reducing classification processing time.
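The representation described above (one feature per topic, holding the probability that the document belongs to that topic) can be illustrated with a minimal sketch. This is not the authors' Weka filter, which is a Java extension; it is a small pure-Python collapsed Gibbs sampler written for illustration only. The function name `lda_topic_features`, the toy corpus, and the hyperparameter values (`alpha`, `beta`, `iters`) are all assumptions and do not come from the paper.

```python
import random


def lda_topic_features(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA (hypothetical helper).

    Returns one topic-probability vector per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Randomly assign an initial topic to every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample a topic in proportion to the collapsed conditional.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed document-topic proportions: the low-dimensional feature vectors.
    features = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        features.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return features


# Toy corpus (assumed for illustration): four tokenized documents.
docs = [
    "gene protein cell dna gene".split(),
    "protein cell dna rna gene".split(),
    "stock market price trade stock".split(),
    "market price bank trade stock".split(),
]
features = lda_topic_features(docs, num_topics=2)
for row in features:
    print([round(p, 3) for p in row])
```

Each document ends up as a vector whose length equals the number of topics rather than the vocabulary size. A BoW vector for this toy corpus would have ten dimensions, while the topic vector has two; a much smaller feature space of this kind is one plausible reason a classifier could run faster on the topic representation.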
