Defining and evaluating classification algorithm for high-dimensional data based on latent topics.

Authors

Luo Le, Li Li

Affiliation

Faculty of Computer and Information Science, Southwest University, Chongqing, China.

Publication information

PLoS One. 2014 Jan 9;9(1):e82119. doi: 10.1371/journal.pone.0082119. eCollection 2014.

DOI:10.1371/journal.pone.0082119

PMID:24416136

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3886981/

Abstract

Automatic text categorization is one of the key techniques in information retrieval and data mining. Classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few achieve satisfactory efficiency. In this paper, we present a method that combines the Latent Dirichlet Allocation (LDA) algorithm with the Support Vector Machine (SVM). LDA is first used to generate a reduced-dimensional representation of the documents, in which topics serve as the features of the vector space model (VSM). This reduces the number of features dramatically while keeping the necessary semantic information. The SVM is then employed to classify the data based on the generated features. We evaluate the algorithm on the 20 Newsgroups and Reuters-21578 datasets. The experimental results show that classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure, and that it does so in a much shorter time. Our process improves greatly upon previous work in this field and shows strong potential to streamline classification for a wide range of applications.
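The pipeline described in the abstract (bag-of-words counts, LDA topic proportions as reduced features, then an SVM, scored by precision, recall and F1) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the toy corpus, the labels, the choice of `LinearSVC`, and the value of `N_TOPICS` are all assumptions made for the example, and the abstract does not state which LDA inference method, SVM kernel, or topic count the paper used.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for 20 Newsgroups or Reuters-21578.
train_docs = [
    "the team won the hockey game in overtime",
    "the goalie made a great save in the hockey match",
    "the pitcher threw a fast ball in the baseball game",
    "the rocket launch put a satellite into orbit",
    "nasa plans a new mission to orbit the moon",
    "the shuttle crew repaired the space telescope",
]
train_labels = ["sport", "sport", "sport", "space", "space", "space"]
test_docs = ["the hockey team lost the game", "the satellite reached orbit"]
test_labels = ["sport", "space"]

# Assumed topic count for this toy data; in practice it is a tuning parameter.
N_TOPICS = 2

pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),   # documents -> term-count vectors
    LatentDirichletAllocation(n_components=N_TOPICS, random_state=0),  # counts -> topic proportions
    LinearSVC(),                             # linear SVM trained on the topic features
)
pipeline.fit(train_docs, train_labels)

# The first two steps alone give the reduced representation: one row per
# document, one column per topic, instead of one column per vocabulary term.
topic_features = pipeline[:-1].transform(test_docs)
predicted = pipeline.predict(test_docs)

# Macro-averaged precision, recall and F1 over the two classes.
precision, recall, f1, _ = precision_recall_fscore_support(
    test_labels, predicted, average="macro", zero_division=0
)
print("feature matrix shape:", topic_features.shape)
print("precision, recall, F1:", precision, recall, f1)
```

The point of the sketch is the dimensionality: the SVM sees only `N_TOPICS` features per document rather than the full vocabulary, which is where the reported speed-up would come from. On a corpus this small the scores themselves are not meaningful.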

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3b06/3886981/825e6a10c2fd/pone.0082119.g001.jpg

Similar articles

Improving the utility of MeSH® terms using the TopicalMeSH representation.

J Biomed Inform. 2016 Jun;61:77-86. doi: 10.1016/j.jbi.2016.03.013. Epub 2016 Mar 19.

LDA filter: A Latent Dirichlet Allocation preprocess method for Weka.

PLoS One. 2020 Nov 9;15(11):e0241701. doi: 10.1371/journal.pone.0241701. eCollection 2020.

Topic2features: a novel framework to classify noisy and sparse textual data using LDA topic distributions.

PeerJ Comput Sci. 2021 Aug 11;7:e677. doi: 10.7717/peerj-cs.677. eCollection 2021.

Biomedical Text Categorization Based on Ensemble Pruning and Optimized Topic Modelling.

Comput Math Methods Med. 2018 Jul 22;2018:2497471. doi: 10.1155/2018/2497471. eCollection 2018.

Probabilistic topic modeling for the analysis and classification of genomic sequences.

BMC Bioinformatics. 2015;16 Suppl 6(Suppl 6):S2. doi: 10.1186/1471-2105-16-S6-S2. Epub 2015 Apr 17.

Text classification algorithm of tourist attractions subcategories with modified TF-IDF and Word2Vec.

PLoS One. 2024 Oct 18;19(10):e0305095. doi: 10.1371/journal.pone.0305095. eCollection 2024.

Enhancing text categorization with semantic-enriched representation and training data augmentation.

J Am Med Inform Assoc. 2006 Sep-Oct;13(5):526-35. doi: 10.1197/jamia.M2051. Epub 2006 Jun 23.

Computer-assisted lip diagnosis on Traditional Chinese Medicine using multi-class support vector machines.

BMC Complement Altern Med. 2012 Aug 16;12:127. doi: 10.1186/1472-6882-12-127.

Improving gene expression cancer molecular pattern discovery using nonnegative principal component analysis.

Genome Inform. 2008;21:200-11.

Cited by

Topic selection for text classification using ensemble topic modeling with grouping, scoring, and modeling approach.

Sci Rep. 2024 Oct 9;14(1):23516. doi: 10.1038/s41598-024-74022-2.

A computational framework for converting textual clinical diagnostic criteria into the quality data model.

J Biomed Inform. 2016 Oct;63:11-21. doi: 10.1016/j.jbi.2016.07.016. Epub 2016 Jul 19.

Locally Embedding Autoencoders: A Semi-Supervised Manifold Learning Approach of Document Representation.

PLoS One. 2016 Jan 19;11(1):e0146672. doi: 10.1371/journal.pone.0146672. eCollection 2016.

A machine learning approach to identify clinical trials involving nanodrugs and nanodevices from ClinicalTrials.gov.

PLoS One. 2014 Oct 27;9(10):e110331. doi: 10.1371/journal.pone.0110331. eCollection 2014.
