Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling.

Author information

Thielmann Anton, Weisser Christoph, Krenz Astrid, Säfken Benjamin

Affiliations

Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany.

Campus-Institut Data Science (CIDAS), Göttingen, Germany.

Publication information

J Appl Stat. 2021 Apr 27;50(3):574-591. doi: 10.1080/02664763.2021.1919063. eCollection 2023.

Abstract

Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually by humans, which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach either requires human labelling of all of the data or fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. Unsupervised one-class document classification with the integration of out-of-domain training data is achieved, and >80% of the target data is correctly classified. The proposed method thus even outperforms common machine learning classifiers and is validated on multiple data sets.

Similar articles

Imbalanced Protein Data Classification Using Ensemble FTM-SVM.
IEEE Trans Nanobioscience. 2015 Jun;14(4):350-359. doi: 10.1109/TNB.2015.2431292. Epub 2015 May 8.

Class-imbalanced classifiers for high-dimensional data.
Brief Bioinform. 2013 Jan;14(1):13-26. doi: 10.1093/bib/bbs006. Epub 2012 Mar 9.
