Thielmann Anton, Weisser Christoph, Krenz Astrid, Säfken Benjamin
Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany.
Campus-Institut Data Science (CIDAS), Göttingen, Germany.
J Appl Stat. 2021 Apr 27;50(3):574-591. doi: 10.1080/02664763.2021.1919063. eCollection 2023.
Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually, which requires expert knowledge, time and money. Depending on how imbalanced the data set is, this approach either requires human labelling of all of the data or fails to adequately recognize underrepresented categories. We propose integrating web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling into a multi-step classification rule that circumvents manual labelling. The method achieves unsupervised one-class document classification using out-of-domain training data, and more than 80% of the target data is correctly classified. The proposed method thus outperforms common machine learning classifiers and is validated on multiple data sets.
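The building blocks named in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn, a TF-IDF representation, a linear kernel and tiny hand-written corpora standing in for scraped and target documents. None of these choices are taken from the paper, and the abstract does not specify how the SVM and LDA steps are combined, so the two functions (hypothetical names) are shown side by side rather than as the authors' classification rule.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical stand-in for web-scraped, out-of-domain text about the target class.
scraped_docs = [
    "industrial robots automate welding and assembly in factories",
    "robot arms handle packaging on automated production lines",
    "factories deploy robots for painting and assembly tasks",
    "automation with robots increases production line throughput",
]

# Hypothetical unlabelled target corpus to be classified.
target_docs = [
    "the plant installed new robots on its assembly line",
    "the bakery sells fresh bread and cakes every morning",
    "welding robots reduced defects in car production",
    "the hotel offers rooms with breakfast and parking",
]


def classify_one_class(scraped, target, nu=0.1):
    """Fit a one-class SVM on scraped text only, then label target documents.

    Returns an array with +1 (resembles the scraped class) or -1 (does not).
    """
    vectorizer = TfidfVectorizer(stop_words="english")
    x_train = vectorizer.fit_transform(scraped)
    svm = OneClassSVM(kernel="linear", nu=nu).fit(x_train)
    # Target documents are projected onto the vocabulary learned from scraped text.
    return svm.predict(vectorizer.transform(target))


def topic_mixtures(docs, n_topics=2, seed=0):
    """Fit LDA on word counts and return one topic distribution per document."""
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    return lda.fit_transform(counts)


labels = classify_one_class(scraped_docs, target_docs)
mixtures = topic_mixtures(target_docs)
for doc, label, mix in zip(target_docs, labels, mixtures):
    print(label, np.round(mix, 2), doc)
```

In this sketch no labels from the target corpus are used at any point: the SVM sees only the scraped text, which mirrors the idea of circumventing manual labelling. With corpora this small the predicted labels are not meaningful, so the output should be read as a demonstration of the data flow only.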