Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling.

Authors

Thielmann Anton, Weisser Christoph, Krenz Astrid, Säfken Benjamin

Affiliations

Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany.

Campus-Institut Data Science (CIDAS), Göttingen, Germany.

Publication information

J Appl Stat. 2021 Apr 27;50(3):574-591. doi: 10.1080/02664763.2021.1919063. eCollection 2023.

DOI: 10.1080/02664763.2021.1919063
PMID: 36819086
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9930816/

Abstract

Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually by humans which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach also either requires human labelling of all of the data or it fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. Unsupervised one-class document classification with the integration of out-of-domain training data is achieved and >80% of the target data is correctly classified. The proposed method thus even outperforms common machine learning classifiers and is validated on multiple data sets.
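
The multi-step idea in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn, uses a handful of hard-coded sentences as hypothetical stand-ins for web-scraped, out-of-domain training text, and picks illustrative parameters (`nu`, a linear kernel, two topics). The abstract does not specify the authors' features, parameters, or how the one-class SVM and LDA outputs are combined, so this is not their implementation.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical stand-in for scraped, out-of-domain texts describing the class of interest.
scraped_docs = [
    "industrial robots automate welding and assembly in factories",
    "robot arms and sensors improve manufacturing automation",
    "automation with collaborative robots in production lines",
    "machine vision guides robots on the assembly line",
]

# Hypothetical unlabelled target corpus to be classified.
target_docs = [
    "our company builds robots for factory automation and assembly",
    "we bake fresh bread and pastries every morning",
    "sensors and robot arms for automated production",
    "family bakery offering cakes, bread and coffee",
]

# Step 1: fit a one-class SVM on the scraped texts only (no labels needed).
tfidf = TfidfVectorizer(stop_words="english")
X_train = tfidf.fit_transform(scraped_docs)
svm = OneClassSVM(kernel="linear", nu=0.5).fit(X_train)

# predict() returns +1 for documents resembling the training class, -1 otherwise.
labels = svm.predict(tfidf.transform(target_docs))

# Step 2: fit an LDA topic model on the target corpus (LDA expects raw term counts).
counts = CountVectorizer(stop_words="english")
C = counts.fit_transform(target_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(C)  # one topic distribution per document
dominant_topic = doc_topics.argmax(axis=1)

for doc, label, topic in zip(target_docs, labels, dominant_topic):
    print(f"svm={label:+d} topic={topic} | {doc}")
```

One possible way to combine the two signals (again an assumption, not the rule from the paper) is to keep only documents whose dominant topic is shared with most of the SVM-accepted documents.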

Similar articles

1
Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling.
J Appl Stat. 2021 Apr 27;50(3):574-591. doi: 10.1080/02664763.2021.1919063. eCollection 2023.
2
Supporting systematic reviews using LDA-based document representations.
Syst Rev. 2015 Nov 26;4:172. doi: 10.1186/s13643-015-0117-0.
3
Knowledge-Based Topic Model for Unsupervised Object Discovery and Localization.
IEEE Trans Image Process. 2018;27(1):50-63. doi: 10.1109/TIP.2017.2718667.
4
LDA filter: A Latent Dirichlet Allocation preprocess method for Weka.
PLoS One. 2020 Nov 9;15(11):e0241701. doi: 10.1371/journal.pone.0241701. eCollection 2020.
5
Identifying Medication-Related Intents From a Bidirectional Text Messaging Platform for Hypertension Management Using an Unsupervised Learning Approach: Retrospective Observational Pilot Study.
J Med Internet Res. 2022 Jun 29;24(6):e36151. doi: 10.2196/36151.
6
Learning machines and sleeping brains: Automatic sleep stage classification using decision-tree multi-class support vector machines.
J Neurosci Methods. 2015 Jul 30;250:94-105. doi: 10.1016/j.jneumeth.2015.01.022. Epub 2015 Jan 25.
7
Imbalanced Protein Data Classification Using Ensemble FTM-SVM.
IEEE Trans Nanobioscience. 2015 Jun;14(4):350-359. doi: 10.1109/TNB.2015.2431292. Epub 2015 May 8.
8
Inverse free reduced universum twin support vector machine for imbalanced data classification.
Neural Netw. 2023 Jan;157:125-135. doi: 10.1016/j.neunet.2022.10.003. Epub 2022 Oct 15.
9
SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition.
BMC Bioinformatics. 2007 May 22;8 Suppl 4(Suppl 4):S2. doi: 10.1186/1471-2105-8-S4-S2.
10
Class-imbalanced classifiers for high-dimensional data.
Brief Bioinform. 2013 Jan;14(1):13-26. doi: 10.1093/bib/bbs006. Epub 2012 Mar 9.

Cited by

1
A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information.
Sci Rep. 2024 Dec 30;14(1):32051. doi: 10.1038/s41598-024-83743-3.
2
How perceived sustainability influences consumers' clothing preferences.
Sci Rep. 2024 Nov 19;14(1):28672. doi: 10.1038/s41598-024-80279-4.
3
Editorial to the special issue: Statistical Approaches for Big Data and Machine Learning.
J Appl Stat. 2023 Feb 7;50(3):451-455. doi: 10.1080/02664763.2023.2162471. eCollection 2023.
4
An iterative topic model filtering framework for short and noisy user-generated data: analyzing conspiracy theories on twitter.
Int J Data Sci Anal. 2022 May 6:1-21. doi: 10.1007/s41060-022-00321-4.
5
Clinical Text Data Categorization and Feature Extraction Using Medical-Fissure Algorithm and Neg-Seq Algorithm.
Comput Intell Neurosci. 2022 Mar 7;2022:5759521. doi: 10.1155/2022/5759521. eCollection 2022.
