Improving fraud detection with semi-supervised topic modeling and keyword integration.

Authors

Sánchez Marco, Urquiza Luis

Affiliations

Departamento de Informática y Ciencias de la Computación, Escuela Politécnica Nacional, Quito, Pichincha, Ecuador.

Departamento de Electrónica, Telecomunicaciones y Redes de Información, Escuela Politécnica Nacional, Quito, Pichincha, Ecuador.

Publication information

PeerJ Comput Sci. 2024 Jan 15;10:e1733. doi: 10.7717/peerj-cs.1733. eCollection 2024.

DOI:10.7717/peerj-cs.1733
PMID:38259882
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10803081/
Abstract

Fraud detection through auditors' manual review of accounting and financial records has traditionally relied on human experience and intuition. Replicating this task with technological tools, however, has proved challenging for information security researchers. Natural language processing techniques such as topic modeling have been explored to extract information from large sets of documents and to categorize them. Topic models such as latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF) have recently gained popularity for discovering thematic structures in text collections. However, unsupervised topic modeling does not always produce the best results for specific tasks such as fraud detection. In the present work, we therefore propose semi-supervised topic modeling, which incorporates domain knowledge through keywords in order to learn latent topics related to fraud. By leveraging relevant keywords, the proposed approach aims to identify patterns related to the vertices of the fraud triangle theory, providing more consistent and interpretable results for fraud detection. The model was evaluated by training it on several datasets and testing it on a separate dataset that was not used in training. On average, the model performed well, with a 7% improvement over previous work. Overall, the study emphasizes the importance of analyzing fraud behaviors in greater depth and of proposing strategies to identify them proactively.
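
The keyword-guided idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a collapsed Gibbs sampler for LDA in which seed keywords receive extra pseudo-counts in the topic-word prior of their topic. The seed words, the toy documents, and the names `seeded_lda`, `boost`, `alpha`, and `beta` are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical seed keywords, one list per fraud-triangle vertex (not from the paper).
SEEDS = {
    0: ["debt", "deadline", "bonus"],        # pressure
    1: ["access", "override", "unchecked"],  # opportunity
    2: ["deserve", "borrow", "everyone"],    # rationalization
}

def seeded_lda(docs, seeds, n_iter=200, alpha=0.1, beta=0.01, boost=5.0, seed=0):
    """Collapsed Gibbs sampling for LDA with a keyword-boosted topic-word prior."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    w2i = {w: i for i, w in enumerate(vocab)}
    K, V = len(seeds), len(vocab)

    # Asymmetric prior: every word gets beta, seed words get beta + boost in their topic.
    prior = [[beta] * V for _ in range(K)]
    for k, words in seeds.items():
        for w in words:
            if w in w2i:
                prior[k][w2i[w]] += boost
    prior_sum = [sum(row) for row in prior]

    ndk = [[0] * K for _ in docs]      # document-topic counts
    nkw = [[0] * V for _ in range(K)]  # topic-word counts
    nk = [0] * K                       # tokens assigned to each topic
    z = []                             # topic assignment of every token
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(K)       # random initial assignment
            assignments.append(k)
            ndk[d][k] += 1
            nkw[k][w2i[w]] += 1
            nk[k] += 1
        z.append(assignments)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = w2i[w], z[d][i]
                # Remove the token, then resample its topic from the conditional.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][v] + prior[t][v]) / (nk[t] + prior_sum[t])
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1

    # Smoothed document-topic proportions; each row sums to 1.
    theta = [
        [(ndk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    return theta, vocab

# Toy tokenized documents, invented for this example.
docs = [
    "debt deadline bonus target debt".split(),
    "access override unchecked approval access".split(),
    "deserve borrow everyone repay later".split(),
]
theta, vocab = seeded_lda(docs, SEEDS)
for row in theta:
    print([round(p, 3) for p in row])
```

Each printed row gives one document's estimated proportions over the three seeded topics. The `boost` value controls how strongly the keywords steer a topic: with `boost=0` the sketch reduces to ordinary unsupervised LDA with a symmetric prior.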


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6f11/10803081/cb22a030bec5/peerj-cs-10-1733-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6f11/10803081/d1f8e77e8200/peerj-cs-10-1733-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6f11/10803081/fc6d101b31f7/peerj-cs-10-1733-g002.jpg
