Sánchez Marco, Urquiza Luis
Departamento de Informática y Ciencias de la Computación, Escuela Politécnica Nacional, Quito, Pichincha, Ecuador.
Departamento de Electrónica, Telecomunicaciones y Redes de Información, Escuela Politécnica Nacional, Quito, Pichincha, Ecuador.
PeerJ Comput Sci. 2024 Jan 15;10:e1733. doi: 10.7717/peerj-cs.1733. eCollection 2024.
Fraud detection through auditors' manual review of accounting and financial records has traditionally relied on human experience and intuition. Replicating this task with technological tools has been a challenge for information security researchers. Natural language processing techniques, such as topic modeling, have been explored to extract information from and categorize large sets of documents. Topic modeling methods, such as latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF), have recently gained popularity for discovering thematic structures in text collections. However, unsupervised topic modeling does not always produce the best results for specific tasks such as fraud detection. In the present work, we therefore propose semi-supervised topic modeling, which incorporates domain-specific knowledge through keywords in order to learn latent topics related to fraud. By leveraging relevant keywords, the proposed approach aims to identify patterns related to the vertices of the fraud triangle theory, providing more consistent and interpretable results for fraud detection. The model was evaluated by training it on several datasets and testing it on a separate dataset that was not used in training. On average, the model performed well, with a 7% improvement in performance over a previous study. Overall, the study emphasizes the importance of analyzing fraudulent behavior in greater depth and of proposing strategies to identify it proactively.
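To illustrate how keywords can steer a topic model toward the fraud triangle, the following is a minimal sketch of seeded LDA trained with collapsed Gibbs sampling. It is not the authors' implementation. The function name `seeded_lda`, the `boost` parameter, the toy documents, and the seed keywords are all hypothetical and chosen only for illustration. The three topics follow the commonly cited vertices of the fraud triangle: pressure, opportunity, and rationalization. Semi-supervision enters through the topic-word prior: each seed word receives extra pseudo-counts in its own topic.

```python
import random

# Hypothetical seed keywords, one set per fraud-triangle vertex (illustration only).
SEEDS = [
    {"debt", "deadline", "bonus"},        # pressure
    {"access", "override", "unchecked"},  # opportunity
    {"deserve", "borrow", "temporary"},   # rationalization
]

# Toy tokenized documents (illustration only).
DOCS = [
    ["debt", "deadline", "bonus", "target", "debt"],
    ["access", "override", "system", "unchecked", "access"],
    ["deserve", "borrow", "temporary", "repay", "borrow"],
    ["deadline", "override", "borrow", "system", "target"],
]


def seeded_lda(docs, seeds, n_iter=200, alpha=0.1, beta=0.01, boost=5.0, rng_seed=0):
    """Collapsed Gibbs sampling for LDA with seed-boosted topic-word priors.

    docs:  list of token lists.
    seeds: list of sets of seed words, one set per topic.
    Returns (theta, phi, vocab): document-topic and topic-word distributions.
    """
    rng = random.Random(rng_seed)
    vocab = sorted({w for d in docs for w in d} | {w for s in seeds for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    K, V = len(seeds), len(vocab)

    # Asymmetric prior: a seed word gets `boost` extra pseudo-counts in its topic.
    prior = [[beta + (boost if w in seeds[k] else 0.0) for w in vocab] for k in range(K)]
    prior_sum = [sum(row) for row in prior]

    ndk = [[0] * K for _ in docs]      # topic counts per document
    nkw = [[0] * V for _ in range(K)]  # word counts per topic
    nk = [0] * K                       # total tokens per topic
    z = []                             # topic assignment of every token

    # Random initial assignment.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][idx[w]] += 1
            nk[k] += 1
        z.append(zd)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = idx[w], z[d][i]
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][v] + prior[t][v]) / (nk[t] + prior_sum[t])
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1

    theta = [[(ndk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    phi = [[(nkw[k][v] + prior[k][v]) / (nk[k] + prior_sum[k]) for v in range(V)]
           for k in range(K)]
    return theta, phi, vocab


if __name__ == "__main__":
    theta, phi, vocab = seeded_lda(DOCS, SEEDS)
    for d, row in enumerate(theta):
        print(d, [round(p, 2) for p in row])
```

Each row of `theta` gives a document's estimated mix of the three seeded topics. In a real setting the seed lists would come from domain experts, and the documents would be preprocessed text rather than hand-picked tokens.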