Optimizing forensic file classification: enhancing SFCS with β hyperparameter tuning.

Authors

Joseph D Paul, Perumal Viswanathan

Affiliations

School of Computer Science Engineering and Information Systems, Vellore Institute of Technology University, Vellore, Tamilnadu, India.

Department of IoT, School of Computer Science and Engineering, Vellore Institute of Technology University, Vellore, Tamilnadu, India.

Publication details

PeerJ Comput Sci. 2025 Mar 5;11:e2608. doi: 10.7717/peerj-cs.2608. eCollection 2025.

DOI:10.7717/peerj-cs.2608
PMID:40134876
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11935782/
Abstract

In forensic topic modelling, the α parameter controls the distribution of topics in documents. However, low, high, or incorrect values of α lead to topic sparsity, model overfitting, and suboptimal topic distribution. To control the word distribution across topics, the β parameter is introduced. However, low, high, or inappropriate β values lead to sparse distributions, disjointed topics, and an abundance of highly probable words. The β parameter, in conjunction with seed-guided words based on Term Frequency and Inverse Document Frequency, is introduced to address these issues. Nevertheless, the data often suffers from skewness or noise due to frequent co-occurrences of unrelated polysemic word pairs generated using Pointwise Mutual Information. By integrating α, β, and β into file classification systems, classification models converge to local optima with O(n log n · |V|) time complexity. To combat these challenges, this research proposes the SDOT Forensic Classification System (SFCS) with a functional parameter β that identifies seed words by evaluating the semantic and contextual similarity of word vectors. As a result, the topic distribution (Θ) is compelled to model the curated seed words within the distribution, generating pertinent topics. Incorporating β into SFCS allowed the proposed model to remove 278k irrelevant files from the and identify 5.6k suspicious files by extracting 700 blacklisted keywords. Furthermore, this research implemented hyperparameter optimization and hyperplane maximization, resulting in a file classification accuracy of 94.6%, with 94.4% precision and 96.8% recall, within O(n log n) complexity.
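
The abstract says that seed words are identified by evaluating the semantic and contextual similarity of word vectors, but it does not give the scoring rule. As a minimal sketch only, assuming cosine similarity against a single anchor keyword and a fixed cutoff (neither is confirmed by the abstract), seed selection could look like the following. The function names, the toy vectors, and the 0.8 threshold are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_seed_words(vectors, anchor, threshold):
    """Return words whose similarity to the anchor meets the threshold,
    ordered from most to least similar. The anchor itself is excluded."""
    anchor_vec = vectors[anchor]
    scored = [(w, cosine(v, anchor_vec)) for w, v in vectors.items() if w != anchor]
    kept = [(w, s) for w, s in scored if s >= threshold]
    return [w for w, _ in sorted(kept, key=lambda pair: pair[1], reverse=True)]


# Toy 3-dimensional vectors, invented for illustration only (assumption).
TOY_VECTORS = {
    "malware": [0.9, 0.1, 0.0],
    "trojan": [0.8, 0.2, 0.1],
    "exploit": [0.7, 0.3, 0.0],
    "invoice": [0.1, 0.9, 0.2],
    "holiday": [0.0, 0.2, 0.9],
}

# Hypothetical cutoff of 0.8; the abstract does not state a threshold.
seeds = select_seed_words(TOY_VECTORS, "malware", 0.8)
print(seeds)
```

With these toy vectors, only the words pointing in nearly the same direction as the anchor pass the cutoff, which is the sense in which a similarity-based filter can keep unrelated but frequently co-occurring words out of the seed set.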

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ce7/11935782/c920bac062d3/peerj-cs-11-2608-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ce7/11935782/e83467ba71b4/peerj-cs-11-2608-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ce7/11935782/72d83dedaf36/peerj-cs-11-2608-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ce7/11935782/9320b37dd442/peerj-cs-11-2608-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ce7/11935782/0d6c58e396c7/peerj-cs-11-2608-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ce7/11935782/96cfbd1aaa2e/peerj-cs-11-2608-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ce7/11935782/f489fbed83b4/peerj-cs-11-2608-g007.jpg
