Joseph D Paul, Perumal Viswanathan
School of Computer Science Engineering and Information Systems, Vellore Institute of Technology University, Vellore, Tamilnadu, India.
Department of IoT, School of Computer Science and Engineering, Vellore Institute of Technology University, Vellore, Tamilnadu, India.
PeerJ Comput Sci. 2025 Mar 5;11:e2608. doi: 10.7717/peerj-cs.2608. eCollection 2025.
In forensic topic modelling, the α parameter controls the distribution of topics across documents. Values of α that are too low, too high, or otherwise poorly chosen lead to topic sparsity, model overfitting, and suboptimal topic distributions. The β parameter is introduced to control the distribution of words across topics, but low, high, or inappropriate β values lead to sparse distributions, disjointed topics, and an overabundance of highly probable words. To address these issues, β has been combined with seed-guided words selected by Term Frequency and Inverse Document Frequency (TF-IDF). Nevertheless, the data often suffer from skewness or noise caused by frequent co-occurrences of unrelated polysemous word pairs generated using Pointwise Mutual Information (PMI). When α, β, and β are integrated into file classification systems, the classification models converge to local optima with O(n log n * |V|) time complexity. To combat these challenges, this research proposes the SDOT Forensic Classification System (SFCS) with a functional parameter β that identifies seed words by evaluating the semantic and contextual similarity of word vectors. As a result, the topic distribution (Θ) is compelled to model the curated seed words, generating pertinent topics. Incorporating β into SFCS allowed the proposed model to remove 278 k irrelevant files and identify 5.6 k suspicious files by extracting 700 blacklisted keywords. Furthermore, this research implemented hyperparameter optimization and hyperplane maximization, resulting in a file classification accuracy of 94.6%, with 94.4% precision and 96.8% recall, within O(n log n) complexity.
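The idea of identifying seed words by the similarity of word vectors can be illustrated with a minimal sketch. This is not the SFCS implementation: the function names (cosine, select_seed_words), the toy three-dimensional vectors, the anchor word, and the 0.9 threshold are all assumptions made for illustration, and plain cosine similarity stands in for whatever semantic and contextual similarity measure the authors actually use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_seed_words(vectors, anchor, threshold=0.9):
    """Return, in alphabetical order, the words whose vectors are close to the anchor's.

    `vectors` maps word -> embedding; `anchor` is a hypothetical known
    suspicious term; `threshold` is an assumed similarity cut-off.
    """
    anchor_vec = vectors[anchor]
    return sorted(
        word
        for word, vec in vectors.items()
        if word != anchor and cosine(anchor_vec, vec) >= threshold
    )


# Toy embeddings (made up for illustration, not learned from any corpus).
toy_vectors = {
    "malware": [0.9, 0.1, 0.0],
    "trojan": [0.8, 0.2, 0.1],
    "ransomware": [0.85, 0.15, 0.05],
    "picnic": [0.0, 0.1, 0.9],
}

seeds = select_seed_words(toy_vectors, "malware", threshold=0.9)
print(seeds)
```

In this toy setting the unrelated word "picnic" falls well below the threshold, so only the two terms that point in nearly the same direction as the anchor are kept as candidate seed words. A seed-guided topic model could then bias its word-topic prior toward such words.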