
Web content topic modeling using LDA and HTML tags.

Author information

Altarturi Hamza H M, Saadoon Muntadher, Anuar Nor Badrul

Affiliations

Department of Computer System and Technology, Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Kuala Lumpur, Malaysia.

Department of Software Engineering, Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Kuala Lumpur, Malaysia.

Publication information

PeerJ Comput Sci. 2023 Jul 11;9:e1459. doi: 10.7717/peerj-cs.1459. eCollection 2023.

Abstract

An immense volume of digital documents exists online and offline, with content that can offer useful information and insights. Topic modeling enhances the analysis and understanding of such documents by discovering latent semantic structures, or topics, within a set of digital textual documents. Internet of Things, blockchain, recommender system, and search engine optimization applications use topic modeling to handle data mining tasks such as classification and clustering. The usefulness of a topic model depends on the quality of the term patterns and topics it produces. Topic coherence is the standard metric for measuring the quality of topic models. Previous studies built topic models to work on conventional documents in general; these models underperform when applied to web content data because conventional and HTML documents differ in structure. Neglecting the unique structure of web content leads to missing otherwise coherent topics and, therefore, to low topic quality. This study proposes an innovative topic model that learns coherent topics in web content data. We present the HTML Topic Model (HTM), a web content topic model that takes HTML tags into consideration to understand the structure of web pages. We conducted two series of experiments to demonstrate the limitations of existing topic models and to compare the topic coherence of HTM against the widely used Latent Dirichlet Allocation (LDA) model and its variants, namely the Correlated Topic Model, Dirichlet Multinomial Regression, the Hierarchical Dirichlet Process, Hierarchical Latent Dirichlet Allocation, the pseudo-document-based Topic Model, and Supervised Latent Dirichlet Allocation. The first experiment demonstrates the limitations of existing topic models when applied to web content data and, therefore, the need for a web content topic model: when applied to web data, overall performance was on average five times lower, and in some cases approximately 20 times lower, than when applied to conventional data. The second experiment evaluates the effectiveness of HTM in discovering topics and term patterns in web content data. HTM achieved an overall 35% improvement in topic coherence compared to LDA.
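The abstract does not say how HTM consumes HTML tags, so the following is only a minimal sketch of one plausible preprocessing step: grouping a page's words by the nearest enclosing tag, so that a downstream topic model could treat title, heading, and body text differently. It is not the authors' method. The names `TRACKED_TAGS`, `TagTextCollector`, and `tokens_by_tag`, and the choice of tags, are assumptions made for illustration.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical set of tags to track; the abstract does not list the tags HTM uses.
TRACKED_TAGS = {"title", "h1", "h2", "p", "a"}


class TagTextCollector(HTMLParser):
    """Collect lowercase word tokens, grouped by the nearest enclosing tracked tag."""

    def __init__(self):
        super().__init__()
        self._open_tags = []  # stack of currently open tracked tags
        self.tokens = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED_TAGS:
            self._open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching tag so unclosed inner tags do not leak.
        if tag in self._open_tags:
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Text outside any tracked tag is ignored in this sketch.
        if self._open_tags:
            words = re.findall(r"[a-z]+", data.lower())
            self.tokens[self._open_tags[-1]].extend(words)


def tokens_by_tag(html):
    """Return a dict mapping each tracked tag name to the words found inside it."""
    collector = TagTextCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.tokens)
```

Calling `tokens_by_tag` on a page yields one token list per tag. A conventional model such as LDA would see the same words flattened into a single bag, which discards the structural signal the abstract argues is important for web content.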


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1701/10403181/e29f64355e84/peerj-cs-09-1459-g001.jpg
