Web content topic modeling using LDA and HTML tags.

Authors

Altarturi Hamza H M, Saadoon Muntadher, Anuar Nor Badrul

Affiliations

Department of Computer System and Technology, Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Kuala Lumpur, Malaysia.

Department of Software Engineering, Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur, Kuala Lumpur, Malaysia.

Publication details

PeerJ Comput Sci. 2023 Jul 11;9:e1459. doi: 10.7717/peerj-cs.1459. eCollection 2023.

DOI:10.7717/peerj-cs.1459
PMID:37547394
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10403181/
Abstract

An immense volume of digital documents exists online and offline, with content that can offer useful information and insights. Topic modeling enhances the analysis and understanding of such documents by discovering latent semantic structures, or topics, within a set of digital text documents. Internet of Things, blockchain, recommender system, and search engine optimization applications use topic modeling to handle data mining tasks such as classification and clustering. The usefulness of a topic model depends on the quality of the term patterns and topics it produces, and topic coherence is the standard metric for measuring that quality. Previous studies generally build topic models for conventional documents; these models underperform when applied to web content data because conventional and HTML documents differ in structure. Neglecting the unique structure of web content leads to missing otherwise coherent topics and, therefore, to low topic quality. This study proposes a topic model that learns coherent topics in web content data. We present the HTML Topic Model (HTM), a web content topic model that takes HTML tags into consideration to understand the structure of web pages. We conducted two series of experiments to demonstrate the limitations of existing topic models and to examine the topic coherence of the HTM against the widely used Latent Dirichlet Allocation (LDA) model and its variants, namely the Correlated Topic Model, Dirichlet Multinomial Regression, the Hierarchical Dirichlet Process, Hierarchical Latent Dirichlet Allocation, the pseudo-document based Topic Model, and Supervised Latent Dirichlet Allocation. The first experiment demonstrates the limitations of the existing topic models when applied to web content data and, therefore, the need for a web content topic model: on web data, overall performance was on average five times lower and, in some cases, up to approximately 20 times lower than on conventional data. The second experiment evaluates the effectiveness of the HTM in discovering topics and term patterns in web content data. The HTM achieved an overall 35% improvement in topic coherence compared to LDA.
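
The abstract does not spell out how the HTM uses tags or which coherence measure it reports, so the following is only a minimal sketch of the two ideas under stated assumptions, not the authors' method. It counts the terms of a page with a weight chosen by the enclosing HTML tag (`TAG_WEIGHTS`, `TagAwareExtractor`, and `tag_weighted_counts` are hypothetical names and values), and it scores a topic's top words with UMass coherence, one common coherence measure that may differ from the one used in the paper.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights per tag: an assumption for illustration, not from the paper.
TAG_WEIGHTS = {"title": 3, "h1": 3, "h2": 2, "p": 1}
SKIP_TAGS = {"script", "style"}  # text that is not page content
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class TagAwareExtractor(HTMLParser):
    """Count lowercase words, weighted by the innermost enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags
        self.counts = Counter()  # term -> weighted count

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.stack[-1] if self.stack else ""
        if tag in SKIP_TAGS:
            return
        weight = TAG_WEIGHTS.get(tag, 1)  # unknown tags count once
        for word in data.lower().split():
            word = word.strip(".,;:!?()\"'")
            if word.isalpha():
                self.counts[word] += weight


def tag_weighted_counts(html):
    """Return a Counter of tag-weighted term counts for one HTML page."""
    parser = TagAwareExtractor()
    parser.feed(html)
    parser.close()
    return parser.counts


def umass_coherence(top_words, documents):
    """UMass coherence: sum over word pairs of log((D(w_m, w_l) + 1) / D(w_l)).

    top_words should be ordered from most to least probable, and every word
    must occur in at least one document (otherwise D(w_l) would be zero).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score


if __name__ == "__main__":
    page = "<title>Topic models</title><h1>Topic coherence</h1><p>A model finds topics.</p>"
    print(tag_weighted_counts(page).most_common(3))
    docs = [["topic", "model"], ["topic", "coherence"], ["web", "page"]]
    print(umass_coherence(["topic", "model"], docs))
    print(umass_coherence(["topic", "web"], docs))
```

In this sketch the weighted counts could stand in for raw term counts in a bag-of-words input to a model such as LDA, so that words in titles and headings carry more mass. UMass scores are at most zero when every word pair co-occurs in fewer documents than the conditioning word, and scores closer to zero indicate top words that co-occur more often.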

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1701/10403181/e29f64355e84/peerj-cs-09-1459-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1701/10403181/107b83c0108e/peerj-cs-09-1459-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1701/10403181/e8b3c9c7a418/peerj-cs-09-1459-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1701/10403181/c6a74396ba3c/peerj-cs-09-1459-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1701/10403181/9650d20a44ad/peerj-cs-09-1459-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1701/10403181/5f26eb9e2de3/peerj-cs-09-1459-g006.jpg
