Churchill Rob, Singh Lisa
Department of Computer Science, Georgetown University, 3700 O Street, Washington, D.C., 20007 USA.
Knowl Inf Syst. 2023;65(5):2159-2186. doi: 10.1007/s10115-022-01805-2. Epub 2023 Jan 16.
Domain-specific document collections, such as data sets about the COVID-19 pandemic, politics, and sports, have become more common as platforms grow and develop better ways to connect people whose interests align. These data sets come from many different sources, ranging from traditional ones like open-ended surveys and newspaper articles to any of dozens of online social media platforms. Most topic models can generate topics from one or more of these data sources, but models rarely work well across all types of documents. The main problem many models face is the varying noise level inherent in different types of documents. We propose topic-noise models, a new type of topic model that jointly models topic and noise distributions to produce a more accurate, flexible representation of documents regardless of their origin and quality. Our topic-noise model, the Topic Noise Discriminator (TND), approximates topic and noise distributions side by side with the help of word embedding spaces. While topic-noise models are most important for the short, noisy documents that often originate on social media platforms, TND can also be used with more traditional data sources like newspapers. TND itself generates a noise distribution that, when ensembled with other generative topic models, produces more coherent and diverse topic sets. We show the effectiveness of this approach using Latent Dirichlet Allocation (LDA), and demonstrate the ability of TND to improve the quality of LDA topics in noisy document collections. Finally, researchers are beginning to generate topics from multiple sources and finding that they need a way to identify a core topic set based on text from different sources. We propose cross-source topic blending (CSTB), an approach that maps topic sets to a multi-partite graph, with one part per source, and identifies core topics that blend topics across sources by finding subgraphs with certain linkage properties.
We demonstrate the effectiveness of topic-noise models and CSTB empirically on large real-world data sets from multiple domains and data sources.
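To make the ensembling idea concrete: once TND has produced a noise distribution, the topics of another generative model such as LDA can be cleaned by dropping words the noise distribution assigns high weight to. The following is a minimal sketch of that post-filtering step; the function name `filter_topics`, the dictionary representation of noise weights, and the `threshold` value are illustrative assumptions, not the authors' actual interface or parameters.

```python
# Hypothetical sketch: filtering a TND-style noise distribution out of
# LDA topics. Names and the 0.5 threshold are assumptions for illustration.

def filter_topics(lda_topics, noise_weights, threshold=0.5):
    """Drop words whose noise weight exceeds `threshold` from each topic.

    lda_topics: list of topics, each a list of top words from LDA.
    noise_weights: dict mapping a word to its (normalized) noise weight.
    """
    filtered = []
    for topic in lda_topics:
        kept = [w for w in topic if noise_weights.get(w, 0.0) <= threshold]
        filtered.append(kept)
    return filtered

lda_topics = [["vaccine", "rt", "covid", "dose"], ["election", "lol", "ballot"]]
noise_weights = {"rt": 0.9, "lol": 0.8}  # high weight = likely noise
print(filter_topics(lda_topics, noise_weights))
# [['vaccine', 'covid', 'dose'], ['election', 'ballot']]
```

In this toy example, platform-specific noise terms like "rt" are removed while topical words survive, which is the kind of coherence gain the abstract reports for TND-filtered LDA topics.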
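The CSTB idea described above can be sketched as a small graph computation: treat each source's topic set as one part of a multi-partite graph, link topics from different sources that share enough top words, and report connected components that span several sources as blended core topics. This is a simplified sketch under assumed parameters; the function name `blend_core_topics`, the word-overlap linkage rule, and the `min_overlap`/`min_sources` thresholds are illustrative stand-ins for the paper's actual linkage properties.

```python
# Hypothetical sketch of cross-source topic blending (CSTB).
# Linkage rule and thresholds are assumptions, not the paper's definition.
from itertools import combinations

def blend_core_topics(source_topics, min_overlap=2, min_sources=2):
    """source_topics: dict mapping a source name to its list of topics
    (each topic a list of words). Builds a multi-partite graph whose
    nodes are topics, with edges only between topics from different
    sources that share at least `min_overlap` words, then returns the
    blended word sets of components spanning >= `min_sources` sources."""
    nodes = [(src, set(topic))
             for src, topics in source_topics.items() for topic in topics]
    # Union-find over topic nodes to collect connected components.
    parent = list(range(len(nodes)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in combinations(range(len(nodes)), 2):
        cross_source = nodes[a][0] != nodes[b][0]
        if cross_source and len(nodes[a][1] & nodes[b][1]) >= min_overlap:
            parent[find(a)] = find(b)
    components = {}
    for idx in range(len(nodes)):
        components.setdefault(find(idx), []).append(idx)
    core = []
    for members in components.values():
        sources = {nodes[i][0] for i in members}
        if len(sources) >= min_sources:  # topic appears across sources
            core.append(set().union(*(nodes[i][1] for i in members)))
    return core

topics_by_source = {
    "twitter": [["covid", "vaccine", "dose"], ["nba", "game", "finals"]],
    "news": [["vaccine", "dose", "pfizer"], ["senate", "bill", "vote"]],
}
core = blend_core_topics(topics_by_source)  # one blended vaccine topic
```

Here the Twitter and news vaccine topics share two words, so they merge into a single blended core topic, while the single-source sports and legislative topics are excluded, mirroring how CSTB surfaces topics supported by multiple sources.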