Leveraging word embeddings to enhance co-occurrence networks: A statistical analysis.

Author information

Amancio Diego R, Machicao Jeaneth, Quispe Laura V C

Affiliations

Institute of Mathematics and Computer Science - USP, Avenida Trabalhador São-carlense, no 400, CEP 13566-590, São Carlos, SP, Brazil.

Escola Politécnica da Universidade de São Paulo (EPUSP), São Paulo, Brazil.

Publication information

PLoS One. 2025 Jul 11;20(7):e0327421. doi: 10.1371/journal.pone.0327421. eCollection 2025.

Abstract

Recent studies have explored the addition of virtual edges to word co-occurrence networks using word embeddings to enhance graph representations, particularly for short texts. While these enriched networks have demonstrated some success, the impact of incorporating semantic edges into traditional co-occurrence networks remains uncertain. In this study, we investigate two key statistical properties of text-based network models. First, we assess whether network metrics can effectively distinguish between meaningless and meaningful texts. Second, we analyze whether these metrics are more sensitive to syntactic or semantic aspects of the text. Our results show that incorporating virtual edges can have both positive and negative effects, depending on the specific network metric. For instance, the informativeness of the average shortest path and closeness centrality improves in short texts, while the clustering coefficient's informativeness decreases as more virtual edges are added. Additionally, we found that including stopwords affects the statistical properties of enriched networks. Our results, derived from enriching networks with FastText embeddings, offer a guideline for identifying the most appropriate network metrics for specific applications, based on typical text length and the nature of the task.
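
To make the idea of enriching a co-occurrence network concrete, the following is a minimal sketch, not the authors' implementation. It links adjacent words, adds virtual edges between words whose embeddings are similar, and compares two of the metrics named in the abstract before and after enrichment. Everything specific here is an assumption made for illustration: the toy sentence, the hand-made 2-D vectors standing in for FastText embeddings, the window of one word, the cosine threshold of 0.95, and all function names.

```python
from collections import deque
from itertools import combinations
import math


def cooccurrence_edges(tokens, window=1):
    """Link each word to the words that follow it within `window` positions."""
    edges = set()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if v != w:
                edges.add(frozenset((w, v)))
    return edges


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def virtual_edges(vectors, existing, threshold):
    """Edges between not-yet-linked words whose similarity reaches `threshold`."""
    new = set()
    for u, v in combinations(sorted(vectors), 2):
        edge = frozenset((u, v))
        if edge not in existing and cosine(vectors[u], vectors[v]) >= threshold:
            new.add(edge)
    return new


def adjacency(nodes, edges):
    adj = {n: set() for n in nodes}
    for edge in edges:
        u, v = tuple(edge)
        adj[u].add(v)
        adj[v].add(u)
    return adj


def average_shortest_path(adj):
    """Mean breadth-first-search distance over all reachable ordered pairs."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    coeffs = []
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)


# Toy input (assumption): a six-token text and hand-made 2-D "embeddings".
tokens = ["cat", "eats", "fish", "dog", "eats", "meat"]
vectors = {
    "cat": (1.0, 0.1), "dog": (0.9, 0.2),
    "fish": (0.1, 1.0), "meat": (0.2, 0.9),
    "eats": (0.7, 0.7),
}

base_edges = cooccurrence_edges(tokens, window=1)
extra_edges = virtual_edges(vectors, base_edges, threshold=0.95)

base_adj = adjacency(vectors, base_edges)
enriched_adj = adjacency(vectors, base_edges | extra_edges)

print("virtual edges:", sorted(tuple(sorted(e)) for e in extra_edges))
print("avg shortest path:", average_shortest_path(base_adj),
      "->", average_shortest_path(enriched_adj))
print("avg clustering:", average_clustering(base_adj),
      "->", average_clustering(enriched_adj))
```

With real data, the `vectors` dictionary would be filled from a pretrained embedding model, and the threshold would control how many virtual edges are added. The numbers this toy example prints say nothing about the informativeness results reported in the abstract; they only show how the metrics are computed on the two versions of the graph.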

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b426/12250493/b687acceadc4/pone.0327421.g001.jpg
