Estimating Sentence-like Structure in Synthetic Languages Using Information Topology.

Authors

Back Andrew D, Wiles Janet

Affiliation

School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD 4072, Australia.

Publication

Entropy (Basel). 2022 Jun 22;24(7):859. doi: 10.3390/e24070859.

DOI:10.3390/e24070859
PMID:35885083
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9317616/
Abstract

Estimating sentence-like units and sentence boundaries in human language is an important task in the context of natural language understanding. While this topic has been considered using a range of techniques, including rule-based approaches and supervised and unsupervised algorithms, a common aspect of these methods is that they inherently rely on a priori knowledge of human language in one form or another. Recently we have been exploring synthetic languages based on the concept of modeling behaviors using emergent languages. These synthetic languages are characterized by a small alphabet and limited vocabulary and grammatical structure. A particular challenge for synthetic languages is that there is generally no a priori language model available, which limits the use of many natural language processing methods. In this paper, we are interested in exploring how it may be possible to discover natural 'chunks' in synthetic language sequences in terms of sentence-like units. The problem is how to do this with no linguistic or semantic language model. Our approach is to consider the problem from the perspective of information theory. We extend the basis of information geometry and propose a new concept, which we term information topology, to model the incremental flow of information in natural sequences. We introduce an information topology view of the incremental information and incremental tangent angle of the Wasserstein-1 distance of the probabilistic symbolic language input. It is not suggested as a fully viable alternative for sentence boundary detection per se; rather, it provides a new conceptual method for estimating the structure and natural limits of information flow in language sequences without any semantic knowledge. We consider relevant existing performance metrics such as the F-measure and indicate their limitations, leading to the introduction of a new information-theoretic global performance metric based on modeled distributions. Although the methodology is not proposed for human language sentence detection, we provide some examples using human language corpora where potentially useful results are shown. The proposed model shows potential advantages for overcoming difficulties due to the disambiguation of complex language and potential improvements for human language methods.
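
The abstract names two standard ingredients, the Wasserstein-1 distance between symbol distributions and the F-measure for boundary detection. The sketch below illustrates both in plain Python. It is not the authors' information topology method: the incremental tangent angle is not covered, and the choices made here are assumptions for illustration only (symbols are placed on the integer points 0, 1, ..., k-1 in alphabet order, distributions are running empirical frequencies, and the function names `wasserstein1`, `incremental_w1`, and `f_measure` are hypothetical).

```python
from itertools import accumulate


def wasserstein1(p, q):
    """Wasserstein-1 distance between two distributions on the points 0..k-1.

    For one-dimensional distributions on a unit-spaced grid, W1 equals the
    sum of absolute differences between the two cumulative distributions.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same support size")
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def incremental_w1(sequence, alphabet):
    """W1 distance between the running symbol distribution before and after
    each new symbol (assumption: symbols are ordered as in `alphabet`)."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    counts = [0] * len(alphabet)
    previous = None
    deltas = []
    for n, symbol in enumerate(sequence, start=1):
        counts[index[symbol]] += 1
        current = [c / n for c in counts]  # empirical distribution so far
        if previous is not None:
            deltas.append(wasserstein1(previous, current))
        previous = current
    return deltas


def f_measure(predicted, reference):
    """Balanced F-measure (F1) over sets of boundary positions."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy synthetic sequence over a three-symbol alphabet (made-up data).
    print(incremental_w1("aabcaabc", "abc"))
    # Hypothetical predicted vs. reference boundary positions.
    print(f_measure({4, 8}, {4, 7}))
```

Under these assumptions, a large value in the list returned by `incremental_w1` marks a symbol that shifts the running distribution noticeably. Whether such shifts align with sentence-like boundaries depends on the modeling choices described in the full paper, which this sketch does not reproduce.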

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e7c/9317616/f85bd11c11b4/entropy-24-00859-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e7c/9317616/4a6f596f43f9/entropy-24-00859-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e7c/9317616/c824be3f7a05/entropy-24-00859-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e7c/9317616/037b83821b57/entropy-24-00859-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e7c/9317616/3e56e3591fc8/entropy-24-00859-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e7c/9317616/0041b8bbc2c1/entropy-24-00859-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e7c/9317616/c5cf2ecfb6f1/entropy-24-00859-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4e7c/9317616/30f9b2f0f424/entropy-24-00859-g008.jpg