
Probing the statistical properties of unknown texts: application to the Voynich Manuscript.

Affiliation

Institute of Physics of São Carlos, University of São Paulo, São Carlos, São Paulo, Brazil.

Publication information

PLoS One. 2013 Jul 2;8(7):e67310. doi: 10.1371/journal.pone.0067310. Print 2013.

DOI: 10.1371/journal.pone.0067310
PMID: 23844002
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3699599/

Abstract

While the use of statistical physics methods to analyze large corpora has been useful to unveil many patterns in texts, no comprehensive investigation has been performed on the interdependence between syntactic and semantic factors. In this study we propose a framework for determining whether a text (e.g., written in an unknown alphabet) is compatible with a natural language and to which language it could belong. The approach is based on three types of statistical measurements, i.e. obtained from first-order statistics of word properties in a text, from the topology of complex networks representing texts, and from intermittency concepts where text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify the dependency of the different measurements on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidates for keywords of the Voynich Manuscript which could be helpful in the effort of deciphering it. Because we were able to identify statistical measurements that are more dependent on the syntax than on the semantics, the framework may also serve for text analysis in language-dependent applications.
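
As a rough illustration of the kind of measurement the abstract describes, the sketch below builds a word adjacency network from a token list, computes its degree assortativity, and computes an intermittency score for a single word, so that a text can be compared with a shuffled copy of itself. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy text, the choice to link only directly adjacent words, and the definition of intermittency as the standard deviation of a word's recurrence gaps divided by their mean are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev


def adjacency_edges(words):
    """Undirected edges between distinct consecutive words (assumed network model)."""
    return {tuple(sorted(p)) for p in zip(words, words[1:]) if p[0] != p[1]}


def degrees(edges):
    """Number of distinct neighbours of each word in the network."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def degree_assortativity(edges):
    """Pearson correlation between the degrees found at the two ends of an edge."""
    deg = degrees(edges)
    xs, ys = [], []
    for u, v in edges:
        # Each undirected edge is counted in both directions.
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    mx = mean(xs)
    var = mean((x - mx) ** 2 for x in xs)
    if var == 0:
        return 0.0  # all end points have the same degree
    cov = mean((x - mx) * (y - mx) for x, y in zip(xs, ys))
    return cov / var


def intermittency(words, target):
    """Std/mean of the gaps between successive occurrences of `target`.

    Returns None when the word occurs too rarely to estimate a spread.
    """
    positions = [i for i, w in enumerate(words) if w == target]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None
    return pstdev(gaps) / mean(gaps)


if __name__ == "__main__":
    # Hypothetical toy text; a real analysis would use a full book.
    text = ("the cat sat on the mat and the dog sat on the rug "
            "while the cat watched the dog and the dog watched the cat").split()
    shuffled = text[:]
    random.Random(0).shuffle(shuffled)  # keeps word frequencies, destroys order

    for label, tokens in (("original", text), ("shuffled", shuffled)):
        edges = adjacency_edges(tokens)
        print(label,
              "assortativity:", round(degree_assortativity(edges), 3),
              "intermittency('cat'):", intermittency(tokens, "cat"))
```

Shuffling preserves first-order statistics such as word frequency, so any difference between the two printed rows comes from word order alone. On a text this short the values are not meaningful; the comparison only becomes informative on book-length input.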


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aa8b/3699599/18853e618218/pone.0067310.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aa8b/3699599/1d61619c0ebe/pone.0067310.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aa8b/3699599/008dc92fcdd3/pone.0067310.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aa8b/3699599/ba34576eb193/pone.0067310.g004.jpg

Similar articles

1
Probing the statistical properties of unknown texts: application to the Voynich Manuscript.
PLoS One. 2013 Jul 2;8(7):e67310. doi: 10.1371/journal.pone.0067310. Print 2013.
2
Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis.
PLoS One. 2013 Jun 21;8(6):e66344. doi: 10.1371/journal.pone.0066344. Print 2013.
3
Random texts do not exhibit the real Zipf's law-like rank distribution.
PLoS One. 2010 Mar 9;5(3):e9411. doi: 10.1371/journal.pone.0009411.
4
New standardised texts for assessing reading performance in four European languages.
Br J Ophthalmol. 2006 Apr;90(4):480-4. doi: 10.1136/bjo.2005.087379.
5
A Mathematical Model for Universal Semantics.
IEEE Trans Pattern Anal Mach Intell. 2022 Mar;44(3):1124-1132. doi: 10.1109/TPAMI.2020.3022533. Epub 2022 Feb 3.
6
Syntactic structures in languages and biology.
Cogn Process. 2008 Aug;9(3):153-8. doi: 10.1007/s10339-007-0194-7. Epub 2007 Oct 19.
7
Quantifying the information in the long-range order of words: semantic structures and universal linguistic constraints.
Cortex. 2014 Jun;55:5-16. doi: 10.1016/j.cortex.2013.08.008. Epub 2013 Aug 29.
8
Recurrence Networks in Natural Languages.
Entropy (Basel). 2019 May 23;21(5):517. doi: 10.3390/e21050517.
9
The count-mass distinction in typically developing and grammatically specifically language impaired children: new evidence on the role of syntax and semantics.
J Commun Disord. 2008 May-Jun;41(3):274-303. doi: 10.1016/j.jcomdis.2007.11.001. Epub 2007 Dec 4.
10
The Role of Surface, Semantic and Grammatical Features on Simplification of Spanish Medical Texts: A User Study.
AMIA Annu Symp Proc. 2018 Apr 16;2017:1322-1331. eCollection 2017.

Articles citing this article

1
Leveraging word embeddings to enhance co-occurrence networks: A statistical analysis.
PLoS One. 2025 Jul 11;20(7):e0327421. doi: 10.1371/journal.pone.0327421. eCollection 2025.
2
Cancer Segmentation by Entropic Analysis of Ordered Gene Expression Profiles.
Entropy (Basel). 2022 Nov 29;24(12):1744. doi: 10.3390/e24121744.
3
Understanding the spatial dimension of natural language by measuring the spatial semantic similarity of words through a scalable geospatial context window.
PLoS One. 2020 Jul 23;15(7):e0236347. doi: 10.1371/journal.pone.0236347. eCollection 2020.
4
Complexity-entropy analysis at different levels of organisation in written language.
PLoS One. 2019 May 8;14(5):e0214863. doi: 10.1371/journal.pone.0214863. eCollection 2019.
5
Authorship attribution based on Life-Like Network Automata.
PLoS One. 2018 Mar 22;13(3):e0193703. doi: 10.1371/journal.pone.0193703. eCollection 2018.
6
A Complex Network Approach to Stylometry.
PLoS One. 2015 Aug 27;10(8):e0136076. doi: 10.1371/journal.pone.0136076. eCollection 2015.
7
Probing the topological properties of complex networks modeling short written texts.
PLoS One. 2015 Feb 26;10(2):e0118394. doi: 10.1371/journal.pone.0118394. eCollection 2015.

References cited in this article

1
Empirical analysis of collective human behavior for extraordinary events in the blogosphere.
Phys Rev E Stat Nonlin Soft Matter Phys. 2013 Jan;87(1):012805. doi: 10.1103/PhysRevE.87.012805. Epub 2013 Jan 11.
2
Languages cool as they expand: allometric scaling and the decreasing need for new words.
Sci Rep. 2012;2:943. doi: 10.1038/srep00943. Epub 2012 Dec 10.
3
On the origin of long-range correlations in texts.
Proc Natl Acad Sci U S A. 2012 Jul 17;109(29):11582-7. doi: 10.1073/pnas.1117723109. Epub 2012 Jul 2.
4
Statistical laws governing fluctuations in word use from word birth to word death.
Sci Rep. 2012;2:313. doi: 10.1038/srep00313. Epub 2012 Mar 15.
5
Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures.
Science. 2011 Sep 30;333(6051):1878-81. doi: 10.1126/science.1202775.
6
Automatic network fingerprinting through single-node motifs.
PLoS One. 2011 Jan 31;6(1):e15765. doi: 10.1371/journal.pone.0015765.
7
Quantitative analysis of culture using millions of digitized books.
Science. 2011 Jan 14;331(6014):176-82. doi: 10.1126/science.1199644. Epub 2010 Dec 16.
8
Beyond word frequency: bursts, lulls, and scaling in the temporal distributions of words.
PLoS One. 2009 Nov 11;4(11):e7678. doi: 10.1371/journal.pone.0007678.
9
Scaling laws of human interaction activity.
Proc Natl Acad Sci U S A. 2009 Aug 4;106(31):12640-5. doi: 10.1073/pnas.0902667106. Epub 2009 Jul 14.
10
Modeling statistical properties of written text.
PLoS One. 2009;4(4):e5372. doi: 10.1371/journal.pone.0005372. Epub 2009 Apr 29.