
On the fractal patterns of language structures.

Affiliations

Departamento de Ciências Econômicas, Faculdade de Ciências Econômicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brasil.

Departamento de Física, Instituto de Ciências Exatas e Biológicas, Universidade Federal de Ouro Preto, Ouro Preto, Minas Gerais, Brasil.

Publication information

PLoS One. 2023 May 18;18(5):e0285630. doi: 10.1371/journal.pone.0285630. eCollection 2023.

DOI:10.1371/journal.pone.0285630
PMID:37200318
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10194960/
Abstract

Natural Language Processing (NLP) uses Artificial Intelligence algorithms to extract meaningful information from unstructured text, i.e., content that lacks metadata and cannot easily be indexed or mapped onto standard database fields. Its applications range from sentiment analysis and text summarization to automatic translation. In this work, we use NLP to identify similar structural linguistic patterns across several languages. We apply the word2vec algorithm, which creates a vector representation of words in a multidimensional space that preserves the meaning relationships between them. From a large corpus, we built this vector representation in a 100-dimensional space for English, Portuguese, German, Spanish, Russian, French, Chinese, Japanese, Korean, Italian, Arabic, Hebrew, Basque, Dutch, Swedish, Finnish, and Estonian. We then calculated the fractal dimensions of the structure that represents each language. The structures are multifractals with two different dimensions, which we use, together with the token-dictionary size rate of each language, to represent the languages in a three-dimensional space. Finally, by analyzing the distances among languages in this space, we conclude that closeness there tends to be related to distance in the phylogenetic tree that depicts the lines of evolutionary descent of the languages from a common ancestor.
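
The abstract does not say which estimator was used for the fractal dimensions, so the following is only a minimal sketch of one common choice, box counting, applied to a cloud of points such as word vectors. The function names (`box_count`, `box_counting_dimension`, `token_dictionary_rate`, `language_distance`) are hypothetical, and the definition of the token-dictionary size rate as total tokens divided by distinct tokens is an assumption, not something the abstract states. Fitting the slope separately over two ranges of scales would be one way to obtain two dimensions per language, again as an assumption about the procedure.

```python
import math
from typing import Sequence

Point = Sequence[float]


def box_count(points: Sequence[Point], eps: float) -> int:
    """Count the grid cells of side `eps` that contain at least one point."""
    return len({tuple(math.floor(x / eps) for x in p) for p in points})


def box_counting_dimension(points: Sequence[Point], scales: Sequence[float]) -> float:
    """Least-squares slope of log N(eps) against log(1/eps) over `scales`."""
    xs = [math.log(1.0 / eps) for eps in scales]
    ys = [math.log(box_count(points, eps)) for eps in scales]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def token_dictionary_rate(tokens: Sequence[str]) -> float:
    """Assumed definition: total token count divided by distinct token count."""
    return len(tokens) / len(set(tokens))


def language_distance(a: Point, b: Point) -> float:
    """Euclidean distance between two languages, each given as a 3-tuple
    (dimension 1, dimension 2, token-dictionary rate)."""
    return math.dist(a, b)


# Sanity check on shapes whose dimension is known in advance:
# a straight segment embedded in 3D should give a value close to 1,
# and a filled square in 2D a value close to 2.
segment = [(i / 4096,) * 3 for i in range(4096)]
square = [(i / 64, j / 64) for i in range(64) for j in range(64)]
print(box_counting_dimension(segment, [2.0 ** -k for k in range(2, 7)]))
print(box_counting_dimension(square, [2.0 ** -k for k in range(2, 6)]))
```

For real 100-dimensional embeddings, a grid-based count becomes sparse very quickly, so a distance-based estimator such as the correlation dimension may be more practical. The sketch is meant to illustrate the idea of a scaling exponent, not to reproduce the paper's pipeline.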


Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8c32/10194960/3a08acc75044/pone.0285630.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8c32/10194960/6cf7b2353182/pone.0285630.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8c32/10194960/e0a102dfaf31/pone.0285630.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8c32/10194960/587fc90212d8/pone.0285630.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8c32/10194960/c69594a26689/pone.0285630.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8c32/10194960/84018d28efa6/pone.0285630.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8c32/10194960/d7267f1b757c/pone.0285630.g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8c32/10194960/d6c55ef2d4ed/pone.0285630.g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8c32/10194960/ecf51942440d/pone.0285630.g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8c32/10194960/d28f7b1fd1a4/pone.0285630.g011.jpg

