A survey on text classification: Practical perspectives on the Italian language.

Affiliations

Department of Management, Ca' Foscari University, Venice, Italy.

Department of Environmental Sciences, Informatics and Statistics, Ca' Foscari University, Venice, Italy.

Publication information

PLoS One. 2022 Jul 6;17(7):e0270904. doi: 10.1371/journal.pone.0270904. eCollection 2022.

DOI:10.1371/journal.pone.0270904
PMID:35793328
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9258888/
Abstract

Text Classification methods have been improving at an unparalleled speed in the last decade thanks to the success brought about by deep learning. Historically, state-of-the-art approaches have been developed for and benchmarked against English datasets, while other languages have had to catch up and deal with inevitable linguistic challenges. This paper offers a survey with practical and linguistic connotations, showcasing the complications and challenges tied to the application of modern Text Classification algorithms to languages other than English. We engage this subject from the perspective of the Italian language, and we discuss in detail issues related to the scarcity of task-specific datasets, as well as the issues posed by the computational expensiveness of modern approaches. We substantiate this by providing an extensively researched list of available datasets in Italian, comparing it with a similarly sought list for French, which we use for comparison. In order to simulate a real-world practical scenario, we apply a number of representative methods to custom-tailored multilabel classification datasets in Italian, French, and English. We conclude by discussing results, future challenges, and research directions from a linguistically inclusive perspective.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92be/9258888/a4486a043cd8/pone.0270904.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92be/9258888/1bd4cc83a2bb/pone.0270904.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92be/9258888/15fd83c904b5/pone.0270904.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92be/9258888/f55e96320bc9/pone.0270904.g003.jpg

Similar articles

1
A survey on text classification: Practical perspectives on the Italian language.
PLoS One. 2022 Jul 6;17(7):e0270904. doi: 10.1371/journal.pone.0270904. eCollection 2022.
2
A cross-linguistic study of real-word and non-word repetition as predictors of grammatical competence in children with typical language development.
Int J Lang Commun Disord. 2011 Sep-Oct;46(5):564-78. doi: 10.1111/j.1460-6984.2011.00008.x. Epub 2011 Sep 1.
3
Assessment of minority language skills in English-Irish-speaking bilingual children: A survey of SLT perspectives and current practices.
Int J Lang Commun Disord. 2022 Jan;57(1):63-77. doi: 10.1111/1460-6984.12674. Epub 2021 Oct 18.
4
Automated classification of cancer morphology from Italian pathology reports using Natural Language Processing techniques: A rule-based approach.
J Biomed Inform. 2021 Apr;116:103712. doi: 10.1016/j.jbi.2021.103712. Epub 2021 Feb 18.
5
Let's all speak together! Exploring the masking effects of various languages on spoken word identification in multi-linguistic babble.
PLoS One. 2013 Jun 12;8(6):e65668. doi: 10.1371/journal.pone.0065668. Print 2013.
6
iSentenizer-μ: multilingual sentence boundary detection model.
ScientificWorldJournal. 2014;2014:196574. doi: 10.1155/2014/196574. Epub 2014 Apr 15.
7
Early lexical and syntactic development in Quebec French and English: implications for cross-linguistic and bilingual assessment.
Int J Lang Commun Disord. 2005 Jul-Sep;40(3):243-78. doi: 10.1080/13682820410001729655.
8
The statistical signature of morphosyntax: a study of Hungarian and Italian infant-directed speech.
Cognition. 2012 Nov;125(2):263-87. doi: 10.1016/j.cognition.2012.06.010. Epub 2012 Aug 6.
9
Word Order Typology Interacts With Linguistic Complexity: A Cross-Linguistic Corpus Study.
Cogn Sci. 2020 Apr;44(4):e12822. doi: 10.1111/cogs.12822.
10
Using hybridization networks to retrace the evolution of Indo-European languages.
BMC Evol Biol. 2016 Sep 6;16(1):180. doi: 10.1186/s12862-016-0745-6.

Cited by

1
Assemble the shallow or integrate a deep? Toward a lightweight solution for glyph-aware Chinese text classification.
PLoS One. 2023 Jul 28;18(7):e0289204. doi: 10.1371/journal.pone.0289204. eCollection 2023.
