Department of Management, Ca' Foscari University, Venice, Italy.
Department of Environmental Sciences, Informatics and Statistics, Ca' Foscari University, Venice, Italy.
PLoS One. 2022 Jul 6;17(7):e0270904. doi: 10.1371/journal.pone.0270904. eCollection 2022.
Text Classification methods have improved at an unparalleled speed over the last decade thanks to the success of deep learning. Historically, state-of-the-art approaches have been developed for and benchmarked against English datasets, while other languages have had to catch up and deal with inevitable linguistic challenges. This paper offers a survey with a practical and linguistic focus, showcasing the complications and challenges tied to the application of modern Text Classification algorithms to languages other than English. We approach this subject from the perspective of the Italian language, and we discuss in detail the issues related to the scarcity of task-specific datasets, as well as those posed by the computational cost of modern approaches. We substantiate this by providing an extensively researched list of available datasets in Italian and comparing it with a similarly compiled list for French. In order to simulate a real-world practical scenario, we apply a number of representative methods to custom-tailored multilabel classification datasets in Italian, French, and English. We conclude by discussing results, future challenges, and research directions from a linguistically inclusive perspective.