Entropy-based discrimination between translated Chinese and original Chinese using data mining techniques.

Affiliations

Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China.

School of Applied Mathematics, Guangdong University of Technology, Guangdong, China.

Publication information

PLoS One. 2022 Mar 24;17(3):e0265633. doi: 10.1371/journal.pone.0265633. eCollection 2022.

Abstract

The present research reports on the use of data mining techniques to differentiate translated Chinese from non-translated (original) Chinese on the basis of monolingual comparable corpora. We operationalized seven entropy-based metrics for data mining and analysis, drawn from two balanced Chinese comparable corpora (translated vs non-translated): character entropy; wordform unigram, bigram and trigram entropy; and POS (part-of-speech) unigram, bigram and trigram entropy. We then applied four data mining techniques, namely Support Vector Machines (SVMs), Linear Discriminant Analysis (LDA), Random Forest (RF) and Multilayer Perceptron (MLP), to distinguish translated Chinese from original Chinese based on these seven features. Our results show that the SVM is the most robust and effective classifier, yielding an AUC of 90.5% and an accuracy rate of 84.3%. Our results affirm the hypothesis that translational language is categorically different from original language. Our research demonstrates that combining the information-theoretic indicator of Shannon's entropy with machine learning techniques can provide a novel approach to studying translation as a unique communicative activity. This study yields new insights for corpus-based studies of the translationese phenomenon in the field of translation studies.
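To make the seven entropy features concrete, the following is a minimal Python sketch of how Shannon entropy over characters and over wordform and POS n-grams could be computed for a single text. It is an illustration only, not the authors' pipeline: the abstract does not specify the segmentation tool, the POS tagger, the logarithm base, or any normalization, so the base-2 logarithm and the helper names `ngrams`, `shannon_entropy` and `entropy_features` are assumptions introduced here. The inputs are assumed to be already segmented and tagged.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shannon_entropy(items):
    """Shannon entropy H = -sum(p * log2(p)) of the empirical distribution.

    Assumption: base-2 logarithm (bits); the abstract does not state the base.
    """
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_features(chars, words, pos_tags):
    """Build the seven entropy features named in the abstract for one text.

    chars, words and pos_tags are hypothetical pre-processed inputs:
    a character sequence, a segmented word list and a parallel POS tag list.
    """
    features = {"char": shannon_entropy(chars)}
    for n, label in ((1, "unigram"), (2, "bigram"), (3, "trigram")):
        features[f"word_{label}"] = shannon_entropy(ngrams(words, n))
        features[f"pos_{label}"] = shannon_entropy(ngrams(pos_tags, n))
    return features


if __name__ == "__main__":
    # Toy example with made-up segmentation and tags, for illustration only.
    words = ["我", "喜欢", "翻译", "我", "喜欢", "语言"]
    pos_tags = ["PN", "VV", "NN", "PN", "VV", "NN"]
    chars = list("".join(words))
    for name, value in entropy_features(chars, words, pos_tags).items():
        print(f"{name}: {value:.3f}")
```

Each text would then be represented by a seven-dimensional feature vector, which is the kind of input a classifier such as an SVM, LDA, RF or MLP could be trained on to separate translated from original texts.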

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/06a5/8947138/8a7412f93640/pone.0265633.g001.jpg
