Gayle Alberto Alexander, Shimaoka Motomu
Center for Medical and Nursing Education, Mie University School of Medicine, Mie, Japan.
Department of Immunology, Mie University Graduate School of Medicine, Mie, Japan.
PLoS One. 2017 Feb 17;12(2):e0172338. doi: 10.1371/journal.pone.0172338. eCollection 2017.
The predominance of English in scientific research has created hurdles for "non-native speakers" of English. Here we present a novel application of native language identification (NLI) to the assessment of medical-scientific writing. For this purpose, we created a novel classification system in which scoring is based solely on text features found to be distinctive among native English speakers (NS) within a given context. We dubbed this score the "Genuine Index" (GI).
This methodology was validated using a small set of journals in the field of pediatric oncology. Our dataset consisted of 5,907 abstracts, representing work from 77 countries. A support vector machine (SVM) was used to generate our model and for scoring.
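As a rough illustration of how an SVM can both classify text and produce a score, the sketch below trains a linear SVM on word n-gram features using scikit-learn. Everything specific here is an assumption made for illustration: the abstract does not state the feature set, kernel, or software used, the six training sentences are invented rather than drawn from the dataset, and using the signed distance from the decision boundary as a stand-in for a GI-like score is our guess at the general idea, not the authors' definition.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy sentences, NOT taken from the study's abstracts.
# Label 1 = written by a native English speaker (NS), 0 = other.
train_texts = [
    "We report outcomes for children treated with this regimen.",
    "Survival was excellent in patients who received early therapy.",
    "These findings suggest that early therapy improves survival.",
    "In this paper, the outcomes of the children were investigated by us.",
    "It was considered that the therapy is useful for the patients.",
    "The therapy was performed to the patients and the results were discussed.",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Word unigrams and bigrams weighted by TF-IDF, fed to a linear SVM.
# The feature choice is an assumption; the abstract does not specify it.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LinearSVC(),
)
model.fit(train_texts, train_labels)

new_texts = [
    "We report survival in children who received this regimen.",
    "It was considered that the results were investigated by us.",
]

# Signed distance from the separating hyperplane: positive values fall
# on the NS side, negative values on the other side.
scores = model.decision_function(new_texts)
predicted = model.predict(new_texts)

for text, score, label in zip(new_texts, scores, predicted):
    print(f"{score:+.3f}  class={label}  {text}")
```

With a real corpus, the texts and labels would come from the abstracts and the authors' countries, and the model would be evaluated on held-out data rather than on a handful of sentences.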
Accuracy, precision, and recall of the classification model were 93.3%, 93.7%, and 99.4%, respectively. Class-specific F-scores were 96.5% for NS and 39.8% for our benchmark class, Japan. Overall kappa was calculated to be 37.2%. We found significant differences between countries with respect to the GI score. A significant correlation was found between GI scores and two validated objective measures of writing proficiency and readability. Two sets of key terms and phrases differentiating NS and non-native writing were identified.
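The relationship between these figures can be checked with a few lines of arithmetic. The sketch below assumes the F-score is the usual harmonic mean of precision and recall (F1) and that kappa is Cohen's kappa; the abstract does not spell out either definition. The function names and the confusion-matrix counts are hypothetical, and the kappa function handles only the two-class case, whereas the study's overall kappa may have been computed over more classes.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa for a 2x2 confusion matrix (binary case only)."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: product of predicted and actual marginals, per class.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)


# Precision and recall as reported in the abstract.
ns_f1 = f1_score(0.937, 0.994)
print(f"F1 from reported precision and recall: {ns_f1:.3f}")

# Hypothetical, heavily imbalanced counts (not the study's data), showing
# that high accuracy and a modest kappa can occur together.
tp, fp, fn, tn = 950, 60, 6, 20
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"accuracy={accuracy:.3f}  kappa={cohen_kappa(tp, fp, fn, tn):.3f}")
```

Under these assumptions, the reported precision and recall reproduce the NS F-score. The gap between accuracy and kappa is what one would expect when a single class dominates the data, because kappa discounts the agreement that would arise by chance from the class proportions alone.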
Our GI model was able to detect, with a high degree of reliability, subtle differences between the terms and phrasing used by native and non-native speakers in peer-reviewed journals in the field of pediatric oncology. In addition, L1 language transfer was found to be very likely to survive revision, especially in non-Western countries such as Japan. These findings show that even when the language used is technically correct, there may still be some phrasing or usage that impacts quality.