Evaluating LLMs' grammatical error correction performance in learner Chinese.

Affiliation

Yantai Institute of Technology, Yantai, China.

Publication information

PLoS One. 2024 Oct 30;19(10):e0312881. doi: 10.1371/journal.pone.0312881. eCollection 2024.

DOI:10.1371/journal.pone.0312881

PMID:39476066

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11524451/

Abstract

Large language models (LLMs) have recently exhibited significant capabilities in various English NLP tasks. However, their performance in Chinese grammatical error correction (CGEC) remains unexplored. This study evaluates, from a corpus-linguistic perspective, how well state-of-the-art LLMs correct errors in learner Chinese. LLM performance is assessed with the standard MaxMatch scoring metrics. Keyword and key n-gram analyses are conducted to quantitatively explore the linguistic features that differentiate LLM outputs from those of human annotators, and these keywords and key n-grams then serve as probes for a qualitative analysis of LLM performance along syntactic and semantic dimensions. Results show that LLMs score relatively higher on test datasets with multiple annotators and lower on those with a single annotator. Even when explicitly prompted to follow a "minimal edit" strategy, LLMs tend to overcorrect erroneous sentences, using more linguistic devices to generate fluent and grammatical output. Furthermore, they struggle with under-correction and hallucination in reasoning-dependent situations. These findings highlight the strengths and limitations of LLMs in CGEC and suggest that future efforts should focus on reducing overcorrection and improving the handling of complex semantic contexts.
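
To make the scoring step concrete, the following is a minimal sketch of edit-based precision, recall, and F-beta in the spirit of MaxMatch scoring. It is not the authors' evaluation code. The edit representation `(start, end, replacement)`, the function names, and the default `beta = 0.5` are assumptions made for illustration (the abstract does not state the beta value), and the sketch picks the best-matching annotator per sentence rather than reproducing the real scorer's edit extraction and reference selection.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical edit representation: (start_token, end_token, replacement).
Edit = Tuple[int, int, str]


def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from edit counts. beta < 1 weights precision above recall."""
    # Convention assumed here: an empty denominator counts as a perfect score.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_counts(
    system: FrozenSet[Edit],
    references: List[FrozenSet[Edit]],
    beta: float = 0.5,
) -> Tuple[int, int, int]:
    """Return (tp, fp, fn) against the annotator that scores the system best."""
    best = None
    for gold in references:
        tp = len(system & gold)   # edits the system and annotator share
        fp = len(system - gold)   # system edits the annotator did not make
        fn = len(gold - system)   # annotator edits the system missed
        score = f_beta(tp, fp, fn, beta)
        if best is None or score > best[0]:
            best = (score, tp, fp, fn)
    if best is None:
        raise ValueError("each sentence needs at least one reference")
    return best[1], best[2], best[3]


def corpus_f_beta(
    systems: List[FrozenSet[Edit]],
    references: List[List[FrozenSet[Edit]]],
    beta: float = 0.5,
) -> float:
    """Accumulate counts over sentences, then compute one corpus-level F-beta."""
    tp = fp = fn = 0
    for sys_edits, refs in zip(systems, references):
        t, f, n = best_counts(sys_edits, refs, beta)
        tp, fp, fn = tp + t, fp + f, fn + n
    return f_beta(tp, fp, fn, beta)


# Toy data: the system makes two edits; annotator A accepts one, annotator B both.
system_edits = frozenset({(1, 2, "了"), (4, 5, "的")})
annotator_a = frozenset({(1, 2, "了")})
annotator_b = frozenset({(1, 2, "了"), (4, 5, "的")})

single = corpus_f_beta([system_edits], [[annotator_a]])
multi = corpus_f_beta([system_edits], [[annotator_a, annotator_b]])
```

In this toy case the same system output scores lower against annotator A alone than against both annotators, because the extra edit counts as a false positive unless some reference contains it. That is one plausible mechanism behind a gap between single-annotator and multi-annotator test sets, offered here only as an illustration and not as the paper's stated explanation.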
