

ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment.

Authors

Naous Tarek, Ryan Michael J, Lavrouk Anton, Chandra Mohit, Xu Wei

Affiliation

College of Computing, Georgia Institute of Technology.

Publication

Proc Conf Empir Methods Nat Lang Process. 2024 Nov;2024:12230-12266. doi: 10.18653/v1/2024.emnlp-main.682.

Abstract

We present a comprehensive evaluation of large language models for multilingual readability assessment. Existing evaluation resources lack domain and language diversity, limiting cross-domain and cross-lingual analysis. This paper introduces ReadMe++, a multilingual multi-domain dataset with human annotations of 9757 sentences in Arabic, English, French, Hindi, and Russian, collected from 112 different data sources. This benchmark will encourage research on developing robust multilingual readability assessment methods. Using ReadMe++, we benchmark multilingual and monolingual language models in supervised, unsupervised, and few-shot prompting settings. The domain and language diversity in ReadMe++ enables us to test more effective few-shot prompting and to identify shortcomings in state-of-the-art unsupervised methods. Our experiments also reveal superior domain generalization and enhanced cross-lingual transfer by models trained on ReadMe++. We will make our data publicly available and release a Python package for multilingual sentence readability prediction using our trained models at: https://github.com/tareknaous/readme


Similar Articles

1
MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain.
Proc Conf Empir Methods Nat Lang Process. 2024 Nov;2024:17293-17319. doi: 10.18653/v1/2024.emnlp-main.958.
2
Watch and learn: leveraging expert knowledge and language for surgical video understanding.
Int J Comput Assist Radiol Surg. 2025 Jul 2. doi: 10.1007/s11548-025-03472-4.
3
Language intervention in bilingual children with developmental language disorder: A systematic review.
Int J Lang Commun Disord. 2023 Mar;58(2):576-600. doi: 10.1111/1460-6984.12803. Epub 2022 Nov 25.
4
Diagnostic test accuracy and cost-effectiveness of tests for codeletion of chromosomal arms 1p and 19q in people with glioma.
Cochrane Database Syst Rev. 2022 Mar 2;3(3):CD013387. doi: 10.1002/14651858.CD013387.pub2.
5
Interventions for implementation of thromboprophylaxis in hospitalized patients at risk for venous thromboembolism.
Cochrane Database Syst Rev. 2018 Apr 24;4(4):CD008201. doi: 10.1002/14651858.CD008201.pub3.

