Sun Yicheng, Wang Yi, Yang Hanbo, Suen Richard
School of Mechanical and Precision Instrument Engineering, Xi'an University of Technology, Xi'an, China.
Faculty of Management, Shenzhen MSU-BIT University, Shenzhen, China.
Sci Rep. 2025 Jul 7;15(1):24312. doi: 10.1038/s41598-025-10085-z.
Human writing exhibits a wide range of styles and levels of sophistication, yet automated text generation systems typically lack the nuanced understanding needed to produce refined, elegant prose. Because natural language generation tasks have an inherent one-to-many relationship between inputs and outputs, annotator consistency is hard to achieve, which makes annotation considerably more difficult than for natural language understanding tasks. This study focuses on text refinement, a typical generation task that faces these annotation difficulties; its goal is to generate sentences with more elegant expressions while preserving the original semantics of the input sentence. We propose a semi-automatic data construction method that combines automatic generation with human judgment. First, the method uses back translation to convert collected sentences containing elegant expressions into ordinary expressions. Then, in an iterative quality control process, data filtering and human judgment screen the auto-generated data against quality standards, yielding a large-scale text refinement dataset. By replacing manual annotation with human judgment, and by submitting only a small amount of data for human judgment in each iteration, the method significantly reduces annotation difficulty and workload. With minimal human effort, it produces a substantial amount of labeled data for text refinement, laying a foundation for further research in the field.
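The two stages described in the abstract (back translation to obtain ordinary sentences, then iterative filtering with a small human-judged sample per round) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `build_pairs` and `iterative_quality_control`, the parameters `sample_size`, `target_accept_rate`, and `max_rounds`, and the stopping rule are all hypothetical, and the back-translation step, automatic filter, and human judge are passed in as placeholder callables because the abstract does not specify them.

```python
import random
from typing import Callable, List, Tuple

# A training pair: (ordinary sentence, elegant sentence).
Pair = Tuple[str, str]


def build_pairs(elegant: List[str],
                back_translate: Callable[[str], str]) -> List[Pair]:
    """Turn each elegant sentence into an (ordinary, elegant) pair.

    `back_translate` is a placeholder for any round-trip translation
    routine; no particular translation system is assumed here.
    """
    return [(back_translate(s), s) for s in elegant]


def iterative_quality_control(pairs: List[Pair],
                              auto_filter: Callable[[Pair], bool],
                              human_judge: Callable[[Pair], bool],
                              sample_size: int = 2,
                              target_accept_rate: float = 0.9,
                              max_rounds: int = 5,
                              seed: int = 0) -> List[Pair]:
    """Screen auto-generated pairs, judging only a small sample per round.

    Assumed stopping rule (not from the paper): stop once the human
    acceptance rate on the sample reaches `target_accept_rate`.
    """
    rng = random.Random(seed)
    kept = list(pairs)
    for _ in range(max_rounds):
        # Automatic data filtering against the quality standard.
        kept = [p for p in kept if auto_filter(p)]
        if not kept:
            break
        # Humans judge (accept/reject) a small sample; they do not annotate.
        sample = rng.sample(kept, min(sample_size, len(kept)))
        verdicts = {p: human_judge(p) for p in sample}
        accept_rate = sum(verdicts.values()) / len(sample)
        if accept_rate >= target_accept_rate:
            break
        # Assumption: drop the rejected sampled pairs and try another round.
        # A real pipeline would more likely also tighten `auto_filter` here.
        kept = [p for p in kept if verdicts.get(p, True)]
    return kept
```

In this sketch the human effort per round is bounded by `sample_size`, which mirrors the abstract's point that only a small amount of data is judged in each iteration; how the rejected samples feed back into the filter is left open because the abstract does not describe it.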