Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
Sci Data. 2023 Jan 4;10(1):8. doi: 10.1038/s41597-022-01920-3.
Though exponentially growing health-related literature has been made available to a broad audience online, the language of scientific articles can be difficult for the general public to understand. Therefore, adapting this expert-level language into plain language versions is necessary for the public to reliably comprehend the vast health-related literature. Deep Learning algorithms for automatic adaptation are a possible solution; however, gold standard datasets are needed for proper evaluation. Proposed datasets thus far consist of either pairs of comparable professional- and general public-facing documents or pairs of semantically similar sentences mined from such documents. This leads to a trade-off between imperfect alignments and small test sets. To address this issue, we created the Plain Language Adaptation of Biomedical Abstracts dataset. This dataset is the first manually adapted dataset that is both document- and sentence-aligned. The dataset contains 750 adapted abstracts, totaling 7643 sentence pairs. Along with describing the dataset, we benchmark automatic adaptation on the dataset with state-of-the-art Deep Learning approaches, setting baselines for future research.