A dataset for plain language adaptation of biomedical abstracts.

Affiliation

Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Publication information

Sci Data. 2023 Jan 4;10(1):8. doi: 10.1038/s41597-022-01920-3.

DOI:10.1038/s41597-022-01920-3

PMID:36599892

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9811873/

Abstract

Though exponentially growing health-related literature has been made available to a broad audience online, the language of scientific articles can be difficult for the general public to understand. Therefore, adapting this expert-level language into plain language versions is necessary for the public to reliably comprehend the vast health-related literature. Deep Learning algorithms for automatic adaptation are a possible solution; however, gold standard datasets are needed for proper evaluation. Proposed datasets thus far consist of either pairs of comparable professional- and general public-facing documents or pairs of semantically similar sentences mined from such documents. This leads to a trade-off between imperfect alignments and small test sets. To address this issue, we created the Plain Language Adaptation of Biomedical Abstracts dataset. This dataset is the first manually adapted dataset that is both document- and sentence-aligned. The dataset contains 750 adapted abstracts, totaling 7643 sentence pairs. Along with describing the dataset, we benchmark automatic adaptation on the dataset with state-of-the-art Deep Learning approaches, setting baselines for future research.
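
As a rough illustration of what "both document- and sentence-aligned" could mean in practice, the sketch below models one adapted abstract as an ordered list of sentence pairs, so that the document-level texts can be rebuilt from the sentence-level alignment. The class names, field names, and example sentences are hypothetical and invented for illustration; they are not the dataset's actual schema or contents. Only the totals (750 abstracts, 7643 sentence pairs) come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SentencePair:
    """One aligned pair (hypothetical schema, not the dataset's real format)."""
    source: str      # sentence from the expert-level abstract
    adaptation: str  # plain language version of that sentence


@dataclass
class AdaptedAbstract:
    """One abstract, stored as an ordered list of sentence pairs."""
    doc_id: str  # hypothetical identifier
    pairs: List[SentencePair] = field(default_factory=list)

    def source_text(self) -> str:
        # Document-level view of the original abstract.
        return " ".join(p.source for p in self.pairs)

    def adapted_text(self) -> str:
        # Document-level view of the plain language adaptation.
        return " ".join(p.adaptation for p in self.pairs)


# Toy sentences written for this sketch; they are not taken from the dataset.
example = AdaptedAbstract(
    doc_id="example-1",
    pairs=[
        SentencePair(
            "Hypertension is a major risk factor for stroke.",
            "High blood pressure raises the chance of a stroke.",
        ),
        SentencePair(
            "Antihypertensive therapy reduces this risk.",
            "Medicine that lowers blood pressure reduces this risk.",
        ),
    ],
)

# Average sentence pairs per abstract, from the totals stated in the abstract.
avg_pairs_per_abstract = 7643 / 750

print(example.adapted_text())
print(round(avg_pairs_per_abstract, 2))
```

Keeping the pairs in document order is one simple way to get both alignment levels from a single structure: each pair is a sentence-level example, and joining one side of the pairs gives the document-level text.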
