
A dataset for plain language adaptation of biomedical abstracts.

Affiliation

Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Publication information

Sci Data. 2023 Jan 4;10(1):8. doi: 10.1038/s41597-022-01920-3.

DOI:10.1038/s41597-022-01920-3
PMID:36599892
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9811873/
Abstract

Though exponentially growing health-related literature has been made available to a broad audience online, the language of scientific articles can be difficult for the general public to understand. Therefore, adapting this expert-level language into plain language versions is necessary for the public to reliably comprehend the vast health-related literature. Deep Learning algorithms for automatic adaptation are a possible solution; however, gold standard datasets are needed for proper evaluation. Proposed datasets thus far consist of either pairs of comparable professional- and general public-facing documents or pairs of semantically similar sentences mined from such documents. This leads to a trade-off between imperfect alignments and small test sets. To address this issue, we created the Plain Language Adaptation of Biomedical Abstracts dataset. This dataset is the first manually adapted dataset that is both document- and sentence-aligned. The dataset contains 750 adapted abstracts, totaling 7643 sentence pairs. Along with describing the dataset, we benchmark automatic adaptation on the dataset with state-of-the-art Deep Learning approaches, setting baselines for future research.
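To make "both document- and sentence-aligned" concrete, here is a minimal sketch of how such a corpus could be represented. The class names, the `pmid` field, and the toy sentences are illustrative assumptions and do not reflect the dataset's actual file format or contents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SentencePair:
    source: str   # sentence from the expert-level abstract
    adapted: str  # manually written plain-language counterpart


@dataclass
class AdaptedAbstract:
    pmid: str  # hypothetical document identifier
    pairs: List[SentencePair] = field(default_factory=list)

    def source_text(self) -> str:
        # Document-level view: the original abstract, rebuilt from its sentences.
        return " ".join(p.source for p in self.pairs)

    def adapted_text(self) -> str:
        # Document-level view: the plain-language adaptation.
        return " ".join(p.adapted for p in self.pairs)


def count_sentence_pairs(corpus: List[AdaptedAbstract]) -> int:
    # Sentence-level view: total aligned pairs across all documents.
    return sum(len(doc.pairs) for doc in corpus)


# Invented toy data, for illustration only.
toy_corpus = [
    AdaptedAbstract(
        pmid="toy-1",
        pairs=[
            SentencePair(
                "Hypertension is a risk factor for myocardial infarction.",
                "High blood pressure raises the chance of a heart attack.",
            ),
            SentencePair(
                "Patients received oral anticoagulants.",
                "Patients took blood thinners by mouth.",
            ),
        ],
    ),
    AdaptedAbstract(
        pmid="toy-2",
        pairs=[
            SentencePair(
                "The cohort was followed for 12 months.",
                "The group was tracked for one year.",
            ),
        ],
    ),
]

total_pairs = count_sentence_pairs(toy_corpus)
```

Because each document stores its sentence pairs in order, the same records can serve a sentence-level evaluation (iterate over `pairs`) or a document-level one (compare `source_text()` with `adapted_text()`), which is the property the abstract highlights.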


Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/965a/9812971/439ba9ca844c/41597_2022_1920_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/965a/9812971/20c0ca2725d9/41597_2022_1920_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/965a/9812971/4ecdf98c033e/41597_2022_1920_Fig3_HTML.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/965a/9812971/06f5b798875d/41597_2022_1920_Fig4_HTML.jpg

Similar articles

1
A dataset for plain language adaptation of biomedical abstracts.
Sci Data. 2023 Jan 4;10(1):8. doi: 10.1038/s41597-022-01920-3.
2
Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records.
BMC Med Inform Decis Mak. 2020 Apr 30;20(Suppl 1):73. doi: 10.1186/s12911-020-1044-0.
3
Parallel Sentence Alignment from Biomedical Comparable Corpora.
Stud Health Technol Inform. 2020 Jun 16;270:362-366. doi: 10.3233/SHTI200183.
4
Fast and scalable neural embedding models for biomedical sentence classification.
BMC Bioinformatics. 2018 Dec 22;19(1):541. doi: 10.1186/s12859-018-2496-4.
5
Deep learning to refine the identification of high-quality clinical research articles from the biomedical literature: Performance evaluation.
J Biomed Inform. 2023 Jun;142:104384. doi: 10.1016/j.jbi.2023.104384. Epub 2023 May 8.
6
Are plain language summaries more readable than scientific abstracts? Evidence from six biomedical and life sciences journals.
Public Underst Sci. 2025 Jan;34(1):114-126. doi: 10.1177/09636625241252565. Epub 2024 May 24.
7
Improving extractive document summarization with sentence centrality.
PLoS One. 2022 Jul 22;17(7):e0268278. doi: 10.1371/journal.pone.0268278. eCollection 2022.
8
Neural sentence embedding models for semantic similarity estimation in the biomedical domain.
BMC Bioinformatics. 2019 Apr 11;20(1):178. doi: 10.1186/s12859-019-2789-2.
9
SNLI Indo: A recognizing textual entailment dataset in Indonesian derived from the Stanford Natural Language Inference dataset.
Data Brief. 2023 Dec 21;52:109998. doi: 10.1016/j.dib.2023.109998. eCollection 2024 Feb.
10
A Pilot Study of Biomedical Text Comprehension using an Attention-Based Deep Neural Reader: Design and Experimental Analysis.
JMIR Med Inform. 2018 Jan 5;6(1):e2. doi: 10.2196/medinform.8751.

Cited by

1
Sentence-Aligned Simplification of Biomedical Abstracts.
Artif Intell Med Conf Artif Intell Med (2005-). 2024;14844:322-333. doi: 10.1007/978-3-031-66538-7_32. Epub 2024 Jul 25.
2
Ontology enrichment using a large language model: Applying lexical, semantic, and knowledge network-based similarity for concept placement.
J Biomed Inform. 2025 Aug;168:104865. doi: 10.1016/j.jbi.2025.104865. Epub 2025 Jun 19.
3
A Dataset of Medical Questions Paired with Automatically Generated Answers and Evidence-supported References.
Sci Data. 2025 Jun 19;12(1):1035. doi: 10.1038/s41597-025-05233-z.
4
APPLS: Evaluating Evaluation Metrics for Plain Language Summarization.
Proc Conf Empir Methods Nat Lang Process. 2024 Nov;2024:9194-9211. doi: 10.18653/v1/2024.emnlp-main.519.
5
Improving Biomedical Science Literacy and Patient-Directed Knowledge of Tuberculosis (TB): A Cross-Sectional Infodemiology Study Examining Readability of Patient-Facing TB Information.
Br J Biomed Sci. 2024 Oct 22;81:13566. doi: 10.3389/bjbs.2024.13566. eCollection 2024.
6
Harnessing large language models' zero-shot and few-shot learning capabilities for regulatory research.
Brief Bioinform. 2024 Jul 25;25(5). doi: 10.1093/bib/bbae354.
7
Retrieval augmentation of large language models for lay language generation.
J Biomed Inform. 2024 Jan;149:104580. doi: 10.1016/j.jbi.2023.104580. Epub 2023 Dec 30.

References

1
A survey of automated methods for biomedical text simplification.
J Am Med Inform Assoc. 2022 Oct 7;29(11):1976-1988. doi: 10.1093/jamia/ocac149.
2
Towards Zero-Shot Conditional Summarization with Adaptive Multi-Task Fine-Tuning.
Proc Conf Empir Methods Nat Lang Process. 2020 Nov;2020:3215-3226.
3
Flight of the PEGASUS? Comparing Transformers on Few-Shot and Zero-Shot Multi-document Abstractive Summarization.
Proc Int Conf Comput Ling. 2020 Dec;2020:5640-5646.
4
Question-driven summarization of answers to consumer health questions.
Sci Data. 2020 Oct 2;7(1):322. doi: 10.1038/s41597-020-00667-z.
5
Online patient information from radiation oncology departments is too complex for the general population.
Pract Radiat Oncol. 2017 Jan-Feb;7(1):57-62. doi: 10.1016/j.prro.2016.07.008. Epub 2016 Aug 1.
6
A new readability yardstick.
J Appl Psychol. 1948 Jun;32(3):221-33. doi: 10.1037/h0057532.
7
Plain language: a strategic response to the health literacy challenge.
J Public Health Policy. 2007;28(1):71-93. doi: 10.1057/palgrave.jphp.3200102.
8
Two biomedical sublanguages: a description based on the theories of Zellig Harris.
J Biomed Inform. 2002 Aug;35(4):222-35. doi: 10.1016/s1532-0464(03)00012-1.