Text summarization for pharmaceutical sciences using hierarchical clustering with a weighted evaluation methodology.

Affiliations

Applied Sciences, Lumilytics LLC, 436 N. Main St. #1004, Doylestown, PA, 18901, USA.

Decision Sciences, MResult Corporation, 12 Roosevelt Avenue, Mystic, CT, 06355, USA.

Publication information

Sci Rep. 2024 Aug 30;14(1):20149. doi: 10.1038/s41598-024-70618-w.

DOI: 10.1038/s41598-024-70618-w

PMID: 39209906

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11362166/

Abstract

In the pharmaceutical industry, project teams rely on an abundance of regulatory documents to understand the current regulatory landscape and make project decisions proactively. Because these documents are long, informative summaries are helpful to project teams. We propose a novel solution, MedicoVerse, to summarize such documents using advanced machine learning techniques. MedicoVerse uses a multi-stage approach. It first generates word embeddings of the regulatory documents with the SapBERT model. These embeddings then go through a critical hierarchical agglomerative clustering step, and the resulting clusters are organized in a custom data structure. Each cluster is summarized using the bart-large-cnn-samsum model, and the cluster summaries are merged into a comprehensive summary of the original document. We compare MedicoVerse with the established models T5, Google Pegasus, and Facebook BART, and with large language models such as Mixtral 8x7B Instruct, GPT-3.5, and Llama-2-70b, by introducing a scoring system that considers four factors: ROUGE score, BERTScore, business entities, and Flesch Reading Ease. Our results show that MedicoVerse outperforms the compared models, producing informative summaries of large regulatory documents.
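
The multi-stage approach and the four-factor scoring system described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy vectors stand in for SapBERT embeddings, `summarize_cluster` is a hypothetical placeholder for the bart-large-cnn-samsum model, and the cosine distance, average linkage, merge threshold, metric values, and weights are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering (assumed linkage).

    Starts with one cluster per vector and repeatedly merges the closest
    pair until the closest pair is farther apart than `threshold`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        pairs = [(i, j) for i in a for j in b]
        return sum(cosine_distance(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > threshold:
            break
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters

def summarize_cluster(sentences):
    """Hypothetical placeholder for an abstractive summarizer:
    it simply returns the first sentence of the cluster."""
    return sentences[0]

def weighted_score(metrics, weights):
    """Weighted average of metrics that are assumed to be scaled to [0, 1]."""
    total = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total

# Toy sentences and hand-made 3-d vectors standing in for real embeddings.
sentences = [
    "Stability data must cover the proposed shelf life.",
    "Shelf-life claims require long-term stability studies.",
    "Labels must list all active ingredients.",
    "Ingredient labeling follows regional rules.",
]
vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]

clusters = agglomerative_clusters(vectors, threshold=0.3)
summary = " ".join(summarize_cluster([sentences[i] for i in c]) for c in clusters)

# Hypothetical metric values and equal weights, for illustration only.
metrics = {"rouge": 0.4, "bertscore": 0.8, "entities": 0.5, "readability": 0.6}
weights = {"rouge": 0.25, "bertscore": 0.25, "entities": 0.25, "readability": 0.25}
score = weighted_score(metrics, weights)

print(clusters)
print(summary)
print(round(score, 3))
```

With real data, the toy vectors would be replaced by model embeddings and the placeholder by an actual summarization model. A single weighted score makes it possible to rank candidate models on one axis, but the ranking depends on how the four factors are normalized and weighted.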

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1fa6/11362166/f00981d51266/41598_2024_70618_Fig1_HTML.jpg
