Text summarization for pharmaceutical sciences using hierarchical clustering with a weighted evaluation methodology.

Author information

Applied Sciences, Lumilytics LLC, 436 N. Main St. #1004, Doylestown, PA, 18901, USA.

Decision Sciences, MResult Corporation, 12 Roosevelt Avenue, Mystic, CT, 06355, USA.

Publication information

Sci Rep. 2024 Aug 30;14(1):20149. doi: 10.1038/s41598-024-70618-w.

Abstract

In the pharmaceutical industry, there is an abundance of regulatory documents used to understand the current regulatory landscape and proactively make project decisions. Due to the size of these documents, it is helpful for project teams to have informative summaries. We propose a novel solution, MedicoVerse, to summarize such documents using advanced machine learning techniques. MedicoVerse uses a multi-stage approach that begins by computing word embeddings of the regulatory documents with the SapBERT model. These embeddings are put through a critical hierarchical agglomerative clustering step, and the clusters are organized through a custom data structure. Each cluster is summarized using the bart-large-cnn-samsum model, and the cluster summaries are merged to create a comprehensive summary of the original document. We compare MedicoVerse results with the established models T5, Google Pegasus, and Facebook BART, and with large language models such as Mixtral 8x7b instruct, GPT 3.5, and Llama-2-70b, by introducing a scoring system that considers four factors: ROUGE score, BERTScore, business entities, and the Flesch Reading Ease. Our results show that MedicoVerse outperforms the compared models, thus producing informative summaries of large regulatory documents.
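The clustering and scoring steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes average linkage over cosine distance, non-zero embedding vectors, metrics already normalized to the range 0 to 1, and hypothetical weights; the abstract states none of these details. The function names `agglomerative_clusters` and `weighted_score` are invented for illustration, and the toy two-dimensional vectors stand in for SapBERT embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_clusters(
    vectors: Sequence[Sequence[float]], n_clusters: int
) -> List[List[int]]:
    """Naive bottom-up clustering (average linkage is an assumption here).

    Starts with one cluster per vector and repeatedly merges the two
    clusters with the smallest mean pairwise cosine distance until
    n_clusters remain. Returns sorted lists of vector indices.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pair_sum = sum(
                    cosine_distance(vectors[p], vectors[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                dist = pair_sum / (len(clusters[i]) * len(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)


def weighted_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of metrics that are assumed to share a 0-1 scale."""
    total_weight = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total_weight


# Toy stand-ins for sentence embeddings: two groups pointing in different directions.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(agglomerative_clusters(toy_vectors, n_clusters=2))  # [[0, 1], [2, 3]]

# Hypothetical metric values and weights for one candidate summary.
example_metrics = {"rouge": 0.4, "bertscore": 0.8, "entities": 0.6, "readability": 0.5}
example_weights = {"rouge": 1.0, "bertscore": 1.0, "entities": 1.0, "readability": 1.0}
print(weighted_score(example_metrics, example_weights))
```

In a fuller pipeline, each index list would select the sentences passed to a summarization model, and the per-cluster summaries would then be concatenated. The Flesch Reading Ease usually falls between 0 and 100, so it would need rescaling before being combined with ROUGE and BERTScore in this way.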

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1fa6/11362166/f00981d51266/41598_2024_70618_Fig1_HTML.jpg
