

Text summarization for pharmaceutical sciences using hierarchical clustering with a weighted evaluation methodology.

Affiliations

Applied Sciences, Lumilytics LLC, 436 N. Main St. #1004, Doylestown, PA, 18901, USA.

Decision Sciences, MResult Corporation, 12 Roosevelt Avenue, Mystic, CT, 06355, USA.

Publication information

Sci Rep. 2024 Aug 30;14(1):20149. doi: 10.1038/s41598-024-70618-w.

Abstract

In the pharmaceutical industry, project teams rely on a large body of regulatory documents to understand the current regulatory landscape and make proactive project decisions. Because these documents are long, informative summaries are helpful to those teams. We propose a novel solution, MedicoVerse, to summarize such documents using advanced machine learning techniques. MedicoVerse uses a multi-stage approach that begins by computing word embeddings of the regulatory documents with the SapBERT model. These embeddings pass through a critical hierarchical agglomerative clustering step, and the resulting clusters are organized in a custom data structure. Each cluster is summarized using the bart-large-cnn-samsum model, and the cluster summaries are merged to create a comprehensive summary of the original document. We compare MedicoVerse with the established models T5, Google Pegasus, and Facebook BART, and with large language models such as Mixtral 8x7B Instruct, GPT-3.5, and Llama-2-70b, by introducing a scoring system that considers four factors: ROUGE score, BERTScore, business entities, and Flesch Reading Ease. Our results show that MedicoVerse outperforms the compared models, thus producing informative summaries of large regulatory documents.
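
The abstract names two steps that are easy to picture in code: agglomerative clustering of embeddings, and a single score that combines four evaluation factors. The sketch below is illustrative only and is not the authors' implementation. It assumes average linkage with cosine distance, toy 2-D vectors in place of SapBERT embeddings, and a plain weighted average over factors already scaled to [0, 1]. The abstract specifies none of these choices, and the function names `cosine_distance`, `agglomerative_clusters`, and `weighted_score` are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def agglomerative_clusters(vectors, n_clusters):
    """Bottom-up clustering; returns a list of clusters as lists of indices.

    Assumption: average linkage over cosine distance. The abstract does not
    state which linkage or distance MedicoVerse uses.
    """
    clusters = [[i] for i in range(len(vectors))]  # start with singletons
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            dists = [cosine_distance(vectors[p], vectors[q])
                     for p in clusters[i] for q in clusters[j]]
            avg = sum(dists) / len(dists)
            if best is None or avg < best[0]:
                best = (avg, i, j)
        _, i, j = best
        # Merge the closest pair (i < j, so deleting j keeps i valid).
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


def weighted_score(metrics, weights):
    """Weighted average of factors that are already scaled to [0, 1].

    Assumption: a simple normalized weighted sum. The weights and any
    normalization in the actual scoring system are not given in the abstract.
    """
    total = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total
```

With the toy vectors `[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]` and `n_clusters=2`, the first two vectors and the last two vectors end up in separate clusters. In a pipeline like the one described, each cluster's text would then be summarized separately and the summaries merged. For `weighted_score`, a raw Flesch Reading Ease value would need to be rescaled first (for example, divided by 100 and clamped) before being combined with ROUGE, BERTScore, and an entity-coverage ratio. That rescaling is likewise an assumption.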


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1fa6/11362166/f00981d51266/41598_2024_70618_Fig1_HTML.jpg
