Jia Shuyue, Bit Subhrangshu, Searls Edward, Lauber Meagan V, Fan Pengrui, Wang William M, Claus Lindsey A, Jasodanand Varuna H, Veerapaneni Divya, Au Rhoda, Kolachalama Vijaya B
Department of Electrical & Computer Engineering, Boston University, Boston, MA USA.
Department of Computer Science, Boston University, Boston, MA USA.
NPJ Biomed Innov. 2025;2(1):26. doi: 10.1038/s44385-025-00022-0. Epub 2025 Jul 7.
The proliferation of scientific podcasts has generated an extensive repository of educational content, rich in specialized terminology, diverse topics, and expert dialogues. Here, we introduce a computational framework designed to enhance large language models by leveraging content from publicly accessible audio podcasts across science, technology, engineering, mathematics, and medicine (STEMM). The dataset, comprising over 3700 hours of audio, was transcribed to generate over 42 million text tokens. Our model, PodGPT, integrates the complex dialogue found in these podcasts to improve its understanding of natural language nuances, cultural contexts, and scientific and medical knowledge. PodGPT also employs retrieval-augmented generation (RAG) on a vector database, providing real-time access to emerging scientific literature. Evaluated on multiple benchmarks, PodGPT demonstrated an average improvement of 1.82 percentage points over standard open-source benchmarks and 2.43 percentage points when augmented with evidence from the RAG pipeline. Moreover, it showed an average improvement of 1.18 percentage points in zero-shot multilingual transfer, generalizing effectively to different linguistic contexts. By harnessing the untapped potential of podcast content, PodGPT advances natural language processing and conversational AI, offering enhanced capabilities for STEMM research and education.
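The retrieval step mentioned above (RAG on a vector database) can be illustrated with a minimal sketch. This is not PodGPT's implementation: the bag-of-words `embed` function stands in for a learned embedding model, `VectorStore` stands in for a real vector database, and every name and example passage here is a hypothetical placeholder chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: a real system would use a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class VectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.entries = []  # list of (passage, vector) pairs

    def add(self, passage):
        self.entries.append((passage, embed(passage)))

    def search(self, query, k=2):
        """Return the k passages most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [passage for passage, _ in ranked[:k]]


def build_prompt(question, store, k=2):
    """Prepend retrieved evidence to the question before it is sent to a language model."""
    evidence = store.search(question, k)
    context = "\n".join(f"- {p}" for p in evidence)
    return f"Evidence:\n{context}\n\nQuestion: {question}\nAnswer:"


# Placeholder passages, not drawn from the paper's literature database.
store = VectorStore()
store.add("Metformin is a first-line therapy for type 2 diabetes.")
store.add("Podcasts cover topics across science and medicine.")
store.add("Amyloid plaques are associated with Alzheimer's disease.")

question = "What is a first-line therapy for type 2 diabetes?"
prompt = build_prompt(question, store, k=1)
print(prompt)
```

In this sketch the language model itself is left out; the point is only that the prompt is assembled from the question plus the top-ranked passages, so newly indexed literature can inform an answer without retraining the model.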