Jia Shuyue, Bit Subhrangshu, Searls Edward, Lauber Meagan V, Claus Lindsey A, Fan Pengrui, Jasodanand Varuna H, Veerapaneni Divya, Wang William M, Au Rhoda, Kolachalama Vijaya B
medRxiv. 2024 Nov 27:2024.07.11.24310304. doi: 10.1101/2024.07.11.24310304.
The proliferation of scientific podcasts has generated an extensive repository of audio content, rich in specialized terminology, diverse topics, and expert dialogues. Here, we introduce a computational framework designed to enhance large language models (LLMs) by leveraging this informational content from publicly accessible podcast data across science, technology, engineering, mathematics and medical (STEMM) disciplines. This dataset, comprising over 3,700 hours of audio content, was transcribed to generate over 42 million text tokens. Our model, PodGPT, integrates this wealth of complex dialogue found in audio podcasts to improve understanding of natural language nuances, cultural contexts, as well as scientific and medical knowledge. PodGPT also employs retrieval-augmented generation (RAG) on a vector database built from articles in Creative Commons PubMed Central, enhancing STEMM research and education by providing real-time access to emerging scientific literature. Evaluated across multiple benchmarks, PodGPT demonstrated an average improvement of 3.51 percentage points over standard open-source benchmarks and 3.81 percentage points when augmented with evidence from the RAG pipeline. Moreover, it showcased an average improvement of 4.06 percentage points in its zero-shot multilingual transfer ability, effectively generalizing to different linguistic contexts. By harnessing the untapped potential of podcast content, PodGPT advances natural language processing and conversational AI, offering enhanced capabilities for STEMM research and education.