Coalescence and Translation: A Language Model for Population Genetics.

Authors

Korfmann Kevin, Pope Nathaniel S, Meleghy Melinda, Tellier Aurélien, Kern Andrew D

Affiliations

University of Oregon, Institute of Ecology and Evolution, Eugene, USA.

Technical University of Munich, Department of Life Science Systems, Munich, Germany.

Publication information

bioRxiv. 2025 Jun 27:2025.06.24.661337. doi: 10.1101/2025.06.24.661337.

Abstract

Probabilistic models such as the sequentially Markovian coalescent (SMC) have long provided a powerful framework for population genetic inference, enabling reconstruction of demographic history and ancestral relationships from genomic data. However, these methods are inherently specialized, relying on predefined assumptions and/or suffering from limited scalability. Recent advances in simulation and deep learning provide an alternative approach: learning directly to generalize from synthetic genetic data to infer specific hidden evolutionary processes. Here we reframe the inference of coalescence times as a problem of translation between two biological languages: the sparse, observable patterns of mutation along the genome and the unobservable ancestral recombination graph (ARG) that gave rise to them. Inspired by large language models, we develop cxt, a decoder-only transformer that autoregressively predicts coalescent events conditioned on local mutational context. We show that cxt performs on par with state-of-the-art MCMC-based likelihood models across a broad range of demographic scenarios, including both in-distribution and out-of-distribution settings. Trained on simulations spanning the stdpopsim catalog, the model generalizes robustly and enables efficient inference at scale, producing over a million coalescence predictions in minutes. In addition, cxt produces a well-calibrated approximate posterior distribution of its predictions, enabling principled uncertainty quantification. Our work moves towards a foundation model for population genetics, bridging deep learning and coalescent theory to enable flexible, scalable inference of genealogical history from genomic data.
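
To make the idea of autoregressive prediction with sampled uncertainty concrete, the following is a minimal toy sketch and not the cxt implementation. Everything in it is an assumption made for illustration: the function names, the number of time bins (N_BINS), the made-up mutation counts, and above all toy_next_token_probs, a hand-written stand-in for a trained decoder. It only shows the general pattern of sampling one discretized coalescence-time bin per genomic window, conditioned on the previous sample and the local mutation count, and repeating the sampling to obtain an empirical distribution per window.

```python
import math
import random

N_BINS = 8  # hypothetical number of discretized coalescence-time bins


def toy_next_token_probs(prev_bin, mutation_count):
    """Hypothetical stand-in for a trained decoder (NOT the cxt model).

    Returns a probability distribution over time bins for the next window.
    Assumed toy behaviour: more mutations shift mass toward older bins,
    and adjacent windows tend to share a similar bin.
    """
    target = min(N_BINS - 1, mutation_count)  # crude mutation-to-time mapping
    logits = []
    for b in range(N_BINS):
        score = -abs(b - target)              # agree with the local mutation context
        if prev_bin is not None:
            score -= 0.5 * abs(b - prev_bin)  # stay close to the previous prediction
        logits.append(score)
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]  # numerically stable softmax
    total = sum(weights)
    return [w / total for w in weights]


def sample_sequence(mutation_counts, rng):
    """Autoregressive decoding: each window's bin is sampled conditioned on
    the previously sampled bin and that window's mutation count."""
    bins, prev = [], None
    for count in mutation_counts:
        probs = toy_next_token_probs(prev, count)
        prev = rng.choices(range(N_BINS), weights=probs, k=1)[0]
        bins.append(prev)
    return bins


def approximate_posterior(mutation_counts, n_samples=200, seed=0):
    """Repeated sampling yields an empirical distribution over bins per window."""
    rng = random.Random(seed)
    counts = [[0] * N_BINS for _ in mutation_counts]
    for _ in range(n_samples):
        for window, b in enumerate(sample_sequence(mutation_counts, rng)):
            counts[window][b] += 1
    return [[c / n_samples for c in row] for row in counts]


windows = [0, 1, 5, 6, 2]  # made-up mutation counts per genomic window
posterior = approximate_posterior(windows)
```

Each row of posterior is an empirical distribution over time bins for one window. In a real setting the toy scoring function would be replaced by a trained network, and calibration would have to be checked against simulations with known coalescence times.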

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b692/12262695/6beb28201104/nihpp-2025.06.24.661337v1-f0002.jpg
