Semi-supervised detection of natural selection with positive-unlabeled learning.

Author information

Arnab Sandipan Paul, Campelo Dos Santos Andre Luiz, Fumagalli Matteo, DeGiorgio Michael

Publication information

bioRxiv. 2025 Aug 18:2025.08.15.670602. doi: 10.1101/2025.08.15.670602.

Abstract

Identifying genomic regions shaped by natural selection is a central goal in evolutionary genomics. Existing machine learning methods for this task are typically trained using simulated genomic data labeled according to specific evolutionary scenarios. While effective in controlled settings, these models are limited by their reliance on explicit class labels. They can only detect the specific processes they were trained to recognize, making it difficult to interpret predictions for regions influenced by other evolutionary forces. This limitation is especially problematic when analyzing empirical genomes shaped by a mixture of adaptive, demographic, and ecological factors. One-vs.-rest strategies offer a potential alternative, but suffer from the inherent complexity of modeling all other evolutionary and demographic processes as a catch-all "rest" class. Here, we explore positive-unlabeled learning as a more flexible framework for detection of adaptive events. Positive-unlabeled learning is a semi-supervised approach that permits identification of samples of a target class using only positive labels and an unlabeled background, without requiring explicit modeling of negative samples. To assess the utility of this approach, we focus on a binary classification setting for detecting selective sweeps (positive samples) arising from positive natural selection against a mixed background composed of both unlabeled sweeps and neutrally evolving regions. To accomplish this goal, we introduce a method that employs only a set of labeled sweep observations for training while treating all remaining data as unlabeled. By avoiding assumptions about the composition of the background, the method enables robust sweep discovery in realistic genomic landscapes. We systematically evaluate its performance across a range of demographic, adaptive, and confounding contexts, including domain shift arising from misspecified demographic models, and find that it delivers high performance and generalizability. To demonstrate a practical application of the method, we analyzed European and Bengali in Bangladesh human genomes, treating the empirical genomes as unlabeled sets, and recapitulated several previously identified sweep candidates. Our results show that the method provides a powerful and versatile alternative for detecting adaptive regions, with the potential to generalize across a range of genomic landscapes.
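To make the positive-unlabeled idea concrete, the sketch below shows one classic estimator (the Elkan and Noto correction), not the method introduced in the preprint, whose details the abstract does not give. It assumes labeled positives are a random sample of all positives. The single feature, its Gaussian distributions, the sample sizes, and the function names are all hypothetical stand-ins for sweep summary statistics.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ss, lr=0.3, epochs=600):
    """Fit g(x) = P(s=1 | x) for a 1-D feature by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, s in zip(xs, ss):
            err = sigmoid(w * x + b) - s
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def pu_demo(seed=0):
    rng = random.Random(seed)
    # Hypothetical feature: "sweep" windows ~ N(2, 1), "neutral" windows ~ N(-2, 1).
    labeled_pos = [rng.gauss(2, 1) for _ in range(200)]
    # The unlabeled background mixes hidden sweeps (about 30%) with neutral windows.
    hidden = [rng.random() < 0.3 for _ in range(800)]
    unlabeled = [rng.gauss(2, 1) if h else rng.gauss(-2, 1) for h in hidden]

    # Step 1: train "labeled vs. unlabeled" (s = 1 for labeled, 0 otherwise).
    xs = labeled_pos + unlabeled
    ss = [1] * len(labeled_pos) + [0] * len(unlabeled)
    w, b = fit_logistic(xs, ss)

    # Step 2: estimate c = P(s=1 | y=1) as the mean score on labeled positives
    # (a held-out set would be preferable; reused here for brevity).
    c = sum(sigmoid(w * x + b) for x in labeled_pos) / len(labeled_pos)

    # Step 3: corrected posterior P(y=1 | x) = g(x) / c, clipped to at most 1.
    posterior = [min(1.0, sigmoid(w * x + b) / c) for x in unlabeled]
    predicted = [p > 0.5 for p in posterior]
    accuracy = sum(p == h for p, h in zip(predicted, hidden)) / len(hidden)
    return c, accuracy


if __name__ == "__main__":
    c, accuracy = pu_demo()
    print(f"estimated label frequency c = {c:.2f}")
    print(f"accuracy on hidden labels  = {accuracy:.2f}")
```

No negative examples are ever labeled in this sketch: the classifier only separates labeled sweeps from the background, and dividing by the estimated label frequency rescales its scores into sweep probabilities for the unlabeled windows.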


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/969d/12393379/ae6f601e9a31/nihpp-2025.08.15.670602v1-f0001.jpg
