Semi-supervised detection of natural selection with positive-unlabeled learning.

Author information

Arnab Sandipan Paul, Campelo Dos Santos Andre Luiz, Fumagalli Matteo, DeGiorgio Michael

Publication information

bioRxiv. 2025 Aug 18:2025.08.15.670602. doi: 10.1101/2025.08.15.670602.

DOI:10.1101/2025.08.15.670602

PMID:40894649

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12393379/

Abstract

Identifying genomic regions shaped by natural selection is a central goal in evolutionary genomics. Existing machine learning methods for this task are typically trained on simulated genomic data labeled according to specific evolutionary scenarios. While effective in controlled settings, these models are limited by their reliance on explicit class labels: they can detect only the specific processes they were trained to recognize, which makes predictions hard to interpret for regions influenced by other evolutionary forces. This limitation is especially problematic when analyzing empirical genomes shaped by a mixture of adaptive, demographic, and ecological factors. One-vs.-rest strategies offer a potential alternative, but suffer from the inherent complexity of modeling all other evolutionary and demographic processes as a catch-all "rest" class. Here, we explore positive-unlabeled learning as a more flexible framework for detecting adaptive events. Positive-unlabeled learning is a semi-supervised approach that permits identification of samples of a target class using only positive labels and an unlabeled background, without requiring explicit modeling of negative samples. To assess the utility of this approach, we focus on a binary classification setting for detecting selective sweeps (positive samples) arising from positive natural selection against a mixed background composed of both unlabeled sweeps and neutrally evolving regions. To accomplish this goal, we introduce a method that employs only a set of labeled sweep observations for training while treating all remaining data as unlabeled. By avoiding assumptions about the composition of the background, the method enables robust sweep discovery in realistic genomic landscapes. We systematically evaluate its performance across a range of demographic, adaptive, and confounding contexts, including domain shift arising from misspecified demographic models, and find that it delivers high performance and generalizability. To demonstrate a practical application of the method, we analyzed European and Bengali in Bangladesh human genomes, treating the empirical genomes as unlabeled sets, and recapitulated several previously identified sweep candidates. Our results show that the method provides a powerful and versatile alternative for detecting adaptive regions, with the potential to generalize across a range of genomic landscapes.
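The positive-unlabeled setup described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' method. It applies the classic Elkan and Noto correction, which assumes that labeled positives are selected completely at random from all positives, to a hypothetical one-dimensional summary statistic. The function names (`simulate`, `pu_fit`), the Gaussian toy data, and all parameter values are assumptions made for illustration only.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ss, lr=0.5, epochs=500):
    """Fit g(x) = sigmoid(w*x + b) to labeled (1) vs. unlabeled (0) targets."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, s in zip(xs, ss):
            err = sigmoid(w * x + b) - s  # gradient of the log loss
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


def simulate(n_pos=400, n_neg=600, label_frac=0.3, seed=0):
    """Toy data (assumed): sweeps ~ N(2, 1), neutral regions ~ N(-2, 1).

    Returns (x, y, s) tuples: y is the hidden truth, s = 1 only for a
    random subset of true sweeps; every other record is unlabeled (s = 0).
    """
    rng = random.Random(seed)
    data = [(rng.gauss(2.0, 1.0), 1) for _ in range(n_pos)]
    data += [(rng.gauss(-2.0, 1.0), 0) for _ in range(n_neg)]
    return [(x, y, 1 if (y == 1 and rng.random() < label_frac) else 0)
            for x, y in data]


def pu_fit(records, holdout_frac=0.2, seed=1):
    """Elkan-Noto style PU classifier; the hidden truth y is never used."""
    rng = random.Random(seed)
    labeled = [x for x, _, s in records if s == 1]
    unlabeled = [x for x, _, s in records if s == 0]
    rng.shuffle(labeled)
    k = max(1, int(holdout_frac * len(labeled)))
    held_out, train_pos = labeled[:k], labeled[k:]

    # Step 1: model p(s=1 | x), i.e. labeled vs. unlabeled.
    xs = train_pos + unlabeled
    ss = [1] * len(train_pos) + [0] * len(unlabeled)
    w, b = fit_logistic(xs, ss)

    # Step 2: estimate c = p(s=1 | y=1) on held-out labeled positives.
    c = sum(sigmoid(w * x + b) for x in held_out) / len(held_out)

    # Step 3: rescale to approximate p(y=1 | x), clipped to at most 1.
    def predict(x):
        return min(1.0, sigmoid(w * x + b) / c)

    return predict, c


if __name__ == "__main__":
    records = simulate()
    predict, c = pu_fit(records)
    print("estimated label frequency c:", round(c, 3))
    print("score at x = 2:", round(predict(2.0), 3))
    print("score at x = -2:", round(predict(-2.0), 3))
```

The design point is that the classifier is trained only to separate labeled sweeps from everything else, and the rescaling by `c` converts that output into an approximate sweep probability. When labeling is not completely at random, which is plausible for empirical genomes, this correction no longer holds and a different estimator would be needed.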




