AdaSampling for Positive-Unlabeled and Label Noise Learning With Bioinformatics Applications.

Publication information

IEEE Trans Cybern. 2019 May;49(5):1932-1943. doi: 10.1109/TCYB.2018.2816984. Epub 2018 Apr 2.

Abstract

Class labels are required for supervised learning but may be corrupted or missing in various applications. In binary classification, for example, when only a subset of positive instances is labeled whereas the remaining are unlabeled, positive-unlabeled (PU) learning is required to model from both positive and unlabeled data. Similarly, when class labels are corrupted by mislabeled instances, methods are needed for learning in the presence of class label noise (LN). Here we propose adaptive sampling (AdaSampling), a framework for both PU learning and learning with class LN. By iteratively estimating the class mislabeling probability with an adaptive sampling procedure, the proposed method progressively reduces the risk of selecting mislabeled instances for model training and subsequently constructs highly generalizable models even when a large proportion of mislabeled instances is present in the data. We demonstrate the utility of the proposed methods using simulation and benchmark data, and compare them to alternative approaches that are commonly used for PU learning and/or learning with LN. We then introduce two novel bioinformatics applications where AdaSampling is used to: 1) identify kinase-substrates from mass spectrometry-based phosphoproteomics data and 2) predict transcription factor target genes by integrating various next-generation sequencing data.
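The iterative procedure summarized in the abstract can be illustrated with a minimal sketch for the PU case. This is not the authors' implementation: the abstract does not specify the base classifier, the initial mislabeling probability, or the number of iterations, so the logistic regression, the starting value of 0.5, the default of five iterations, and all function names below are assumptions made for illustration only.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, epochs=200, lr=0.1):
    """Plain batch gradient-descent logistic regression (assumed base learner)."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(model, x):
    # Estimated probability that x belongs to the positive class.
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def adasampling_pu(positives, unlabeled, n_iter=5, seed=0):
    """Return, for each unlabeled instance, an estimated probability of being negative.

    Hypothetical sketch: each round draws unlabeled instances as negatives with
    probability equal to the current estimate, retrains the classifier on the
    labeled positives plus that sample, and updates the estimates.
    """
    rng = random.Random(seed)
    p_neg = [0.5] * len(unlabeled)  # assumed uninformative starting value
    for _ in range(n_iter):
        # Instances that look positive are less likely to be drawn as negatives.
        sampled = [x for x, p in zip(unlabeled, p_neg) if rng.random() < p]
        if not sampled:
            break  # nothing left to treat as negative
        X = positives + sampled
        y = [1] * len(positives) + [0] * len(sampled)
        model = fit_logistic(X, y)
        p_neg = [1.0 - predict_proba(model, x) for x in unlabeled]
    return p_neg
```

Calling `adasampling_pu(labeled_positives, unlabeled)` with lists of feature vectors returns one probability per unlabeled instance; unlabeled instances that resemble the labeled positives should receive lower values and are therefore drawn less often as negatives in later rounds. A label-noise variant would, under the same assumptions, apply the sampling step to both classes rather than to the unlabeled set alone.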

Similar articles

Positive-unlabeled learning in bioinformatics and computational biology: a brief review.

Brief Bioinform. 2022 Jan 17;23(1). doi: 10.1093/bib/bbab461.

Estimating classification accuracy in positive-unlabeled learning: characterization and correction strategies.

Pac Symp Biocomput. 2019;24:124-135.

Leveraging permutation testing to assess confidence in positive-unlabeled learning applied to high-dimensional biological datasets.

BMC Bioinformatics. 2024 Jun 19;25(1):218. doi: 10.1186/s12859-024-05834-2.

Convex formulation of multiple instance learning from positive and unlabeled bags.

Neural Netw. 2018 Sep;105:132-141. doi: 10.1016/j.neunet.2018.05.001. Epub 2018 May 24.

A network-based positive and unlabeled learning approach for fake news detection.

Mach Learn. 2022;111(10):3549-3592. doi: 10.1007/s10994-021-06111-6. Epub 2021 Nov 18.

Predict subcellular locations of singleplex and multiplex proteins by semi-supervised learning and dimension-reducing general mode of Chou's PseAAC.

IEEE Trans Nanobioscience. 2013 Dec;12(4):311-20. doi: 10.1109/TNB.2013.2272014.

Sample Subset Optimization Techniques for Imbalanced and Ensemble Learning Problems in Bioinformatics Applications.

IEEE Trans Cybern. 2014 Mar;44(3):445-55. doi: 10.1109/TCYB.2013.2257480. Epub 2013 Sep 30.

Loss Decomposition and Centroid Estimation for Positive and Unlabeled Learning.

IEEE Trans Pattern Anal Mach Intell. 2021 Mar;43(3):918-932. doi: 10.1109/TPAMI.2019.2941684. Epub 2021 Feb 4.

Ensemble positive unlabeled learning for disease gene identification.

PLoS One. 2014 May 9;9(5):e97079. doi: 10.1371/journal.pone.0097079. eCollection 2014.

Cited by

mADP-GCNPUAS: mA-Disease Prediction via Graph Convolutional Network and Positive-Unlabeled Learning with Self-Adaptive Sampling.

Interdiscip Sci. 2025 Aug 30. doi: 10.1007/s12539-025-00760-0.

Bipolar and schizophrenia risk gene encodes an autophagy receptor coupling the regulation of PKA kinase network homeostasis to synaptic transmission.

Res Sq. 2025 Mar 13:rs.3.rs-6043477. doi: 10.21203/rs.3.rs-6043477/v1.

Predicting host range expansion in parasitic mites using a global mammalian-acarine dataset.

Nat Commun. 2024 Jun 26;15(1):5431. doi: 10.1038/s41467-024-49515-3.

AI-guided pipeline for protein-protein interaction drug discovery identifies a SARS-CoV-2 inhibitor.

Mol Syst Biol. 2024 Apr;20(4):428-457. doi: 10.1038/s44320-024-00019-8. Epub 2024 Mar 11.

Automatic quality control of single-cell and single-nucleus RNA-seq using valiDrops.

NAR Genom Bioinform. 2023 Nov 18;5(4):lqad101. doi: 10.1093/nargab/lqad101. eCollection 2023 Dec.

SnapKin: a snapshot deep learning ensemble for kinase-substrate prediction from phosphoproteomics data.

NAR Genom Bioinform. 2023 Nov 6;5(4):lqad099. doi: 10.1093/nargab/lqad099. eCollection 2023 Dec.

A multi-omics integrative analysis based on CRISPR screens re-defines the pluripotency regulatory network in ESCs.

Commun Biol. 2023 Apr 14;6(1):410. doi: 10.1038/s42003-023-04700-w.

PLUS: Predicting cancer metastasis potential based on positive and unlabeled learning.

PLoS Comput Biol. 2022 Mar 29;18(3):e1009956. doi: 10.1371/journal.pcbi.1009956. eCollection 2022 Mar.

Co-evolution based machine-learning for predicting functional interactions between human genes.

Nat Commun. 2021 Nov 9;12(1):6454. doi: 10.1038/s41467-021-26792-w.

Protocol for the processing and downstream analysis of phosphoproteomic data with PhosR.

STAR Protoc. 2021 Jun 5;2(2):100585. doi: 10.1016/j.xpro.2021.100585. eCollection 2021 Jun 18.
