IEEE Trans Cybern. 2019 May;49(5):1932-1943. doi: 10.1109/TCYB.2018.2816984. Epub 2018 Apr 2.
Supervised learning requires class labels, but in many applications labels are corrupted or missing. In binary classification, for example, when only a subset of positive instances is labeled and the remainder are unlabeled, positive-unlabeled (PU) learning is required to build a model from both positive and unlabeled data. Similarly, when some instances are mislabeled, methods are needed for learning in the presence of class label noise (LN). Here we propose adaptive sampling (AdaSampling), a framework for both PU learning and learning with class LN. By iteratively estimating the class mislabeling probability with an adaptive sampling procedure, the proposed method progressively reduces the risk of selecting mislabeled instances for model training and subsequently constructs highly generalizable models even when a large proportion of the instances in the data are mislabeled. We demonstrate the utility of the proposed methods using simulation and benchmark data, and compare them to alternative approaches that are commonly used for PU learning and/or learning with LN. We then introduce two novel bioinformatics applications where AdaSampling is used to: 1) identify kinase-substrates from mass spectrometry-based phosphoproteomics data and 2) predict transcription factor target genes by integrating various next-generation sequencing data.
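The iterative procedure described in the abstract can be illustrated for the PU setting with a minimal sketch. This is not the authors' implementation. The abstract does not specify the base classifier, the sample size, or the stopping rule, so the sketch assumes a plain logistic regression as a stand-in learner, draws as many pseudo-negatives as there are labeled positives, and runs a fixed number of iterations. All function names (`fit_logistic`, `predict_pos_proba`, `ada_sampling_pu`, `make_toy_pu_data`) and the toy data are hypothetical.

```python
import numpy as np


def _sigmoid(z):
    # Clip logits so np.exp cannot overflow.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def fit_logistic(X, y, lr=0.1, n_steps=500):
    """Stand-in base learner (assumption): logistic regression by gradient descent."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        grad = Xb.T @ (_sigmoid(Xb @ w) - y) / len(y)
        w -= lr * grad
    return w


def predict_pos_proba(w, X):
    """Estimated probability that each row of X belongs to the positive class."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return _sigmoid(Xb @ w)


def ada_sampling_pu(X_pos, X_unl, n_iter=10, seed=0):
    """Sketch of adaptive sampling for PU data.

    Unlabeled instances are treated as candidate negatives. Each round samples
    pseudo-negatives in proportion to the current estimate that an instance is
    truly negative, retrains the learner, and updates those estimates.
    Returns the final estimated positive-class probability per unlabeled row.
    """
    rng = np.random.default_rng(seed)
    n_pos, n_unl = X_pos.shape[0], X_unl.shape[0]
    p_neg = np.ones(n_unl)  # start: every unlabeled instance equally likely negative
    for _ in range(n_iter):
        weights = np.clip(p_neg, 1e-6, None)  # keep sampling weights strictly positive
        idx = rng.choice(n_unl, size=n_pos, replace=True, p=weights / weights.sum())
        X_train = np.vstack([X_pos, X_unl[idx]])
        y_train = np.concatenate([np.ones(n_pos), np.zeros(n_pos)])
        w = fit_logistic(X_train, y_train)
        # Instances that look positive become less likely to be drawn as negatives.
        p_neg = 1.0 - predict_pos_proba(w, X_unl)
    return 1.0 - p_neg


def make_toy_pu_data(seed=0):
    """Hypothetical toy data: 50 labeled positives; 50 hidden positives and
    100 true negatives mixed together as the unlabeled set."""
    rng = np.random.default_rng(seed)
    X_pos = rng.normal(2.0, 1.0, size=(50, 2))
    hidden_pos = rng.normal(2.0, 1.0, size=(50, 2))
    true_neg = rng.normal(-2.0, 1.0, size=(100, 2))
    X_unl = np.vstack([hidden_pos, true_neg])
    is_hidden_pos = np.arange(150) < 50
    return X_pos, X_unl, is_hidden_pos


if __name__ == "__main__":
    X_pos, X_unl, is_hidden_pos = make_toy_pu_data()
    p_pos = ada_sampling_pu(X_pos, X_unl)
    print("mean P(pos), hidden positives:", p_pos[is_hidden_pos].mean())
    print("mean P(pos), true negatives:  ", p_pos[~is_hidden_pos].mean())
```

In this sketch the sampling weights are the only state carried between rounds: as the learner becomes more confident that an unlabeled instance is positive, that instance is drawn less often as a pseudo-negative, which is the risk-reduction idea the abstract describes. A label-noise variant would apply the same reweighting to both classes; that extension is not shown here.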