Wu Weifei
Beijing Institute of Remote Sensing Equipment, Beijing 100084, People's Republic of China.
Neural Process Lett. 2022;54(6):4921-4950. doi: 10.1007/s11063-022-10841-6. Epub 2022 May 7.
Transfer learning uses knowledge from a source domain to support learning tasks in a weakly labeled or unlabeled target domain, which can substantially improve the performance of the target learning task. At present, growing awareness of privacy protection restricts access to data sources and poses new challenges for the development of transfer learning, yet research on privacy protection in transfer learning remains scarce. Existing work mainly applies differential privacy techniques and either ignores the distribution differences between data sources or neglects the conditional probability distribution of the data, which causes negative transfer and harms algorithm performance. This paper therefore proposes MultiSTLP, a multi-source selective transfer learning algorithm with privacy preservation, designed for scenarios in which the target domain contains an unlabeled dataset with only a small amount of group probability information and multiple source domains each contain a large labeled dataset. Group probability means that the class label of each individual sample in the target dataset is unknown, but the probability of each class within a given group of data is available; multiple source domains means there are more than two source domains, so the data comprise more than two source-domain datasets and one target-domain dataset. The algorithm adapts to both the marginal and conditional probability distribution differences between domains, and it protects the privacy of the target data while improving classification accuracy by fusing the ideas of multi-source transfer learning and group probability into a support vector machine. At the same time, it selects representative datasets from the source domains, speeding up the training process and improving efficiency.
Experimental results on several real datasets demonstrate the effectiveness of MultiSTLP and show advantages over state-of-the-art transfer learning algorithms.
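To make the group probability setting concrete, the following is a minimal sketch (not the paper's actual formulation) of how group-level class proportions can supervise a model even when no individual sample is labeled. The function name `group_probability_loss` and the squared-gap loss are illustrative assumptions; the paper instead embeds this information into an SVM objective.

```python
import numpy as np

def group_probability_loss(pred_probs, groups, group_class_probs):
    """Mean squared gap between predicted and known per-group class proportions.

    pred_probs       : (n_samples, n_classes) predicted class probabilities
    groups           : (n_samples,) group index of each sample
    group_class_probs: (n_groups, n_classes) known class proportions per group
    """
    n_groups = group_class_probs.shape[0]
    loss = 0.0
    for g in range(n_groups):
        members = pred_probs[groups == g]
        # Predicted proportion of each class within group g:
        # averaging per-sample probabilities estimates the group's class mix.
        pred_prop = members.mean(axis=0)
        loss += np.sum((pred_prop - group_class_probs[g]) ** 2)
    return loss / n_groups

# Two groups of unlabeled samples: group 0 is known to be 50/50 between
# the two classes, group 1 is known to be entirely class 0.
groups = np.array([0, 0, 1, 1])
known = np.array([[0.5, 0.5],
                  [1.0, 0.0]])
good_preds = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
bad_preds  = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
print(group_probability_loss(good_preds, groups, known))  # matches exactly: 0.0
print(group_probability_loss(bad_preds, groups, known))   # mismatched: > 0
```

Minimizing such a term drives per-sample predictions toward configurations consistent with the aggregate group statistics, which is why no individual label ever needs to be revealed.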