Mirkes Evgeny M, Bac Jonathan, Fouché Aziz, Stasenko Sergey V, Zinovyev Andrei, Gorban Alexander N
School of Computing and Mathematical Sciences, University of Leicester, Leicester LE1 7RH, UK.
Institut Curie, PSL Research University, 75005 Paris, France.
Entropy (Basel). 2022 Dec 24;25(1):33. doi: 10.3390/e25010033.
Domain adaptation is a popular paradigm in modern machine learning that aims to tackle the problem of divergence (or shift) between the labeled training and validation datasets (source domain) and a potentially large unlabeled dataset (target domain). The task is to embed both datasets into a common space in which the source dataset is informative for training while the divergence between source and target is minimized. The most popular domain adaptation solutions are based on training neural networks that combine classification and adversarial learning modules, which frequently makes them both data-hungry and difficult to train. We present a method called Domain Adaptation Principal Component Analysis (DAPCA) that identifies a linear reduced data representation useful for solving the domain adaptation task. The DAPCA algorithm introduces positive and negative weights between pairs of data points and generalizes the supervised extension of principal component analysis. DAPCA is an iterative algorithm that solves a simple quadratic optimization problem at each iteration. The convergence of the algorithm is guaranteed, and the number of iterations is small in practice. We validate the suggested algorithm on previously proposed benchmarks for solving the domain adaptation task. We also show the benefit of using DAPCA in analyzing single-cell omics datasets in biomedical applications. Overall, DAPCA can serve as a practical preprocessing step in many machine learning applications, producing reduced dataset representations that take into account possible divergence between source and target domains.
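The abstract describes each DAPCA iteration as a simple quadratic optimization problem built from positive and negative weights between pairs of points. A minimal sketch of one such step follows. It assumes the objective is to maximize the weighted sum of squared projected pairwise distances over orthonormal projections, which reduces to an eigenproblem. The function names `weighted_pairwise_pca` and `supervised_weights`, the parameters `attract` and `repel`, and the specific weight values are hypothetical illustrations, not the authors' implementation. The iterative re-weighting between source and target points is not shown.

```python
import numpy as np


def weighted_pairwise_pca(X, W, n_components):
    """Find orthonormal P maximizing sum_ij W[i, j] * ||P.T @ (x_i - x_j)||^2.

    X: (n_samples, n_features) data matrix.
    W: (n_samples, n_samples) pairwise weights; positive values push a pair
       apart in the projection, negative values pull it together.
    Returns (P, eigenvalues) with P of shape (n_features, n_components).
    """
    X = np.asarray(X, dtype=float)
    W = (np.asarray(W, dtype=float) + np.asarray(W, dtype=float).T) / 2.0
    # For symmetric W, sum_ij W_ij (x_i - x_j)(x_i - x_j)^T = 2 * X^T (D - W) X,
    # where D is the diagonal matrix of row sums of W.
    L = np.diag(W.sum(axis=1)) - W
    Q = X.T @ L @ X  # half of the objective matrix; same eigenvectors
    eigvals, eigvecs = np.linalg.eigh(Q)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order], eigvals[order]


def supervised_weights(labels, attract=1.0, repel=1.0):
    """Hypothetical weight scheme: same-class pairs attract, others repel."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    W = np.where(same, -attract, repel).astype(float)
    np.fill_diagonal(W, 0.0)  # a point has no interaction with itself
    return W
```

With all off-diagonal weights equal to 1, `Q` is proportional to the centered scatter matrix, so the sketch recovers ordinary PCA. Passing `supervised_weights(labels)` instead gives a supervised variant in which same-class points are pulled together and different-class points are pushed apart.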