Bioinformatics Research Group (BioRG), Florida International University, 11200 SW 8th St, Miami, 33199, FL, USA.
Department of Epidemiology, Florida International University, 11200 SW 8th St, Miami, 33199, FL, USA.
BMC Bioinformatics. 2020 Dec 9;21(Suppl 1):2. doi: 10.1186/s12859-019-3310-7.
BACKGROUND: Partial Least-Squares Discriminant Analysis (PLS-DA) is a popular machine learning tool that is gaining increasing attention as a feature selector and classifier. To understand its strengths and weaknesses, we performed a series of experiments with synthetic data and compared its performance to that of its close relative, Principal Component Analysis (PCA), from which it was originally derived.
RESULTS: We demonstrate that even though PCA ignores the class labels of the samples, this unsupervised tool can be remarkably effective as a feature selector. In some cases it outperforms PLS-DA, which receives the class labels as part of its input. Our experiments range from examining the signal-to-noise ratio in the feature selection task to considering many practical distributions and models encountered when analyzing bioinformatics and clinical data. Other methods were also evaluated. Finally, we analyzed a data set of 396 vaginal microbiome samples for which the ground truth for feature selection was available. All the 3D figures shown in this paper, as well as the supplementary ones, can be viewed interactively at http://biorg.cs.fiu.edu/plsda.
CONCLUSIONS: Our results highlighted the strengths and weaknesses of PLS-DA in comparison with PCA for different underlying data models.
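The kind of comparison described above can be illustrated with a small synthetic experiment. The sketch below is not the authors' protocol. It assumes a two-class data set in which a few "signal" features differ in mean between classes and the rest are pure Gaussian noise, and it ranks features by the absolute loading on the first PCA component and by the absolute weight of the first PLS component (the two-class, single-response case, where the first weight vector is proportional to X^T y after centering). All function names, sample sizes, and the shift value are hypothetical choices made for illustration.

```python
import numpy as np


def make_two_class_data(n_samples=200, n_features=50, n_signal=5,
                        shift=3.0, seed=0):
    """Synthetic data: the first n_signal features are shifted in class 1.

    Assumes n_samples is even so the two classes are balanced.
    """
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_samples // 2)
    X = rng.normal(size=(n_samples, n_features))
    X[y == 1, :n_signal] += shift
    return X, y


def pca_scores(X):
    """Unsupervised score: |loading| of each feature on the first PC."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return np.abs(vt[0])


def plsda_scores(X, y):
    """Supervised score: |weight| of each feature in the first PLS component.

    For a single centered response, the first PLS weight vector is
    proportional to X^T y; later components are not computed here.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    return np.abs(w / np.linalg.norm(w))


def recall_at_k(scores, true_idx, k):
    """Fraction of true signal features found among the k top-scored ones."""
    top = np.argsort(scores)[::-1][:k]
    return len(np.intersect1d(top, true_idx)) / len(true_idx)


X, y = make_two_class_data()
signal = np.arange(5)
print("PCA recall@5:   ", recall_at_k(pca_scores(X), signal, 5))
print("PLS-DA recall@5:", recall_at_k(plsda_scores(X, y), signal, 5))
```

With a large class shift, the between-class variation dominates the total variance, so the first principal component tends to point at the signal features even though PCA never sees the labels. Lowering `shift` or raising `n_features` in this sketch is one way to probe the signal-to-noise regime in which the two rankings start to diverge.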