OmicsWay Corp., Walnut, CA 91788, USA.
Institute for Personalized Medicine, I.M. Sechenov First Moscow State Medical University, 119991 Moscow, Russia.
Int J Mol Sci. 2020 Jan 22;21(3):713. doi: 10.3390/ijms21030713.
(1) Background: Machine learning (ML) methods are rarely used for omics-based prescription of cancer drugs because of a shortage of case histories in which the clinical outcome is supplemented by high-throughput molecular data. This shortage leads to overtraining and makes most ML methods highly vulnerable. Recently, we proposed a hybrid global-local approach to ML, termed the floating window projective separator (FloWPS), that avoids extrapolation in the feature space. Its core property is data trimming, i.e., sample-specific removal of irrelevant features. (2) Methods: Here, we applied FloWPS to seven popular ML methods: linear support vector machines (SVM), k nearest neighbors (kNN), random forest (RF), Tikhonov (ridge) regression (RR), binomial naïve Bayes (BNB), adaptive boosting (ADA) and multi-layer perceptron (MLP). (3) Results: We performed computational experiments on 21 high-throughput gene expression datasets (41-235 samples per dataset) representing a total of 1778 cancer patients with known responses to chemotherapy treatments. FloWPS substantially improved the classifier quality for all global ML methods (SVM, RF, BNB, ADA, MLP): the area under the receiver operating characteristic curve (ROC AUC) of the treatment response classifiers increased from the 0.61-0.88 range to the 0.70-0.94 range. We tested the FloWPS-empowered methods for overtraining by examining the importance assigned to individual features by the different ML methods on the same model datasets. (4) Conclusions: We showed that FloWPS increases the correlation of feature importance between the different ML methods, which indicates its robustness to overtraining. Across all the datasets tested, FloWPS data trimming performed best with the BNB method, which may be valuable for building future ML classifiers in personalized oncology.
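The data-trimming idea described in the abstract (sample-specific removal of features for which prediction would require extrapolation) can be illustrated with a minimal sketch. The abstract does not state the trimming criterion, so the rule used here (keep a feature only when at least `m` training samples lie on each side of the test sample's value) is an assumption, and the names `trim_features`, `project` and `m` are hypothetical. It is not the authors' implementation.

```python
def trim_features(train_rows, test_row, m=1):
    """Return indices of features where test_row is not an extrapolation.

    Assumed criterion (not taken from the abstract): a feature is kept when
    at least m training values are <= the test value and at least m training
    values are >= the test value.
    """
    kept = []
    for j, value in enumerate(test_row):
        column = [row[j] for row in train_rows]
        at_or_below = sum(1 for v in column if v <= value)
        at_or_above = sum(1 for v in column if v >= value)
        if at_or_below >= m and at_or_above >= m:
            kept.append(j)
    return kept


def project(rows, kept):
    """Restrict every row to the kept feature indices."""
    return [[row[j] for j in kept] for row in rows]


# Toy data: three training samples, three features (values are illustrative).
train = [
    [1.0, 5.0, 0.2],
    [2.0, 6.0, 0.4],
    [3.0, 7.0, 0.6],
]
test = [2.5, 9.0, 0.1]

kept = trim_features(train, test, m=1)
trimmed_train = project(train, kept)
trimmed_test = project([test], kept)[0]
print(kept, trimmed_train, trimmed_test)
```

In this toy case only the first feature survives, because the test values of the other two fall outside the range seen in training. Any classifier would then be fitted on `trimmed_train` and applied to `trimmed_test`, so the trimming is repeated separately for each sample to be classified.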