Flexible Data Trimming Improves Performance of Global Machine Learning Methods in Omics-Based Personalized Oncology.

Affiliations

OmicsWayCorp, Walnut, CA 91788, USA.

Institute for Personalized Medicine, I.M. Sechenov First Moscow State Medical University, 119991 Moscow, Russia.

Publication information

Int J Mol Sci. 2020 Jan 22;21(3):713. doi: 10.3390/ijms21030713.

DOI:10.3390/ijms21030713

PMID:31979006

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7037338/

Abstract

(1) Background: Machine learning (ML) methods are rarely used for omics-based prescription of cancer drugs, owing to a shortage of case histories in which the clinical outcome is supplemented by high-throughput molecular data. This shortage makes most ML methods vulnerable and prone to overtraining. Recently, we proposed a hybrid global-local approach to ML termed floating window projective separator (FloWPS) that avoids extrapolation in the feature space. Its core property is data trimming, i.e., sample-specific removal of irrelevant features. (2) Methods: Here, we applied FloWPS to seven popular ML methods: linear SVM, nearest neighbors (kNN), random forest (RF), Tikhonov (ridge) regression (RR), binomial naïve Bayes (BNB), adaptive boosting (ADA) and multi-layer perceptron (MLP). (3) Results: We performed computational experiments on 21 high-throughput gene expression datasets (41-235 samples per dataset), representing a total of 1778 cancer patients with known responses to chemotherapy. FloWPS substantially improved the classifier quality for all global ML methods (SVM, RF, BNB, ADA, MLP): the area under the receiver operating characteristic curve (ROC AUC) for the treatment response classifiers increased from the 0.61-0.88 range to 0.70-0.94. We tested FloWPS-empowered methods for overtraining by interrogating the importance of different features for different ML methods in the same model datasets. (4) Conclusions: We showed that FloWPS increases the correlation of feature importance between the different ML methods, which indicates its robustness to overtraining. For all the datasets tested, the best performance of FloWPS data trimming was observed for the BNB method, which can be valuable for further building of ML classifiers in personalized oncology.
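The abstract describes data trimming only as sample-specific removal of features so that the classifier does not extrapolate in the feature space; it does not spell out the exact rule. The sketch below is one possible reading, not the authors' implementation: a feature is kept for a given sample only if at least `m` training values lie on each side of the sample's value. The function name `trim_features`, the parameter `m`, and the toy data are all assumptions made for illustration.

```python
from typing import List, Sequence


def trim_features(train: Sequence[Sequence[float]],
                  sample: Sequence[float],
                  m: int = 1) -> List[int]:
    """Return indices of the features kept for this particular sample.

    Assumed rule (hypothetical): keep a feature only if at least `m`
    training values are <= the sample's value and at least `m` are >= it,
    so a model fitted on the kept features does not extrapolate.
    """
    kept = []
    for j, value in enumerate(sample):
        column = [row[j] for row in train]
        below = sum(1 for x in column if x <= value)
        above = sum(1 for x in column if x >= value)
        if below >= m and above >= m:
            kept.append(j)
    return kept


# Toy data (hypothetical): 4 training samples, 3 features.
train = [
    [1.0, 5.0, 0.2],
    [2.0, 6.0, 0.4],
    [3.0, 7.0, 0.6],
    [4.0, 8.0, 0.8],
]
sample = [2.5, 9.0, 0.5]  # feature 1 lies above every training value

kept = trim_features(train, sample, m=1)
# Restrict both the training set and the sample to the kept features
# before fitting any classifier for this sample.
trimmed_train = [[row[j] for j in kept] for row in train]
trimmed_sample = [sample[j] for j in kept]
print(kept)  # [0, 2]
```

Because the kept feature set depends on the sample being classified, a separate model would be fitted per sample on `trimmed_train`. Under this reading, a larger `m` trims more aggressively.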

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/326f/7037338/c21cc1b2da53/ijms-21-00713-g001.jpg

Similar articles

Flexible Data Trimming Improves Performance of Global Machine Learning Methods in Omics-Based Personalized Oncology.

Int J Mol Sci. 2020 Jan 22;21(3):713. doi: 10.3390/ijms21030713.

FLOating-Window Projective Separator (FloWPS): A Data Trimming Tool for Support Vector Machines (SVM) to Improve Robustness of the Classifier.

Front Genet. 2019 Jan 15;9:717. doi: 10.3389/fgene.2018.00717. eCollection 2018.

CRlncRC: a machine learning-based method for cancer-related long noncoding RNA identification using integrated features.

BMC Med Genomics. 2018 Dec 31;11(Suppl 6):120. doi: 10.1186/s12920-018-0436-9.

Classification of THz pulse signals using two-dimensional cross-correlation feature extraction and non-linear classifiers.

Comput Methods Programs Biomed. 2016 Apr;127:64-82. doi: 10.1016/j.cmpb.2016.01.017. Epub 2016 Feb 1.

High-Throughput Mutation Data Now Complement Transcriptomic Profiling: Advances in Molecular Pathway Activation Analysis Approach in Cancer Biology.

Cancer Inform. 2019 Mar 25;18:1176935119838844. doi: 10.1177/1176935119838844. eCollection 2019.

Wrapper method for feature selection to classify cardiac arrhythmia.

Annu Int Conf IEEE Eng Med Biol Soc. 2017 Jul;2017:3656-3659. doi: 10.1109/EMBC.2017.8037650.

Machine learning models in breast cancer survival prediction.

Technol Health Care. 2016;24(1):31-42. doi: 10.3233/THC-151071.

Clear Cell Renal Cell Carcinoma: Machine Learning-Based Quantitative Computed Tomography Texture Analysis for Prediction of Fuhrman Nuclear Grade.

Eur Radiol. 2019 Mar;29(3):1153-1163. doi: 10.1007/s00330-018-5698-2. Epub 2018 Aug 30.

Comparison of machine learning classifiers for differentiation of grade 1 from higher gradings in meningioma: A multicenter radiomics study.

Magn Reson Imaging. 2019 Nov;63:244-249. doi: 10.1016/j.mri.2019.08.011. Epub 2019 Aug 16.

Identification of Diagnostic Markers for Major Depressive Disorder Using Machine Learning Methods.

Front Neurosci. 2021 Jun 18;15:645998. doi: 10.3389/fnins.2021.645998. eCollection 2021.

Cited by

Bioinformatics in Russia: history and present-day landscape.

Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae513.

Machine learning algorithms' application to predict childhood vaccination among children aged 12-23 months in Ethiopia: Evidence 2016 Ethiopian Demographic and Health Survey dataset.

PLoS One. 2023 Oct 18;18(10):e0288867. doi: 10.1371/journal.pone.0288867. eCollection 2023.

Uniformly shaped harmonization combines human transcriptomic data from different platforms while retaining their biological properties and differential gene expression patterns.

Front Mol Biosci. 2023 Sep 6;10:1237129. doi: 10.3389/fmolb.2023.1237129. eCollection 2023.

Machine learning for predicting accuracy of lung and liver tumor motion tracking using radiomic features.

Quant Imaging Med Surg. 2023 Mar 1;13(3):1605-1618. doi: 10.21037/qims-22-621. Epub 2023 Jan 9.

Transcriptomic Harmonization as the Way for Suppressing Cross-Platform Bias and Batch Effect.

Biomedicines. 2022 Sep 18;10(9):2318. doi: 10.3390/biomedicines10092318.

Machine Learning: A New Prospect in Multi-Omics Data Analysis of Cancer.

Front Genet. 2022 Jan 27;13:824451. doi: 10.3389/fgene.2022.824451. eCollection 2022.

Recent Trends in Cancer Genomics and Bioinformatics Tools Development.

Int J Mol Sci. 2021 Nov 10;22(22):12146. doi: 10.3390/ijms222212146.

Machine Learning Applicability for Classification of PAD/VCD Chemotherapy Response Using 53 Multiple Myeloma RNA Sequencing Profiles.

Front Oncol. 2021 Apr 15;11:652063. doi: 10.3389/fonc.2021.652063. eCollection 2021.

Editorial: Next Generation Sequencing Based Diagnostic Approaches in Clinical Oncology.

Front Oncol. 2021 Jan 28;10:635555. doi: 10.3389/fonc.2020.635555. eCollection 2020.

System, Method and Software for Calculation of a Cannabis Drug Efficiency Index for the Reduction of Inflammation.

Int J Mol Sci. 2020 Dec 31;22(1):388. doi: 10.3390/ijms22010388.
