Crawford Jake, Chikina Maria, Greene Casey S
Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States.
Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, United States.
Bioinform Adv. 2024 Jan 24;4(1):vbae004. doi: 10.1093/bioadv/vbae004. eCollection 2024.
Most models can be fit to data using various optimization approaches. While model choice is frequently reported in machine-learning-based research, the optimizer used is rarely noted. We applied two implementations of LASSO logistic regression from Python's scikit-learn package, built on two different optimization approaches (coordinate descent, implemented in the liblinear library, and stochastic gradient descent, or SGD), to predict mutation status and gene essentiality from gene expression across a variety of pan-cancer driver genes. Across varying levels of regularization, we compared performance and model sparsity between optimizers.
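As a concrete illustration, the sketch below (not the authors' pipeline; the data are a synthetic stand-in for gene expression matrices and all parameter values are illustrative assumptions) fits the same L1-penalized logistic regression objective with both scikit-learn implementations: LogisticRegression with the liblinear coordinate descent solver, and SGDClassifier with a logistic loss.

```python
# Minimal sketch, assuming synthetic data in place of the paper's
# gene expression matrices; parameter values are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic (samples x genes) matrix with a binary label (e.g. mutated or not).
X, y = make_classification(n_samples=500, n_features=1000, n_informative=20,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Optimizer 1: coordinate descent via liblinear. C is the *inverse*
# regularization strength (smaller C -> stronger L1 penalty -> sparser model);
# no learning rate is involved.
liblinear_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
liblinear_model.fit(X_train, y_train)

# Optimizer 2: stochastic gradient descent on the same L1-penalized logistic
# loss. alpha scales the penalty, and a learning rate schedule must be chosen.
sgd_model = SGDClassifier(loss="log_loss", penalty="l1", alpha=1e-3,
                          learning_rate="constant", eta0=0.01, random_state=0)
sgd_model.fit(X_train, y_train)

for name, model in [("liblinear", liblinear_model), ("SGD", sgd_model)]:
    print(f"{name}: test accuracy {model.score(X_test, y_test):.3f}, "
          f"{np.count_nonzero(model.coef_)} nonzero coefficients")
```

Both estimators minimize the same objective; the comparison in the paper turns on how the two optimizers reach it and how sensitive each is to its hyperparameters.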
After model selection and tuning, we found that liblinear and SGD tended to perform comparably. liblinear models required more extensive tuning of regularization strength, performing best at high model sparsities (fewer nonzero coefficients), but did not require selection of a learning rate parameter. SGD models required tuning of the learning rate to perform well, but generally performed more robustly across different model sparsities as regularization strength decreased. Given these tradeoffs, we believe the choice of optimizer should be clearly reported as part of the model selection and validation process, so that readers and reviewers can better understand the context in which results were generated.
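The tuning described above can be sketched as a simple cross-validated sweep: only the inverse regularization strength C for liblinear, but both the penalty weight alpha and the learning rate eta0 for SGD, recording held-out performance alongside model sparsity. The grids below are illustrative assumptions, not the values used in the paper.

```python
# Hedged sketch of the tuning loop; grids are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, as in the previous sketch.
X, y = make_classification(n_samples=500, n_features=1000, n_informative=20,
                           random_state=0)

# liblinear: only the inverse regularization strength C needs tuning.
liblinear_search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    {"C": np.logspace(-3, 2, 6)},
    scoring="roc_auc", cv=5,
).fit(X, y)

# SGD: both the penalty weight alpha and the learning rate eta0 need tuning.
sgd_search = GridSearchCV(
    SGDClassifier(loss="log_loss", penalty="l1", learning_rate="constant",
                  random_state=0),
    {"alpha": np.logspace(-5, 0, 6), "eta0": [0.001, 0.01, 0.1]},
    scoring="roc_auc", cv=5,
).fit(X, y)

for name, search in [("liblinear", liblinear_search), ("SGD", sgd_search)]:
    best = search.best_estimator_
    sparsity = np.mean(best.coef_ == 0)  # fraction of coefficients set to zero
    print(f"{name}: best params {search.best_params_}, "
          f"CV AUROC {search.best_score_:.3f}, sparsity {sparsity:.2f}")
```

Tracking sparsity alongside the cross-validated score makes the tradeoff in the results visible: the two optimizers can reach similar performance while selecting models of very different density.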
The code used to carry out the analyses in this study is available at https://github.com/greenelab/pancancer-evaluation/tree/master/01_stratified_classification. Performance versus regularization strength curves for all genes in the Vogelstein et al. (2013) dataset are available at https://doi.org/10.6084/m9.figshare.22728644.