Optimizer's dilemma: optimization strongly influences model selection in transcriptomic prediction.

Author Information

Crawford Jake, Chikina Maria, Greene Casey S

Affiliations

Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States.

Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, United States.

Publication Information

Bioinform Adv. 2024 Jan 24;4(1):vbae004. doi: 10.1093/bioadv/vbae004. eCollection 2024.

Abstract

MOTIVATION

Most models can be fit to data using various optimization approaches. While model choice is frequently reported in machine-learning-based research, optimizers are not often noted. We applied two implementations of LASSO logistic regression from Python's scikit-learn package, using two different optimization approaches (coordinate descent, implemented in the liblinear library, and stochastic gradient descent, or SGD), to predict mutation status and gene essentiality from gene expression across a variety of pan-cancer driver genes. For varying levels of regularization, we compared performance and model sparsity between optimizers.

RESULTS

After model selection and tuning, we found that liblinear and SGD tended to perform comparably. liblinear models required more extensive tuning of regularization strength, performing best for high model sparsities (fewer nonzero coefficients), but did not require selection of a learning rate parameter. SGD models required tuning of the learning rate to perform well, but generally performed more robustly across different model sparsities as regularization strength decreased. Given these tradeoffs, we believe that the choice of optimizer should be clearly reported as part of the model selection and validation process, to allow readers and reviewers to better understand the context in which results have been generated.

AVAILABILITY AND IMPLEMENTATION

The code used to carry out the analyses in this study is available at https://github.com/greenelab/pancancer-evaluation/tree/master/01_stratified_classification. Performance/regularization strength curves for all genes in the Vogelstein et al. (2013) dataset are available at https://doi.org/10.6084/m9.figshare.22728644.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/90d5/10822580/8ea5e78ca2a5/vbae004f1.jpg
