Crawford Jake, Chikina Maria, Greene Casey S
Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States.
Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, United States.
Bioinform Adv. 2024 Jan 24;4(1):vbae004. doi: 10.1093/bioadv/vbae004. eCollection 2024.
Most models can be fit to data using various optimization approaches. While model choice is frequently reported in machine-learning-based research, the optimizer used is rarely noted. We applied two implementations of LASSO logistic regression from Python's scikit-learn package, built on two different optimization approaches (coordinate descent, implemented in the liblinear library, and stochastic gradient descent, or SGD), to predict mutation status and gene essentiality from gene expression across a variety of pan-cancer driver genes. Across varying levels of regularization, we compared performance and model sparsity between optimizers.
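As a concrete illustration, the sketch below (not the authors' pipeline; the data are a synthetic stand-in for gene expression matrices and all parameter values are illustrative assumptions) fits the same L1-penalized logistic regression objective with both scikit-learn implementations: LogisticRegression with the liblinear coordinate descent solver, and SGDClassifier with a logistic loss.

```python
# Minimal sketch, assuming synthetic data in place of the paper's
# gene expression matrices; parameter values are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic (samples x genes) matrix with a binary label (e.g. mutated or not).
X, y = make_classification(n_samples=500, n_features=1000, n_informative=20,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Optimizer 1: coordinate descent via liblinear. C is the *inverse*
# regularization strength (smaller C -> stronger L1 penalty -> sparser model);
# no learning rate is involved.
liblinear_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
liblinear_model.fit(X_train, y_train)

# Optimizer 2: stochastic gradient descent on the same L1-penalized logistic
# loss. alpha scales the penalty, and a learning rate schedule must be chosen.
sgd_model = SGDClassifier(loss="log_loss", penalty="l1", alpha=1e-3,
                          learning_rate="constant", eta0=0.01, random_state=0)
sgd_model.fit(X_train, y_train)

for name, model in [("liblinear", liblinear_model), ("SGD", sgd_model)]:
    print(f"{name}: test accuracy {model.score(X_test, y_test):.3f}, "
          f"{np.count_nonzero(model.coef_)} nonzero coefficients")
```

Both estimators minimize the same objective; the comparison in the paper turns on how the two optimizers reach it and how sensitive each is to its hyperparameters.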
After model selection and tuning, we found that liblinear and SGD tended to perform comparably. liblinear models required more extensive tuning of regularization strength, performing best at high model sparsities (fewer nonzero coefficients), but did not require selection of a learning rate parameter. SGD models required tuning of the learning rate to perform well, but generally performed more robustly across different model sparsities as regularization strength decreased. Given these tradeoffs, we believe the choice of optimizer should be clearly reported as part of the model selection and validation process, so that readers and reviewers can better understand the context in which results were generated.
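The tuning described above can be sketched as a simple cross-validated sweep: only the inverse regularization strength C for liblinear, but both the penalty weight alpha and the learning rate eta0 for SGD, recording held-out performance alongside model sparsity. The grids below are illustrative assumptions, not the values used in the paper.

```python
# Hedged sketch of the tuning loop; grids are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, as in the previous sketch.
X, y = make_classification(n_samples=500, n_features=1000, n_informative=20,
                           random_state=0)

# liblinear: only the inverse regularization strength C needs tuning.
liblinear_search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    {"C": np.logspace(-3, 2, 6)},
    scoring="roc_auc", cv=5,
).fit(X, y)

# SGD: both the penalty weight alpha and the learning rate eta0 need tuning.
sgd_search = GridSearchCV(
    SGDClassifier(loss="log_loss", penalty="l1", learning_rate="constant",
                  random_state=0),
    {"alpha": np.logspace(-5, 0, 6), "eta0": [0.001, 0.01, 0.1]},
    scoring="roc_auc", cv=5,
).fit(X, y)

for name, search in [("liblinear", liblinear_search), ("SGD", sgd_search)]:
    best = search.best_estimator_
    sparsity = np.mean(best.coef_ == 0)  # fraction of coefficients set to zero
    print(f"{name}: best params {search.best_params_}, "
          f"CV AUROC {search.best_score_:.3f}, sparsity {sparsity:.2f}")
```

Tracking sparsity alongside the cross-validated score makes the tradeoff in the results visible: the two optimizers can reach similar performance while selecting models of very different density.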
The code used to carry out the analyses in this study is available at https://github.com/greenelab/pancancer-evaluation/tree/master/01_stratified_classification. Performance versus regularization strength curves for all genes in the Vogelstein et al. (2013) dataset are available at https://doi.org/10.6084/m9.figshare.22728644.