Crawford Jake, Chikina Maria, Greene Casey S
Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA.
Patterns (N Y). 2024 Dec 6;5(12):101115. doi: 10.1016/j.patter.2024.101115. eCollection 2024 Dec 13.
Guidelines in statistical modeling for genomics hold that simpler models have advantages over more complex ones. Potential advantages include cost, interpretability, and improved generalization across datasets or biological contexts. We directly tested the assumption that small gene signatures generalize better by examining the generalization of mutation status prediction models across datasets (from cell lines to human tumors and vice versa) and biological contexts (holding out entire cancer types from pan-cancer data). We compared two model selection strategies: one based solely on cross-validation performance, and one that combines cross-validation performance with regularization strength. We did not observe that more regularized signatures generalized better. This result held across both generalization problems and for both linear models (LASSO logistic regression) and non-linear ones (neural networks). When the goal of an analysis is to produce generalizable predictive models, we recommend choosing the ones that perform best on held-out data or in cross-validation instead of those that are smaller or more regularized.
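The two model selection strategies compared in the abstract can be illustrated with a minimal sketch. The abstract does not state how cross-validation performance and regularization strength were combined, so the "most regularized model within one standard error of the best" rule used here is an assumption, chosen only because it is a common heuristic. The `CVResult` structure, the function names, and the candidate scores are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CVResult:
    """Cross-validation summary for one candidate model (hypothetical structure)."""
    penalty: float     # regularization strength; larger means more regularized
    mean_score: float  # mean cross-validation score, higher is better
    std_error: float   # standard error of the score across folds


def select_best(results: Sequence[CVResult]) -> CVResult:
    """Strategy 1: pick the candidate with the highest mean CV score."""
    return max(results, key=lambda r: r.mean_score)


def select_most_regularized_within_one_se(results: Sequence[CVResult]) -> CVResult:
    """Strategy 2 (assumed rule): pick the most regularized candidate whose
    mean CV score is within one standard error of the best candidate's."""
    best = select_best(results)
    threshold = best.mean_score - best.std_error
    eligible = [r for r in results if r.mean_score >= threshold]
    return max(eligible, key=lambda r: r.penalty)


# Made-up numbers for illustration only; these are not results from the paper.
candidates = [
    CVResult(penalty=0.01, mean_score=0.79, std_error=0.02),
    CVResult(penalty=0.10, mean_score=0.82, std_error=0.02),
    CVResult(penalty=1.00, mean_score=0.81, std_error=0.03),
    CVResult(penalty=10.0, mean_score=0.70, std_error=0.04),
]

best = select_best(candidates)
sparse = select_most_regularized_within_one_se(candidates)
print(f"best cross-validation score: penalty={best.penalty}")
print(f"most regularized within one SE: penalty={sparse.penalty}")
```

In this toy example the two strategies choose different candidates. The abstract's recommendation corresponds to the first strategy: when generalization is the goal, prefer the model that performs best in cross-validation or on held-out data.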