Institute of Genetics and Biometry, Leibniz Institute for Farm Animal Biology, 18196, Dummerstorf, Germany.
Department of Biostatistics, University of Washington, Seattle, WA, 98195, USA.
BMC Bioinformatics. 2020 Sep 15;21(1):407. doi: 10.1186/s12859-020-03725-w.
Statistical analyses of biological problems in the life sciences often lead to high-dimensional linear models. Penalization approaches are often the methods of choice for solving the corresponding system of equations. They are especially useful under multicollinearity, which arises when the number of explanatory variables exceeds the number of observations or for biological reasons. In these approaches, the model's goodness of fit is penalized by a suitable function of the model parameters. Prominent examples are the lasso, group lasso and sparse-group lasso. Here, we offer a fast and computationally inexpensive implementation of these operators via proximal gradient descent. The grid search over the penalty parameter is realized with warm starts. The step size between consecutive iterations is determined by backtracking line search. Finally, seagull, the R package presented here, produces complete regularization paths.
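To make the three algorithmic ingredients named above concrete (proximal gradient descent, backtracking line search, and warm starts along a penalty grid), here is a minimal sketch for the plain lasso. It is not seagull's code. The function names, the loss scaling 1/(2n), the default step size and the stopping rule are all assumptions chosen for illustration, and it uses only the Python standard library.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def smooth_loss(X, y, beta):
    """Least-squares part of the objective: 1/(2n) * ||y - X beta||^2."""
    total = 0.0
    for row, yi in zip(X, y):
        r = yi - sum(x * b for x, b in zip(row, beta))
        total += r * r
    return total / (2.0 * len(y))


def gradient(X, y, beta):
    """Gradient of smooth_loss with respect to beta."""
    n, p = len(y), len(beta)
    grad = [0.0] * p
    for row, yi in zip(X, y):
        r = sum(x * b for x, b in zip(row, beta)) - yi
        for j in range(p):
            grad[j] += row[j] * r / n
    return grad


def lasso_prox_grad(X, y, lam, beta0, step=1.0, shrink=0.5,
                    tol=1e-8, max_iter=10000):
    """Proximal gradient descent for the lasso, started at beta0."""
    beta = list(beta0)
    for _ in range(max_iter):
        g = gradient(X, y, beta)
        f_old = smooth_loss(X, y, beta)
        t = step
        while True:
            # Gradient step on the smooth part, then the lasso prox.
            cand = [soft_threshold(b - t * gj, t * lam)
                    for b, gj in zip(beta, g)]
            diff = [c - b for c, b in zip(cand, beta)]
            # Backtracking: accept t once the quadratic upper bound holds.
            bound = (f_old + sum(gj * d for gj, d in zip(g, diff))
                     + sum(d * d for d in diff) / (2.0 * t))
            if smooth_loss(X, y, cand) <= bound + 1e-15:
                break
            t *= shrink
        change = max(abs(d) for d in diff)
        beta = cand
        if change < tol:
            break
    return beta


def lasso_path(X, y, lambdas):
    """Regularization path from the largest to the smallest penalty.

    Warm start: each solution initializes the solver for the next penalty.
    """
    beta = [0.0] * len(X[0])
    path = []
    for lam in sorted(lambdas, reverse=True):
        beta = lasso_prox_grad(X, y, lam, beta)
        path.append((lam, list(beta)))
    return path
```

For a toy orthogonal design such as `lasso_path([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0], [0.25, 1.0, 1.5])`, the minimizer under this scaling has the closed form of soft-thresholding each response at twice the penalty, so the path can be checked by hand. Running the grid from the largest penalty downward means each warm start begins close to the next solution, which is the usual motivation for this ordering.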
Publicly available high-dimensional methylation data were used to compare seagull with the established R package SGL. The results of both packages enabled a precise prediction of biological age from DNA methylation status. Although the results of seagull and SGL were very similar (R > 0.99), seagull computed the solution in a fraction of the time needed by SGL. Additionally, seagull enables the incorporation of weights for each penalized feature.
The following operators for linear regression models are available in seagull: lasso, group lasso, sparse-group lasso and Integrative LASSO with Penalty Factors (IPF-lasso). Thus, seagull is a convenient envelope of lasso variants.
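The operators listed above are closely related through their proximal steps, which the following sketch illustrates for a single group of coefficients. It assumes the common sparse-group lasso parameterization with a mixing parameter `alpha`, a group weight equal to the square root of the group size, and optional per-feature weights that scale only the l1 term. These conventions, and the function name `sparse_group_prox`, are assumptions for illustration. How seagull applies feature weights internally is not stated here.

```python
import math


def sparse_group_prox(z, t, lam, alpha, weights=None):
    """Proximal step for one coefficient group under the assumed penalty
    lam * (alpha * sum_j w_j * |b_j| + (1 - alpha) * sqrt(p) * ||b||_2),
    where p is the group size and t is the step size.
    """
    if weights is None:
        weights = [1.0] * len(z)
    # Step 1: elementwise soft-thresholding (the lasso part).
    u = []
    for zj, wj in zip(z, weights):
        thr = t * alpha * lam * wj
        u.append(math.copysign(max(abs(zj) - thr, 0.0), zj))
    # Step 2: shrink the whole group toward zero (the group lasso part).
    norm = math.sqrt(sum(v * v for v in u))
    group_thr = t * (1.0 - alpha) * lam * math.sqrt(len(z))
    if norm <= group_thr:
        return [0.0] * len(z)  # the entire group is removed
    scale = 1.0 - group_thr / norm
    return [scale * v for v in u]
```

Under these assumptions, `alpha = 1` reduces to a (weighted) lasso step and `alpha = 0` to a group lasso step. Giving different weights to features from different data sources mimics the penalty-factor idea behind IPF-lasso.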