A variable selection approach in the multivariate linear model: an application to LC-MS metabolomics data.

Authors

Perrot-Dockès Marie, Lévy-Leduc Céline, Chiquet Julien, Sansonnet Laure, Brégère Margaux, Étienne Marie-Pierre, Robin Stéphane, Genta-Jouve Grégory

Affiliations

UMR MIA-Paris, AgroParisTech, INRA - Université Paris-Saclay, 75005 Paris, France.

UMR CNRS 8638 Comète - Université Paris-Descartes, CNRS, 75006 Paris, France.

Publication information

Stat Appl Genet Mol Biol. 2018 Sep 8;17(5):/j/sagmb.2018.17.issue-5/sagmb-2017-0077/sagmb-2017-0077.xml. doi: 10.1515/sagmb-2017-0077.

Abstract

Omic data are characterized by strong dependence structures that result either from data acquisition or from underlying biological processes. Statistical procedures that do not adjust the variable selection step to the dependence pattern may lose power and select spurious variables. The goal of this paper is to propose a variable selection procedure within the multivariate linear model framework that accounts for the dependence between the multiple responses. We focus on a specific type of dependence, which assumes that the responses of a given individual can be modelled as a time series. We propose a novel Lasso-based approach within the multivariate linear model framework that takes the dependence structure into account by using the covariance structures of different types of stationary processes for the random error matrix. Our numerical experiments show that including the estimation of the covariance matrix of the random error matrix in the Lasso criterion dramatically improves the variable selection performance. Our approach is successfully applied to an untargeted LC-MS (Liquid Chromatography-Mass Spectrometry) data set of African copal samples. Our methodology is implemented in the R package MultiVarSel, which is available from the Comprehensive R Archive Network (CRAN).
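The idea summarized in the abstract (estimate the error covariance, whiten the responses, then run a Lasso) can be illustrated with a minimal Python sketch. It is not the MultiVarSel API and not the authors' code. It assumes an AR(1) process as one example of a stationary dependence structure, uses a plain coordinate-descent Lasso, and picks the penalty `lam` by hand. All function names (`ar1_whitener`, `estimate_phi`, `lasso_cd`, `select_variables`) are hypothetical.

```python
import numpy as np

def ar1_whitener(phi, q):
    """Matrix W such that rows of E @ W are white when rows of E are
    stationary AR(1) series with coefficient phi and unit innovations."""
    L = np.eye(q)
    L[0, 0] = np.sqrt(1.0 - phi ** 2)
    idx = np.arange(1, q)
    L[idx, idx - 1] = -phi          # e_t - phi * e_{t-1}
    return L.T                      # transposed so it right-multiplies rows

def estimate_phi(resid):
    """Lag-1 autocorrelation pooled over all rows of the residual matrix."""
    return np.sum(resid[:, 1:] * resid[:, :-1]) / np.sum(resid ** 2)

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(A, y, lam, n_iter=200):
    """Coordinate descent for 0.5 * ||y - A b||^2 + lam * ||b||_1."""
    b = np.zeros(A.shape[1])
    col_sq = np.sum(A ** 2, axis=0)
    r = y.copy()                    # running residual y - A b
    for _ in range(n_iter):
        for j in range(A.shape[1]):
            if col_sq[j] == 0.0:
                continue
            r += A[:, j] * b[j]     # remove coordinate j from the fit
            b[j] = soft_threshold(A[:, j] @ r, lam) / col_sq[j]
            r -= A[:, j] * b[j]     # add the updated coordinate back
    return b

def select_variables(X, Y, lam):
    """Whiten Y with an estimated AR(1) covariance, then Lasso on vec(B)."""
    q = Y.shape[1]
    resid = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]  # OLS residuals
    phi = estimate_phi(resid)
    W = ar1_whitener(phi, q)
    A = np.kron(W.T, X)             # vec(X B W) = (W^T kron X) vec(B)
    y = (Y @ W).flatten(order="F")  # column-major vec of the whitened Y
    b = lasso_cd(A, y, lam)
    return b.reshape(X.shape[1], q, order="F"), phi

if __name__ == "__main__":
    # Simulated data (assumption): 3 groups, 20 responses, AR(1) noise.
    rng = np.random.default_rng(0)
    n, q, phi_true = 30, 20, 0.7
    X = np.kron(np.eye(3), np.ones((n // 3, 1)))   # one-hot group design
    B = np.zeros((3, q))
    B[0, 4], B[2, 11] = 3.0, -3.0
    E = np.zeros((n, q))
    E[:, 0] = rng.standard_normal(n) / np.sqrt(1 - phi_true ** 2)
    for t in range(1, q):
        E[:, t] = phi_true * E[:, t - 1] + rng.standard_normal(n)
    B_hat, phi_hat = select_variables(X, X @ B + E, lam=15.0)
    print("estimated phi:", round(float(phi_hat), 2))
    print("selected (group, response) pairs:", np.argwhere(B_hat != 0).tolist())
```

Vectorizing the whitened model turns the multivariate problem into a single univariate Lasso, which is why the Kronecker product appears. In practice the penalty would be tuned by a data-driven rule rather than fixed by hand as it is here.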
