Avoiding pitfalls in L1-regularised inference of gene networks.

Author information

Tjärnberg Andreas, Nordling Torbjörn E M, Studham Matthew, Nelander Sven, Sonnhammer Erik L L

Affiliation

Stockholm Bioinformatics Centre, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden.

Publication information

Mol Biosyst. 2015 Jan;11(1):287-96. doi: 10.1039/c4mb00419a. Epub 2014 Nov 7.

DOI:10.1039/c4mb00419a

PMID:25377664

Abstract

Statistical regularisation methods such as LASSO and related L1-regularised regression methods are commonly used to construct models of gene regulatory networks. Although they can theoretically infer the correct network structure, they have been shown in practice to make errors, i.e. to leave out existing links and include non-existing links. We show that L1 regularisation methods typically produce a poor network model when the analysed data are ill-conditioned, i.e. when the gene expression data matrix has a high condition number, even if it contains enough information for correct network inference. However, when these methods fail on informative data, i.e. data with a signal-to-noise ratio high enough that existing links can be proven to exist, the correct structure of the network model can still be obtained by using least-squares regression and setting small parameters to zero, or by using robust network inference, a recent method that takes the intersection of all non-rejectable models. Since available experimental data sets are generally ill-conditioned, we recommend checking the condition number of the data matrix to avoid this pitfall of L1-regularised inference, and also considering alternative methods.
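The two practical steps named in the abstract, checking the condition number of the data matrix and fitting by least squares before zeroing small parameters, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes a linear steady-state model of the form A Y = -P, which the abstract does not state. The names `A_true`, `P`, `Y`, the noise level, and the `threshold` value are all hypothetical choices made for this toy example.

```python
import numpy as np


def condition_number(Y):
    """2-norm condition number: largest over smallest singular value."""
    return np.linalg.cond(Y)


def least_squares_network(Y, P, threshold):
    """Estimate A from A @ Y ~ -P by least squares, then zero small entries.

    The model A @ Y = -P and the fixed threshold are assumptions made for
    illustration only; they are not taken from the abstract.
    """
    # lstsq solves Y.T @ X = -P.T in the least-squares sense, with X = A.T
    A_T, *_ = np.linalg.lstsq(Y.T, -P.T, rcond=None)
    A_hat = A_T.T.copy()
    # Set parameters with small magnitude to zero to obtain a sparse structure
    A_hat[np.abs(A_hat) < threshold] = 0.0
    return A_hat


# Hypothetical 3-gene network: nonzero entries are regulatory links
A_true = np.array([[-1.0, 0.0, 0.5],
                   [0.8, -1.0, 0.0],
                   [0.0, 0.0, -1.0]])
P = np.eye(3)                       # one perturbation per gene (assumed design)
Y = -np.linalg.solve(A_true, P)     # noise-free steady-state response

rng = np.random.default_rng(0)
Y_noisy = Y + 0.001 * rng.standard_normal(Y.shape)  # small measurement noise

kappa = condition_number(Y_noisy)
A_hat = least_squares_network(Y_noisy, P, threshold=0.1)

print("condition number of data matrix:", kappa)
print("recovered links:\n", (A_hat != 0).astype(int))
```

In this toy setting the data are well-conditioned and the noise is small, so thresholding is expected to recover the link pattern of `A_true`. With real data the threshold would need to be chosen with care, and a large `kappa` would be the warning sign the abstract describes.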
