Dahinden Corinne, Parmigiani Giovanni, Emerick Mark C, Bühlmann Peter
Seminar für Statistik, ETH Zürich, CH-8092 Zürich, Switzerland.
BMC Bioinformatics. 2007 Dec 11;8:476. doi: 10.1186/1471-2105-8-476.
The joint analysis of several categorical variables is a common task in many areas of biology, and is becoming central to systems biology investigations whose goal is to identify potentially complex interactions among variables belonging to a network. Interactions of arbitrary complexity are traditionally modeled in statistics by log-linear models. It is challenging to extend these models to the high-dimensional and potentially sparse data arising in computational biology. An important example, which provides the motivation for this article, is the analysis of so-called full-length cDNA libraries of alternatively spliced genes, where we investigate relationships among the presence of various exons in transcript species.
We develop methods to perform model selection and parameter estimation in log-linear models for the analysis of sparse contingency tables, to study the interaction of two or more factors. Maximum likelihood estimation of log-linear model coefficients might not be appropriate because of the presence of zeros in the table's cells, and new methods are required. We propose a computationally efficient ℓ1-penalization approach extending the Lasso algorithm to this context, and compare it to other procedures in a simulation study. We then illustrate these algorithms on contingency tables arising from full-length cDNA libraries.
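To make the idea of ℓ1-penalized log-linear modeling concrete, the following is a minimal sketch, not the authors' implementation. It assumes binary factors with ±1 effect coding, a Poisson likelihood for the cell counts, an unpenalized intercept, and a plain proximal-gradient (ISTA) solver with backtracking. The function names (`design_matrix`, `fit_l1_loglinear`), the penalty value, and the toy counts are all hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np


def design_matrix(d):
    """Effect-coded (+1/-1) design over all 2^d cells of d binary factors.

    Columns cover the intercept, main effects and every interaction order.
    """
    cells = np.array(list(itertools.product([0, 1], repeat=d)))
    signs = 1.0 - 2.0 * cells  # level 0 -> +1, level 1 -> -1
    terms = [t for k in range(d + 1)
             for t in itertools.combinations(range(d), k)]
    cols = [signs[:, list(t)].prod(axis=1) if t else np.ones(len(cells))
            for t in terms]
    return np.column_stack(cols), terms


def soft_threshold(z, t):
    """Proximal operator of t * |.|_1 (shrinks towards zero)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def fit_l1_loglinear(counts, X, lam, n_iter=5000, tol=1e-10):
    """Minimize (Poisson negative log-likelihood) / N + lam * ||beta[1:]||_1."""
    n = np.asarray(counts, dtype=float)
    N = n.sum()
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(N / len(n))  # start from the uniform table

    def nll(b):
        eta = X @ b
        return (np.exp(eta).sum() - n @ eta) / N

    for _ in range(n_iter):
        grad = X.T @ (np.exp(X @ beta) - n) / N
        t = 1.0
        while True:  # backtracking line search on the step size
            cand = beta - t * grad
            cand[1:] = soft_threshold(cand[1:], t * lam)  # intercept unpenalized
            diff = cand - beta
            bound = nll(beta) + grad @ diff + (diff @ diff) / (2.0 * t)
            if nll(cand) <= bound + 1e-12:
                break
            t *= 0.5
        beta = cand
        if np.max(np.abs(diff)) < tol:
            break
    return beta


# Hypothetical 2x2x2 table with two empty cells; cells are ordered as
# (A, B, C) = (0,0,0), (0,0,1), (0,1,0), ..., (1,1,1).
toy_counts = np.array([40, 38, 0, 2, 1, 0, 42, 37])
X, terms = design_matrix(3)
beta = fit_l1_loglinear(toy_counts, X, lam=0.1)
selected = [t for t, b in zip(terms, beta) if t and b != 0.0]
print("selected interaction terms:", selected)
```

Because the zero cells enter only through the penalized likelihood, the fit stays finite even where unpenalized maximum likelihood estimates could diverge. Coefficients shrunk exactly to zero correspond to terms dropped from the model, which is how the penalty performs model selection and estimation in one step.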
We propose regularization methods that can be used successfully to detect complex interaction patterns among categorical variables in a broad range of biological problems.