Bhadra Sahely, Bhattacharyya Chiranjib, Chandra Nagasuma R, Mian I Saira
Department of Computer Science and Automation, Indian Institute of Science, Bangalore, Karnataka, India.
Algorithms Mol Biol. 2009 Feb 24;4:5. doi: 10.1186/1748-7188-4-5.
A genetic network can be represented as a directed graph in which a node corresponds to a gene and a directed edge specifies the direction of influence of one gene on another. The reconstruction of such networks from transcript profiling data remains an important yet challenging endeavor. A transcript profile specifies the abundances of many genes in a biological sample of interest. Prevailing strategies for learning the structure of a genetic network from high-dimensional transcript profiling data assume sparsity and linearity. Many methods consider relatively small directed graphs, inferring graphs with up to a few hundred nodes. This work examines large undirected graph representations of genetic networks, that is, graphs with many thousands of nodes in which an undirected edge between two nodes does not indicate the direction of influence. It addresses the problem of estimating the structure of such a sparse linear genetic network (SLGN) from transcript profiling data.
The structure learning task is cast as a sparse linear regression problem, which is posed as a LASSO (l1-constrained fitting) problem and finally solved by formulating a Linear Program (LP). A bound on the Generalization Error of this approach is given in terms of the Leave-One-Out Error. The accuracy and utility of LP-SLGNs are assessed quantitatively and qualitatively using simulated and real data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) initiative provides gold standard data sets and evaluation metrics that enable and facilitate the comparison of algorithms for deducing the structure of networks. The structures of LP-SLGNs estimated from the INSILICO1, INSILICO2 and INSILICO3 simulated DREAM2 data sets are comparable to those proposed by the first and/or second ranked teams in the DREAM2 competition. The structures of LP-SLGNs estimated from two published Saccharomyces cerevisiae cell cycle transcript profiling data sets capture known regulatory associations. In each S. cerevisiae LP-SLGN, the number of nodes with a particular degree follows an approximate power law, suggesting that its degree distribution is similar to that observed in real-world networks. Inspection of these LP-SLGNs suggests biological hypotheses amenable to experimental verification.
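The formulation described above can be illustrated with a minimal sketch. It assumes an absolute-error loss with an l1 penalty on the weights, a choice that makes the whole problem linear; the abstract does not spell out the exact objective, so this should be read as one plausible instance of an LP-based sparse regression rather than the authors' formulation. The function names, the trade-off parameter `C`, the threshold `tol`, and the rule of keeping an undirected edge whenever either gene receives a nonzero weight from the other are all hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def l1_regression_lp(X, y, C=1.0):
    """Sparse linear regression as an LP (assumed objective, not the paper's).

    Minimizes ||w||_1 + C * sum_i |y_i - x_i . w| by writing w = u - v with
    u, v >= 0 and bounding each absolute residual by a slack xi_i >= 0.
    """
    n, p = X.shape
    # Decision vector z = [u (p), v (p), xi (n)], all non-negative.
    cost = np.concatenate([np.ones(2 * p), C * np.ones(n)])
    eye = np.eye(n)
    #  X(u - v) - xi <= y   and   -X(u - v) - xi <= -y
    A_ub = np.block([[X, -X, -eye], [-X, X, -eye]])
    b_ub = np.concatenate([y, -y])
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    z = res.x
    return z[:p] - z[p:2 * p]


def estimate_undirected_edges(E, C=1.0, tol=1e-6):
    """Regress each gene on all others; return a symmetric boolean adjacency.

    E has shape (samples, genes). An edge i-j is kept if either regression
    assigns a weight larger than tol in magnitude (hypothetical rule).
    """
    n_samples, n_genes = E.shape
    W = np.zeros((n_genes, n_genes))
    for j in range(n_genes):
        others = [k for k in range(n_genes) if k != j]
        W[j, others] = l1_regression_lp(E[:, others], E[:, j], C)
    nonzero = np.abs(W) > tol
    return nonzero | nonzero.T


if __name__ == "__main__":
    # Toy profile matrix: gene 1 is three times gene 0, gene 2 is unrelated.
    E = np.array([[1.0, 3.0, 0.0],
                  [2.0, 6.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, -1.0]])
    print(estimate_undirected_edges(E).astype(int))
```

Each gene is handled by an independent LP, so the per-gene problems could be solved in parallel; for graphs with many thousands of nodes a sparse constraint matrix would be preferable to the dense `np.block` used here for readability.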
A statistically robust and computationally efficient LP-based method for estimating the topology of a large sparse undirected graph from high-dimensional data yields representations of genetic networks that are biologically plausible and useful abstractions of the structures of real genetic networks. Analysis of the statistical and topological properties of learned LP-SLGNs may have practical value; for example, genes with high random walk betweenness, a measure of the centrality of a node in a graph, are good candidates for intervention studies and hence for integrated computational-experimental investigations designed to infer more realistic and sophisticated probabilistic directed graphical model representations of genetic networks. The LP-based solutions of the sparse linear regression problem described here may provide a method for learning the structure of transcription factor networks from transcript profiling and transcription factor binding motif data.
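One of the topological properties mentioned, the degree distribution, can be inspected with a short sketch like the one below. It assumes the learned network is available as a symmetric boolean adjacency matrix with a zero diagonal, and it uses an ordinary least-squares fit on log-log axes as a rough diagnostic of power-law behaviour; this is a simple illustrative choice, not necessarily the procedure used in the paper, and both function names are hypothetical.

```python
import numpy as np


def degree_counts(adjacency):
    """Return (degrees, node_counts) for every degree >= 1 present in the graph.

    Assumes a symmetric adjacency matrix with no self-loops.
    """
    A = np.asarray(adjacency, dtype=bool)
    degrees = A.sum(axis=1)          # degree of each node
    counts = np.bincount(degrees)    # counts[k] = number of nodes with degree k
    ks = np.nonzero(counts)[0]
    ks = ks[ks > 0]                  # isolated nodes cannot be placed on a log axis
    return ks, counts[ks]


def fit_power_law_exponent(ks, counts):
    """Slope of log(count) against log(degree); a crude power-law diagnostic."""
    slope, _intercept = np.polyfit(np.log(ks), np.log(counts), 1)
    return slope


if __name__ == "__main__":
    # Star graph on four nodes: one hub of degree 3 and three leaves of degree 1.
    star = np.zeros((4, 4), dtype=bool)
    star[0, 1:] = True
    star[1:, 0] = True
    ks, counts = degree_counts(star)
    print(ks, counts, fit_power_law_exponent(ks, counts))
```

A straight-line fit on log-log axes is sensitive to binning and to the sparse tail of the distribution, so a slope obtained this way is best treated as a qualitative indication only.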