Genetics Institute, University College London, Darwin Building, Gower Street, WC1E 6BT London, UK.
Artif Intell Med. 2013 Mar;57(3):207-17. doi: 10.1016/j.artmed.2012.12.006. Epub 2013 Feb 8.
Modelling associations in high-throughput experimental molecular data has provided unprecedented insights into biological pathways and signalling mechanisms. Graphical models and networks have proven to be especially useful abstractions in this regard. However, ad hoc thresholds are often used in conjunction with structure learning algorithms to decide which associations are significant. The present study addresses this limitation by proposing a statistically motivated approach for identifying significant associations in a network.
A new method is proposed that identifies significant associations in graphical models by estimating the threshold that minimises the L1 norm between the cumulative distribution function (CDF) of the observed edge confidences and that of its asymptotic counterpart. The effectiveness of the proposed method is demonstrated on popular synthetic data sets as well as on publicly available experimental molecular data corresponding to gene and protein expression profiles.
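The thresholding idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the asymptotic CDF, so the sketch assumes that in the limit every edge confidence is either 0 or 1, which gives a step CDF with a plateau of height t on [0, 1). It further assumes that the threshold is the empirical quantile at the estimated t and that edges strictly above it are significant. The function name `estimate_threshold` and the input values are hypothetical.

```python
import numpy as np


def estimate_threshold(confidences):
    """Return (t_hat, threshold) for edge confidences in [0, 1].

    Assumption: the asymptotic CDF equals t on [0, 1) and 1 at x = 1.
    """
    p = np.sort(np.asarray(confidences, dtype=float))
    k = p.size

    # The empirical CDF is piecewise constant: on the j-th interval between
    # consecutive breakpoints (0, sorted confidences, 1) it equals j / k.
    breaks = np.concatenate(([0.0], p, [1.0]))
    widths = np.diff(breaks)
    levels = np.arange(k + 1) / k

    # L1 distance between the empirical CDF and the plateau t is a weighted
    # sum of absolute deviations, so a minimiser lies among the levels.
    l1 = np.abs(levels[:, None] - levels[None, :]) @ widths
    m = int(np.argmin(l1))
    t_hat = levels[m]

    # Empirical quantile at t_hat = m / k: the m-th smallest confidence
    # (taken as 0 when t_hat is 0).
    threshold = p[m - 1] if m > 0 else 0.0
    return t_hat, threshold


# Illustrative confidences for six candidate edges.
conf = np.array([0.0460, 0.2242, 0.3921, 0.7689, 0.8935, 0.9439])
t_hat, threshold = estimate_threshold(conf)
significant = conf > threshold  # assumption: strict inequality
print(t_hat, threshold, int(significant.sum()))
```

Searching only over the levels of the empirical CDF avoids a grid over t: the objective is a weighted sum of absolute deviations, whose minimum is always attained at one of the values being compared.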
The improved performance of the proposed approach is demonstrated on the synthetic data sets using sensitivity, specificity and accuracy as performance metrics. The results hold across varying sample sizes and across three structure learning algorithms with widely varying assumptions. In all cases, the proposed approach has specificity and accuracy close to 1, while sensitivity increases linearly in the logarithm of the sample size. The estimated threshold systematically outperforms common ad hoc thresholds in terms of sensitivity while maintaining comparable levels of specificity and accuracy. Networks reconstructed from the experimental data sets agree closely with the results reported in the original papers.
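For reference, the three metrics can be computed from a known network and a reconstructed one as sketched below. This assumes undirected edges scored against every possible node pair, and that the true network contains at least one edge and omits at least one. The helper name `edge_metrics` and the toy network are hypothetical and are not data from the study.

```python
from itertools import combinations


def edge_metrics(nodes, true_edges, predicted_edges):
    """Sensitivity, specificity and accuracy over all undirected node pairs."""
    candidates = {frozenset(pair) for pair in combinations(nodes, 2)}
    true = {frozenset(e) for e in true_edges}
    pred = {frozenset(e) for e in predicted_edges}

    tp = len(true & pred)               # true edges that were recovered
    fp = len(pred - true)               # reported edges that do not exist
    fn = len(true - pred)               # true edges that were missed
    tn = len(candidates - true - pred)  # absent edges correctly left out

    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(candidates),
    }


# Toy example: four nodes, three true edges, three predicted edges.
metrics = edge_metrics(
    nodes=["A", "B", "C", "D"],
    true_edges=[("A", "B"), ("B", "C"), ("C", "D")],
    predicted_edges=[("A", "B"), ("B", "C"), ("A", "D")],
)
print(metrics)
```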
Current studies use structure learning algorithms in conjunction with ad hoc thresholds to identify significant associations in graphical abstractions of biological pathways and signalling mechanisms. Such an ad hoc choice can have a pronounced effect on the biological significance attributed to the associations in the resulting network and on any downstream analysis. The statistically motivated approach presented in this study has been shown to outperform ad hoc thresholds and is expected to reduce spurious conclusions of significant associations in such graphical abstractions.