Department of Engineering Sciences and Applied Mathematics, Northwestern University, Evanston, IL 60208, USA.
Biostatistics Division, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA.
Bioinformatics. 2018 Apr 1;34(7):1148-1156. doi: 10.1093/bioinformatics/btx748.
Inferring the structure of gene regulatory networks from high-throughput datasets remains an important and unsolved problem. Current methods are hampered by problems such as noise, low sample size, and incomplete characterizations of regulatory dynamics, leading to networks with missing and anomalous links. Integration of prior network information (e.g. from pathway databases) has the potential to improve reconstructions.
We developed a semi-supervised network reconstruction algorithm that combines information from partially known networks with time course gene expression data. We adapted partial least squares variable importance in projection (PLS-VIP) to time course data. Reference networks are used to simulate expression data, from which null distributions of VIP scores are generated; these distributions are then used to estimate edge probabilities for the input expression data. Because the reference distributions come from simulated dynamics, the approach incorporates previously known regulatory relationships and ties the network to the dynamics, yielding a semi-supervised method that can discover novel and anomalous connections. We applied the approach to data from a sleep deprivation study, with KEGG pathways treated as prior networks, and to synthetic data from several DREAM challenges. It recovered many of the true edges and identified errors in these networks, suggesting that it can derive posterior networks that accurately reflect gene expression dynamics.
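To make the VIP-based scoring more concrete, the following is a minimal Python sketch and not the authors' postPLSR implementation (which is in R). It computes VIP scores for a single-response PLS model fitted with the standard NIPALS iteration, then compares an observed score with a set of null scores. The function names `pls1_vip` and `empirical_edge_score`, the toy data, the lagged design (candidate regulators at time t predicting a target at time t+1), and the use of a simple empirical fraction as the edge score are all assumptions made for illustration; the abstract does not specify these details.

```python
import math

def center(values):
    """Subtract the mean from a list of numbers."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def pls1_vip(X, y, n_components=2):
    """VIP scores for single-response PLS (PLS1) fitted by NIPALS.

    X: list of n rows, each with p predictor values; y: list of n responses.
    Hypothetical helper for illustration only.
    """
    n, p = len(X), len(X[0])
    cols = [center([row[j] for row in X]) for j in range(p)]
    X = [[cols[j][i] for j in range(p)] for i in range(n)]
    y = center(y)
    weights, explained = [], []
    for _ in range(n_components):
        # Weight vector: direction in predictor space most covariant with y
        w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        load = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(y[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next component
        X = [[X[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        y = [y[i] - q * t[i] for i in range(n)]
        weights.append(w)
        explained.append(q * q * tt)  # response variation explained by t
    total = sum(explained)
    if total == 0:
        raise ValueError("response has no variation explained by X")
    return [
        math.sqrt(p * sum(e * w[j] ** 2 for e, w in zip(explained, weights)) / total)
        for j in range(p)
    ]

def empirical_edge_score(observed_vip, null_vips):
    """Fraction of null VIP scores strictly below the observed score
    (an assumed, simple stand-in for an edge probability estimate)."""
    return sum(v < observed_vip for v in null_vips) / len(null_vips)

# Toy lagged design (assumption): rows are time points t, columns are three
# candidate regulators; y is the target gene's expression at time t + 1.
X = [
    [0.0, 1.0, 0.3],
    [1.0, 0.8, 0.9],
    [0.5, 1.1, 0.1],
    [1.5, 0.9, 0.6],
    [1.0, 1.2, 0.8],
    [2.0, 1.0, 0.2],
]
y = [2.0 * row[0] + 0.1 * row[2] for row in X]  # target driven mainly by regulator 0

vip = pls1_vip(X, y, n_components=2)
print([round(v, 3) for v in vip])
```

In this sketch, a larger VIP score marks a regulator whose expression contributes more to predicting the target. In a workflow like the one described above, `null_vips` would presumably come from applying the same scoring to expression data simulated from the reference network, rather than from hand-picked numbers.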
R code is available at https://github.com/pn51/postPLSR.
Supplementary data are available at Bioinformatics online.