Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham B15 2TT, UK.
Institute of Translational Medicine, University of Birmingham, Birmingham B15 2TT, UK.
Int J Mol Sci. 2020 Oct 23;21(21):7886. doi: 10.3390/ijms21217886.
Inferring the topology of a gene regulatory network (GRN) from gene expression data is a challenging but important step towards a better understanding of gene regulation. Key challenges include noisy data and datasets that contain more genes than samples. Although a number of methods have been proposed to infer the structure of a GRN, the inference algorithms they adopt differ widely, which makes a meaningful comparison difficult. In this study, we used two methods, MIDER (Mutual Information Distance and Entropy Reduction) and PLSNET (partial least squares-based feature selection), to infer the structure of a GRN directly from data, and we validated our results computationally. Both methods were applied to gene expression datasets from inflammatory bowel disease (IBD), pancreatic ductal adenocarcinoma (PDAC), and acute myeloid leukaemia (AML) studies. In each case, candidate gene regulators were identified. For example, in the IBD dataset, the family genes were identified as key regulators, while in the PDAC dataset, the and genes were identified. We further demonstrate that an ensemble-based approach that combines the output of the MIDER and PLSNET algorithms can infer the structure of a GRN from data with higher accuracy. We also estimated the number of samples required for potential future validation studies. Here, we present an analysis framework that provides not only predictions of candidate regulator genes for potential validation experiments but also an estimate of the number of samples required for those experiments.
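The abstract does not say how the MIDER and PLSNET outputs are combined, so the following is only a minimal sketch of one common way to build such an ensemble: averaging normalised edge ranks. It assumes each method returns a square matrix of edge scores (regulator in the row, target in the column, higher meaning more confident). The function names `rank_scores` and `ensemble_edges`, and the toy score matrices, are hypothetical and are not taken from the study.

```python
import numpy as np


def rank_scores(scores):
    """Map off-diagonal edge scores to normalised ranks in (0, 1].

    Assumption: a higher score means a more confident edge.
    Self-edges (the diagonal) are ignored and left at 0.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    mask = ~np.eye(n, dtype=bool)          # True for every off-diagonal entry
    flat = scores[mask]
    # argsort of argsort gives each entry's 0-based position in sorted order
    ranks = flat.argsort().argsort().astype(float) + 1.0
    out = np.zeros_like(scores, dtype=float)
    out[mask] = ranks / ranks.size
    return out


def ensemble_edges(scores_a, scores_b, top_k=3):
    """Average the normalised ranks of two methods and return the top_k edges.

    This is an assumed combination rule (rank averaging), not necessarily
    the rule used in the study.
    """
    combined = (rank_scores(scores_a) + rank_scores(scores_b)) / 2.0
    n = combined.shape[0]
    pairs = [(i, j, combined[i, j]) for i in range(n) for j in range(n) if i != j]
    pairs.sort(key=lambda t: t[2], reverse=True)
    return pairs[:top_k]


# Toy scores for three genes; these numbers are invented for illustration.
mider_like = np.array([[0.0, 0.9, 0.1],
                       [0.2, 0.0, 0.4],
                       [0.3, 0.05, 0.0]])
plsnet_like = np.array([[0.0, 5.0, 1.0],
                        [0.5, 0.0, 2.0],
                        [3.0, 0.1, 0.0]])

edges = ensemble_edges(mider_like, plsnet_like, top_k=3)
for regulator, target, score in edges:
    print(f"gene {regulator} -> gene {target}: combined rank {score:.2f}")
```

Working with ranks rather than raw scores avoids having to put a mutual-information-based score and a regression-based score on a common scale, which is the main reason rank averaging is a frequent default for this kind of ensemble.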