IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):126-137. doi: 10.1109/TCBB.2020.2968442. Epub 2021 Feb 3.
Identifying the target genes of transcription factors (TFs) is crucial to understanding transcriptional regulation. However, our understanding of genome-wide TF targeting profiles is limited by the cost of large-scale experiments and the intrinsic complexity of gene regulation. Computational methods are therefore useful for predicting unobserved TF-gene associations. Here, we develop a new Weighted Imputed Neighborhood-regularized Tri-Factorization one-class collaborative filtering algorithm, WINTF. It predicts unobserved target genes for TFs using known but noisy, incomplete, and biased TF-gene associations together with protein-protein interaction networks. Our benchmark study shows that, for TF-gene association prediction, WINTF significantly outperforms both its counterpart matrix factorization-based algorithms and tri-factorization methods that do not include weighting, imputation, and neighborhood regularization. When evaluated on independent datasets, accuracy is 37.8 percent on the top 495 predicted associations, an enrichment factor of 4.19 compared with random guessing. Furthermore, many of the predicted novel associations are supported by literature evidence. Although we only use canonical TF-gene interaction data, WINTF can be applied directly to tissue-specific data when available. Thus, WINTF provides a potentially useful framework for integrating multiple omics data to further improve TF-gene prediction, and for applications to other sparse and noisy biological data. The benchmark dataset and source code are freely available at https://github.com/XieResearchGroup/WINTF.
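To make the three ingredients named above (weighting, imputation, and neighborhood regularization) concrete, the following is a minimal sketch of a weighted, imputed, graph-regularized tri-factorization on toy data. It is not the WINTF implementation. The abstract does not state the objective function, so the loss used here, the Laplacian smoothness penalty, the gradient-descent solver, and all names and hyperparameters (`fit_tri_factorization`, `w_unobs`, `p_imp`, `lam`, `alpha`, `beta`) are assumptions chosen for illustration, and the matrices are random placeholders rather than real TF-gene or protein-protein interaction data.

```python
import numpy as np


def laplacian(adj):
    """Unnormalized graph Laplacian L = D - A of a symmetric adjacency matrix."""
    return np.diag(adj.sum(axis=1)) - adj


def loss(R_imp, W, F, S, G, L_tf, L_gene, lam, alpha, beta):
    """Assumed objective: weighted fit + ridge penalty + graph smoothness."""
    E = R_imp - F @ S @ G.T
    fit = np.sum(W * E ** 2)
    ridge = lam * (np.sum(F ** 2) + np.sum(S ** 2) + np.sum(G ** 2))
    smooth = alpha * np.trace(F.T @ L_tf @ F) + beta * np.trace(G.T @ L_gene @ G)
    return fit + ridge + smooth


def fit_tri_factorization(R, A_tf, A_gene, rank=3, w_unobs=0.2, p_imp=0.05,
                          lam=0.1, alpha=0.1, beta=0.1, lr=0.01, n_iter=300,
                          seed=0):
    """Approximate R by F @ S @ G.T with plain gradient descent (hypothetical solver)."""
    rng = np.random.default_rng(seed)
    n_tf, n_gene = R.shape
    observed = R > 0
    # One-class setting: observed pairs are trusted (weight 1, value 1);
    # unobserved pairs get a small imputed value and a lower weight.
    W = np.where(observed, 1.0, w_unobs)
    R_imp = np.where(observed, 1.0, p_imp)
    L_tf, L_gene = laplacian(A_tf), laplacian(A_gene)
    F = 0.1 * rng.standard_normal((n_tf, rank))
    S = 0.1 * rng.standard_normal((rank, rank))
    G = 0.1 * rng.standard_normal((n_gene, rank))
    reg = (L_tf, L_gene, lam, alpha, beta)
    history = [loss(R_imp, W, F, S, G, *reg)]
    for _ in range(n_iter):
        WE = W * (R_imp - F @ S @ G.T)  # weighted residual
        gF = -2 * WE @ G @ S.T + 2 * lam * F + 2 * alpha * L_tf @ F
        gS = -2 * F.T @ WE @ G + 2 * lam * S
        gG = -2 * WE.T @ F @ S + 2 * lam * G + 2 * beta * L_gene @ G
        F_new, S_new, G_new = F - lr * gF, S - lr * gS, G - lr * gG
        new = loss(R_imp, W, F_new, S_new, G_new, *reg)
        if new <= history[-1]:
            F, S, G = F_new, S_new, G_new
            history.append(new)
        else:
            lr *= 0.5  # reject the step and retry with a smaller one
    return F @ S @ G.T, history


def random_graph(n, rng, density=0.3):
    """Toy symmetric 0/1 adjacency matrix with an empty diagonal."""
    upper = np.triu((rng.random((n, n)) < density).astype(float), k=1)
    return upper + upper.T


# Toy inputs (random placeholders, not real data).
toy_rng = np.random.default_rng(1)
R = (toy_rng.random((6, 8)) < 0.25).astype(float)  # 6 TFs x 8 genes
A_tf = random_graph(6, toy_rng)                    # TF-TF neighborhood
A_gene = random_graph(8, toy_rng)                  # gene-gene neighborhood

scores, history = fit_tri_factorization(R, A_tf, A_gene)

# Rank unobserved TF-gene pairs by predicted score, highest first.
candidates = np.argwhere(R == 0)
order = np.argsort(-scores[R == 0])
top_pairs = candidates[order][:5]
```

In this sketch the lower weight on unobserved entries reflects the one-class assumption that a missing association is not necessarily a true negative, and the Laplacian terms pull the latent vectors of neighboring TFs (or genes) toward each other. Other solvers, such as multiplicative or alternating updates, would be equally plausible choices.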
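The reported figures (37.8 percent accuracy on the top 495 predictions, enrichment factor 4.19) can be unpacked with a little arithmetic. The sketch below assumes the common definition of enrichment factor as top-k precision divided by the precision expected from random guessing; the abstract does not spell out the definition. The baseline rate and hit count it derives are implied values, not numbers reported in the source.

```python
def enrichment_factor(top_k_precision, random_precision):
    """Assumed definition: precision in the top-k list over the random baseline."""
    return top_k_precision / random_precision


precision = 0.378  # reported accuracy on the top predictions
reported_ef = 4.19  # reported enrichment factor
top_k = 495         # reported number of top predictions

# Values implied by the reported numbers under the assumed definition.
implied_baseline = precision / reported_ef  # about 0.09, i.e. roughly 9 percent
implied_hits = round(precision * top_k)     # about 187 of the 495 predictions

print(f"implied random-guess precision: {implied_baseline:.4f}")
print(f"implied supported predictions:  {implied_hits} of {top_k}")
```

Under this reading, random guessing would be expected to recover roughly 9 percent of associations in the independent data, against 37.8 percent for the top-ranked predictions.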