Zhai Jingjing, Tang Yunjia, Yuan Hao, Wang Longteng, Shang Haoli, Ma Chuang
State Key Laboratory of Crop Stress Biology for Arid Areas, College of Life Sciences, Northwest A&F University, Yangling, China.
Front Plant Sci. 2016 Dec 15;7:1914. doi: 10.3389/fpls.2016.01914. eCollection 2016.
The identification of genes associated with a given biological function in plants remains a challenge, although network-based gene prioritization algorithms have been developed for Arabidopsis and many non-model plant species. These algorithms nevertheless suffer from several problems, in particular unsatisfactory prediction accuracy due to limited network coverage, varying link quality, and/or uncertain network connectivity. Thus, a model that integrates complementary biological data may be expected to increase the prediction accuracy of gene prioritization. Toward this goal, we developed a novel gene prioritization method named RafSee, which ranks candidate genes using a random forest algorithm that integrates sequence, evolutionary, and epigenetic features of plants. Subsequently, we proposed an integrative approach named RAP (Rank Aggregation-based data fusion for gene Prioritization), in which an order statistics-based meta-analysis is used to aggregate the ranks from the network-based gene prioritization method and RafSee, in order to accurately prioritize candidate genes involved in a pre-specified biological function. Finally, we showcased the utility of RAP by prioritizing 380 flowering-time genes in Arabidopsis. The "leave-one-out" cross-validation experiment showed that RafSee could work as a complement to a current state-of-the-art network-based gene prioritization system (AraNet v2). Moreover, RAP ranked 53.68% (204/380) of the flowering-time genes higher than AraNet v2 did, resulting in a 39.46% improvement in terms of the first quartile rank. Further evaluations also showed that RAP was effective in prioritizing genes related to different abiotic stresses. To enhance the usability of RAP for Arabidopsis and non-model plant species, an R package implementing the method is freely available at http://bioinfo.nwafu.edu.cn/software.
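The rank-aggregation step can be illustrated with a minimal Python sketch. The abstract states only that RAP uses an "order statistics-based meta-analysis", so the formula here is an assumption: the joint cumulative order-statistic (often called the Q statistic), computed with a standard recursion. The function names `q_statistic` and `aggregate_ranks`, the gene labels, and the toy ranks are hypothetical and are not taken from the RAP R package.

```python
from math import factorial


def q_statistic(rank_ratios):
    """Joint order-statistic score for one gene.

    rank_ratios: the gene's rank in each list divided by the list length
    (values in (0, 1]). Returns the probability that uniform random rank
    ratios would be jointly at least this good; smaller means better.
    Assumed formula, not confirmed by the abstract.
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0]  # V_0 = 1
    for k in range(1, n + 1):
        x = r[n - k]  # r_{N-k+1} in 1-based notation
        v_k = sum(
            (-1) ** (i - 1) * v[k - i] / factorial(i) * x ** i
            for i in range(1, k + 1)
        )
        v.append(v_k)
    return factorial(n) * v[n]


def aggregate_ranks(rankings):
    """Combine several rankings of the same genes into one ordering.

    rankings: list of dicts mapping gene -> rank (1 = top candidate).
    Returns the genes sorted from best to worst aggregated score.
    """
    genes = list(rankings[0])
    n = len(genes)
    scores = {g: q_statistic([rk[g] / n for rk in rankings]) for g in genes}
    return sorted(genes, key=scores.get)


# Toy input: one network-based ranking and one feature-based ranking.
network_ranks = {"geneA": 1, "geneB": 2, "geneC": 3, "geneD": 4}
feature_ranks = {"geneA": 3, "geneB": 1, "geneC": 4, "geneD": 2}
combined = aggregate_ranks([network_ranks, feature_ranks])
print(combined)
```

For two rankings the score reduces to 2·r1·r2 − r1², where r1 ≤ r2 are the two rank ratios, so a gene placed near the top by both methods scores better than one placed first by only one of them.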