Xie Bingqing, Agam Gady, Balasubramanian Sandhya, Xu Jinbo, Gilliam T Conrad, Maltsev Natalia, Börnigen Daniela
1 Department of Computer Science, Illinois Institute of Technology, Chicago, Illinois.
J Comput Biol. 2015 Apr;22(4):313-23. doi: 10.1089/cmb.2015.0001.
Identifying high-confidence candidate genes that are causative for disease phenotypes from the large lists of variations produced by high-throughput genomics can be both time-consuming and costly. Novel computational approaches that use existing biological knowledge to prioritize such candidate genes can improve the efficiency and accuracy of biomedical data analysis. They can also reduce the cost of such studies by avoiding experimental validation of irrelevant candidates. In this study, we address this challenge by proposing a novel gene prioritization approach that ranks candidate genes likely to be involved in a disease or phenotype under study. The algorithm is based on a modified conditional random field (CRF) model that simultaneously makes use of both gene annotations and gene interactions while preserving their original representation. We validated our approach on two independent disease benchmark studies by ranking candidate genes using network and feature information. Our results showed both a high area under the curve (AUC) value (0.86) and, more importantly, a high partial AUC (pAUC) value (0.1296), and revealed higher accuracy and precision at the top predictions compared with other well-performing gene prioritization tools, such as Endeavour (AUC 0.82, pAUC 0.083) and PINTA (AUC 0.76, pAUC 0.066). We detected more target genes (9/18/19/27) within the top 1/5/10/20 positions than Endeavour (3/11/14/23) and PINTA (6/10/13/18). To demonstrate its usability, we applied our method to a case study predicting molecular mechanisms that contribute to intellectual disability and autism. Our approach correctly recovered genes related to both disorders and suggested possible additional candidates based on their rankings and functional annotations.
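The three evaluation measures quoted above (AUC, pAUC, and the number of target genes recovered within the top k positions) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names, the toy scores and labels, and the false-positive-rate cutoff of 0.5 used for the partial AUC are all assumptions made for illustration, since the abstract does not state which cutoff was used. The sketch also assumes that all scores are distinct (no ties) and that the pAUC is reported unnormalized.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return ROC points (FPR, TPR) obtained by walking down the ranked list.

    Assumes distinct scores; labels are 1 for known target genes, 0 otherwise.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def partial_auc(points: Sequence[Tuple[float, float]], max_fpr: float = 1.0) -> float:
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr].

    With max_fpr = 1.0 this is the ordinary AUC. The result is not normalized.
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the last segment at the cutoff by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def top_k_hits(scores: Sequence[float], labels: Sequence[int], k: int) -> int:
    """Count the target genes ranked within the top k positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sum(labels[i] for i in order[:k])


# Toy data (hypothetical): six candidate genes, three of them true targets.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 0, 1, 0, 0, 1]

toy_points = roc_points(toy_scores, toy_labels)
toy_auc = partial_auc(toy_points)                 # full AUC, 5/9
toy_pauc = partial_auc(toy_points, max_fpr=0.5)   # partial AUC, 2/9
toy_hits = [top_k_hits(toy_scores, toy_labels, k) for k in (1, 3, 5)]
```

On the toy data the full AUC is 5/9 and the partial AUC up to an FPR of 0.5 is 2/9. The pAUC rewards methods that place true targets near the top of the list, which is why the abstract treats it as the more important measure for prioritization.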