Center of Medical Genetics, University of Antwerp and Antwerp University Hospital, Antwerp, Belgium.
Biomedical Informatics Research Network Antwerp (biomina), University of Antwerp, Antwerp, Belgium.
Bioinformatics. 2018 Jul 1;34(13):2254-2262. doi: 10.1093/bioinformatics/bty079.
Computational gene prioritization can aid in disease gene identification. Here, we propose pBRIT (prioritization using Bayesian Ridge regression and Information Theoretic model), a novel adaptive and scalable prioritization tool that integrates PubMed abstracts, Gene Ontology, sequence similarities, Mammalian and Human Phenotype Ontology, pathways, interactions, Disease Ontology, the Gene Association database and the Human Genome Epidemiology database into the prediction model. We explore and address the effects of sparsity and inter-feature dependencies within annotation sources, and the impact of bias towards specific annotations.
pBRIT models feature dependencies and sparsity with an information-theoretic (data-driven) approach and applies data fusion based on intermediate integration. Following the hypothesis that genes underlying similar diseases share functional and phenotype characteristics, it uses Bayesian Ridge regression to learn a linear mapping between functional and phenotype annotations. Genes are prioritized by their phenotypic concordance with the training genes. We evaluated pBRIT against nine existing methods, and on over 2000 HPO-gene associations retrieved after construction of the pBRIT data sources. We achieve maximum AUC scores ranging from 0.92 to 0.96 on benchmark datasets and of 0.80 on the time-stamped HPO entries, indicating good performance with high sensitivity and specificity. Our model shows stable performance with regard to changes in the underlying annotation data, and is fast and scalable enough for implementation in routine pipelines.
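The regression step described above can be illustrated with a minimal sketch. It is not the pBRIT implementation: it assumes a single scalar phenotype score per training gene (the tool maps between full annotation matrices), fixed precision hyperparameters `alpha` and `lam` rather than learned ones, and hypothetical gene names and feature vectors. It only shows how a Bayesian ridge posterior mean over weights can be used to score and rank candidate genes.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def bayesian_ridge_mean(X, y, alpha=1.0, lam=1.0):
    """Posterior mean of w for y ~ N(Xw, 1/alpha) with prior w ~ N(0, I/lam).

    Closed form: (lam*I + alpha*X^T X) w = alpha*X^T y.
    alpha and lam are fixed here (an assumption); a full Bayesian ridge
    model would estimate them from the data.
    """
    d = len(X[0])
    A = [[alpha * sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [alpha * sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(A, b)


def rank_candidates(X_train, y_train, candidates):
    """Rank candidate genes by predicted phenotype score, highest first."""
    w = bayesian_ridge_mean(X_train, y_train)
    scores = {g: sum(wi * fi for wi, fi in zip(w, f)) for g, f in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: two functional features per gene.
X_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # functional annotations of training genes
y_train = [1.0, 0.0, 1.0]                       # phenotype concordance scores (illustrative)
candidates = {"GENE_A": [1.0, 0.0], "GENE_B": [0.0, 1.0], "GENE_C": [1.0, 1.0]}

weights = bayesian_ridge_mean(X_train, y_train)
ranking = rank_candidates(X_train, y_train, candidates)
print(ranking)
```

In this toy setting the prior precision `lam` acts as the ridge penalty, which is what keeps the solution stable when annotation features are sparse or correlated.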
http://biomina.be/apps/pbrit/; https://bitbucket.org/medgenua/pbrit.
Supplementary data are available at Bioinformatics online.