B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, CIBER-BBN, Barcelona, Spain.
Networking Biomedical Research Centre in the subject area of Bioengineering, Biomaterials and Nanomedicine (CIBER-BBN), Madrid, Spain.
PLoS Comput Biol. 2019 Sep 3;15(9):e1007276. doi: 10.1371/journal.pcbi.1007276. eCollection 2019 Sep.
In-silico identification of potential target genes for disease is an essential aspect of drug target discovery. Recent studies suggest that successful targets can be found by leveraging genetic, genomic and protein interaction information. Here, we systematically tested the ability of 12 varied algorithms, based on network propagation, to identify genes that have been targeted by any drug, on gene-disease data from 22 common non-cancerous diseases in OpenTargets. We considered two biological networks and six performance metrics, and compared two types of input gene-disease association scores. The impact of the design factors on performance was quantified through additive explanatory models. Standard cross-validation led to over-optimistic performance estimates due to the presence of protein complexes. In order to obtain realistic estimates, we introduced two novel protein complex-aware cross-validation schemes. When seeding biological networks with known drug targets, machine learning and diffusion-based methods found around 2-4 true targets within the top 20 suggestions. Seeding the networks with genes associated with disease by genetics decreased performance to below 1 true hit on average. The use of a larger network, although noisier, improved overall performance. We conclude that diffusion-based prioritisers and machine learning applied to diffusion-based features are suited for drug discovery in practice and improve over simpler neighbour-voting methods. We also demonstrate the large impact of choosing an adequate validation strategy and of the definition of seed disease genes.
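The reason protein complexes inflate standard cross-validation estimates can be illustrated with a small sketch. If two members of one complex land in different folds, the held-out member sits right next to a training seed in the network and is trivially recovered. One plausible remedy, shown below, is to assign whole complexes to folds. This is only an illustration of the general idea: the abstract does not detail the two schemes introduced in the paper, and they may differ. The function name `complex_aware_folds`, the `complex_of` mapping and the toy gene names are all hypothetical.

```python
import random
from collections import defaultdict


def complex_aware_folds(genes, complex_of, k=3, seed=0):
    """Assign genes to k folds so that members of one complex share a fold.

    genes: list of gene identifiers.
    complex_of: dict mapping gene -> complex identifier (hypothetical input);
        genes missing from the dict are treated as their own single-gene unit.
    Returns a dict mapping gene -> fold index in range(k).
    """
    # Group genes into indivisible units: one unit per complex,
    # plus one unit per gene that belongs to no complex.
    groups = defaultdict(list)
    for gene in genes:
        groups[complex_of.get(gene, ("singleton", gene))].append(gene)

    # Shuffle for randomness, then place larger units first.
    # The sort is stable, so ties keep their shuffled order.
    units = list(groups.values())
    random.Random(seed).shuffle(units)
    units.sort(key=len, reverse=True)

    # Greedy balancing: each unit goes to the currently smallest fold.
    fold_sizes = [0] * k
    fold_of = {}
    for unit in units:
        target = fold_sizes.index(min(fold_sizes))
        for gene in unit:
            fold_of[gene] = target
        fold_sizes[target] += len(unit)
    return fold_of


# Toy example (made-up genes and complexes, not data from the study).
genes = ["A1", "A2", "A3", "B1", "B2", "C", "D", "E", "F"]
complex_of = {"A1": "cplxA", "A2": "cplxA", "A3": "cplxA",
              "B1": "cplxB", "B2": "cplxB"}
folds = complex_aware_folds(genes, complex_of, k=3)
```

In this sketch, performance would then be estimated by hiding one fold at a time, propagating from the remaining seeds, and counting how many hidden targets appear among the top-ranked genes. Because no complex straddles the train/test boundary, a method cannot score well merely by ranking the complex partners of its seeds.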