Center for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France.
Institut Curie, 75248 Paris, France.
Int J Mol Sci. 2021 May 12;22(10):5118. doi: 10.3390/ijms22105118.
Identifying the protein targets of hit molecules is essential in the drug discovery process. Target prediction with machine learning algorithms can accelerate this search and limit the number of experiments required. However, the drug-target interaction databases used for training are strongly statistically biased, which leads to a high number of false positives and thus increases the time and cost of experimental validation campaigns. To minimize the number of false positives among predicted targets, we propose a new scheme for choosing negative examples, such that each protein and each drug appears an equal number of times in positive and negative examples. We artificially reproduce the process of target identification for three specific drugs and, more globally, for 200 approved drugs. For both the three detailed drug examples and the larger set of 200 drugs, training with the proposed scheme for choosing negative examples improved target prediction results: the average number of false positives among the top-ranked predicted targets decreased and, overall, the rank of the true targets improved. Our method corrects the statistical bias of the databases and reduces the number of false positive predictions, and therefore the number of useless experiments potentially undertaken.
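The balancing idea behind the choice of negative examples can be illustrated with a short sketch. The code below is not the authors' implementation: it is a simple greedy heuristic, written under the assumption that interactions are given as (drug, protein) pairs and that any pair absent from the positive set may serve as a negative. The function name `sample_balanced_negatives` and its parameters are hypothetical. The heuristic gives each drug and each protein a quota equal to its number of positive pairs and never exceeds that quota, but it does not guarantee that every quota is filled exactly.

```python
import random
from collections import Counter


def sample_balanced_negatives(positives, seed=0):
    """Greedy sketch: pick negative (drug, protein) pairs so that no drug or
    protein appears more often in negatives than it does in positives.

    Hypothetical helper for illustration only; exact balance is attempted,
    not guaranteed.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    ordered = sorted(positive_set)  # fixed order, so results depend only on the seed

    # Quota = number of positive pairs each drug / protein takes part in.
    drug_quota = Counter(drug for drug, _ in ordered)
    protein_quota = Counter(protein for _, protein in ordered)

    negatives = []
    # Handle the most frequent proteins first: they are the hardest to balance.
    for protein, needed in protein_quota.most_common():
        # Drugs that still have quota left and are not known to bind this protein.
        candidates = [
            drug
            for drug, left in drug_quota.items()
            if left > 0 and (drug, protein) not in positive_set
        ]
        rng.shuffle(candidates)  # random tie-breaking
        # Stable sort: drugs with the largest remaining quota come first.
        candidates.sort(key=lambda drug: drug_quota[drug], reverse=True)
        for drug in candidates[:needed]:
            negatives.append((drug, protein))
            drug_quota[drug] -= 1
    return negatives


if __name__ == "__main__":
    toy_positives = [("d1", "p1"), ("d2", "p2"), ("d3", "p3"), ("d4", "p4"), ("d1", "p2")]
    for pair in sample_balanced_negatives(toy_positives, seed=1):
        print(pair)
```

Processing the most frequent proteins first is a design choice of this sketch: in a biased database, heavily studied proteins have the largest quotas, so they are matched while the widest pool of candidate drugs is still available.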