Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN.
Center for Structural Biology, Center for Cancer Research, National Cancer Institute, Frederick, MD.
Brief Bioinform. 2022 Jul 18;23(4). doi: 10.1093/bib/bbac199.
Internal validation is the most popular evaluation strategy for drug-target predictive models. Simple random shuffling in cross-validation, however, is not always well suited to large, diverse and copious datasets, because it can introduce bias. As a result, these predictive models cannot be comprehensively evaluated to provide insight into their general performance across a variety of use cases (e.g. permutations of different levels of connectivity and categories in drug and target space, as well as validations based on different data sources). In this work, we introduce a benchmark, BETA, that aims to address this gap by (i) providing an extensive multipartite network consisting of 0.97 million biomedical concepts and 8.5 million associations, in addition to 62 million drug-drug and protein-protein similarities, and (ii) presenting evaluation strategies that reflect seven cases (i.e. general, screening with different connectivity, target and drug screening based on categories, searching for specific drugs and targets, and drug repurposing for specific diseases), organized as seven Tests (344 Tasks in total) across multiple sampling and validation strategies. Six state-of-the-art methods covering two broad input data types (chemical structure- and gene sequence-based, and network-based) were tested across all the developed Tasks. The best- and worst-performing cases were analyzed to demonstrate the ability of the proposed benchmark to identify limitations of the tested methods on the benchmark Tasks. The results highlight BETA as a benchmark for selecting computational strategies for drug repurposing and target discovery.
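To make the concern about random shuffling concrete, the following is a minimal sketch, not BETA's actual sampling procedure. It contrasts a random hold-out of drug-target pairs with a "cold-drug" hold-out in which every held-out drug is absent from training. The function names (`random_split`, `cold_drug_split`, `seen_drug_fraction`) and the toy pairs are hypothetical and chosen only for illustration.

```python
import random


def random_split(pairs, test_frac=0.2, seed=0):
    """Shuffle (drug, target) pairs and hold out a fraction at random."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_frac))
    return shuffled[n_test:], shuffled[:n_test]


def cold_drug_split(pairs, test_frac=0.2, seed=0):
    """Hold out whole drugs, so no test drug ever appears in training."""
    rng = random.Random(seed)
    drugs = sorted({drug for drug, _ in pairs})
    rng.shuffle(drugs)
    n_test = max(1, int(len(drugs) * test_frac))
    held_out = set(drugs[:n_test])
    train = [p for p in pairs if p[0] not in held_out]
    test = [p for p in pairs if p[0] in held_out]
    return train, test


def seen_drug_fraction(train, test):
    """Fraction of test pairs whose drug also occurs in the training set."""
    train_drugs = {drug for drug, _ in train}
    return sum(drug in train_drugs for drug, _ in test) / len(test)


# Hypothetical toy data: 10 drugs, 5 targets, 25 known associations.
pairs = [(f"D{i}", f"T{j}") for i in range(10) for j in range(5) if (i + j) % 2 == 0]

rand_train, rand_test = random_split(pairs)
cold_train, cold_test = cold_drug_split(pairs)

print("random split, test pairs with a seen drug:", seen_drug_fraction(rand_train, rand_test))
print("cold-drug split, test pairs with a seen drug:", seen_drug_fraction(cold_train, cold_test))
```

Under the random split, test pairs typically involve drugs the model has already seen with other targets, which can inflate apparent performance. The cold-drug split removes that overlap by construction, and the same idea applies symmetrically to targets.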