Drug-Target Interactions Prediction at Scale: The Komet Algorithm with the LCIdb Dataset.

Affiliations

Center for Computational Biology (CBIO), Mines Paris-PSL, 75006 Paris, France.

Institut Curie, Université PSL, 75005 Paris, France.

Publication information

J Chem Inf Model. 2024 Sep 23;64(18):6938-6956. doi: 10.1021/acs.jcim.4c00422. Epub 2024 Sep 5.

Abstract

Drug-target interaction (DTI) prediction algorithms are used at various stages of the drug discovery process. In this context, specific problems such as the deorphanization of a new therapeutic target, or target identification for a drug candidate arising from phenotypic screens, require large-scale predictions across the protein and molecule spaces. DTI prediction relies heavily on supervised learning algorithms that use known DTIs to learn associations between molecule and protein features, allowing new interactions to be predicted from the learned patterns. The algorithms must be broadly applicable to enable reliable predictions, even in regions of the protein or molecule spaces where data may be scarce. In this paper, we address two key challenges in meeting these goals: building large, high-quality training datasets, and designing prediction methods that scale well enough to be trained on such datasets. First, we introduce LCIdb, a large, curated dataset of DTIs that offers extensive coverage of both the molecule space and the druggable protein space. Notably, LCIdb contains a much higher number of molecules than publicly available benchmarks, expanding coverage of the molecule space. Second, we propose Komet (Kronecker Optimized METhod), a DTI prediction pipeline designed for scalability without compromising performance. Komet uses a three-step framework that incorporates efficient computation choices tailored to large datasets, including the Nyström approximation. Specifically, Komet employs a Kronecker interaction module for (molecule, protein) pairs, which efficiently captures determinants of DTIs and whose structure allows for reduced computational complexity and quasi-Newton optimization, so that the model can handle large training sets without compromising performance. Our method is implemented in open-source software that leverages GPU parallel computation for efficiency.
We demonstrate the value of our pipeline on various datasets, showing that Komet displays superior scalability and prediction performance compared to state-of-the-art deep learning approaches. Additionally, we illustrate the generalization properties of Komet by reporting its performance on an external dataset and on a publicly available benchmark designed for scaffold hopping problems. Komet is available as open-source software at https://komet.readthedocs.io and all datasets, including LCIdb, can be found at https://zenodo.org/records/10731712.
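
The two computational ideas named in the abstract, the Nyström approximation and a Kronecker interaction between molecule and protein features, can be illustrated with a small NumPy sketch. This is not Komet's code: the Gaussian kernel, the function names (`rbf_kernel`, `nystrom_features`, `pair_scores`), the `gamma` value, and the weight matrix `W` are all assumptions chosen for illustration. The sketch only shows why the Kronecker structure is cheap: the score of a pair, written as an inner product with the Kronecker product of the two feature vectors, equals a bilinear form that never requires building that product.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_features(X, landmarks, gamma=0.5, eps=1e-10):
    """Nystrom feature map: inner products of the returned rows approximate
    the kernel, using only kernel evaluations against the landmark points."""
    K_ll = rbf_kernel(landmarks, landmarks, gamma)
    K_xl = rbf_kernel(X, landmarks, gamma)
    # The eigendecomposition of the landmark kernel yields K_ll^{-1/2};
    # near-zero eigenvalues are dropped for numerical stability.
    vals, vecs = np.linalg.eigh(K_ll)
    keep = vals > eps
    return K_xl @ vecs[:, keep] / np.sqrt(vals[keep])

def pair_scores(mol_feats, prot_feats, mol_idx, prot_idx, W):
    """Score each (molecule, protein) pair as x_m^T W x_p.

    This equals <vec(W), kron(x_m, x_p)> without ever forming the
    Kronecker product, whose length is d_mol * d_prot per pair.
    """
    return np.einsum("nd,de,ne->n", mol_feats[mol_idx], W, prot_feats[prot_idx])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical raw descriptors for 50 molecules and 30 proteins.
    mols = rng.normal(size=(50, 8))
    prots = rng.normal(size=(30, 6))
    # Use the first 10 rows of each set as (arbitrary) landmarks.
    mol_feats = nystrom_features(mols, mols[:10])
    prot_feats = nystrom_features(prots, prots[:10])
    # Random weights stand in for a trained model.
    W = rng.normal(size=(mol_feats.shape[1], prot_feats.shape[1]))
    pairs_m = np.array([0, 7, 42])
    pairs_p = np.array([3, 3, 29])
    print(pair_scores(mol_feats, prot_feats, pairs_m, pairs_p, W))
```

In this sketch the molecule and protein feature matrices are computed once and shared by all pairs, so the cost per pair depends on the two feature dimensions rather than on the number of distinct molecules or proteins. How Komet selects landmarks, reduces dimension, and fits its weights is described in the paper and its documentation, not here.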
