On the difficulty of validating molecular generative models realistically: a case study on public and proprietary data.

Authors

Handa Koichi, Thomas Morgan C, Kageyama Michiharu, Iijima Takeshi, Bender Andreas

Affiliations

Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK.

Toxicology & DMPK Research Department, Teijin Institute for Bio-Medical Research, Teijin Pharma Limited, 4-3-2 Asahigaoka, Hino-Shi, Tokyo, 191-8512, Japan.

Publication information

J Cheminform. 2023 Nov 21;15(1):112. doi: 10.1186/s13321-023-00781-1.

DOI:10.1186/s13321-023-00781-1
PMID:37990215
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10664602/
Abstract

While a multitude of deep generative models have recently emerged there exists no best practice for their practically relevant validation. On the one hand, novel de novo-generated molecules cannot be refuted by retrospective validation (so that this type of validation is biased); but on the other hand prospective validation is expensive and then often biased by the human selection process. In this case study, we frame retrospective validation as the ability to mimic human drug design, by answering the following question: Can a generative model trained on early-stage project compounds generate middle/late-stage compounds de novo? To this end, we used experimental data that contains the elapsed time of a synthetic expansion following hit identification from five public (where the time series was pre-processed to better reflect realistic synthetic expansions) and six in-house project datasets, and used REINVENT as a widely adopted RNN-based generative model. After splitting the dataset and training REINVENT on early-stage compounds, we found that rediscovery of middle/late-stage compounds was much higher in public projects (at 1.60%, 0.64%, and 0.21% of the top 100, 500, and 5000 scored generated compounds) than in in-house projects (where the values were 0.00%, 0.03%, and 0.04%, respectively). Similarly, average single nearest neighbour similarity between early- and middle/late-stage compounds in public projects was higher between active compounds than inactive compounds; however, for in-house projects the converse was true, which makes rediscovery (if so desired) more difficult. We hence show that the generative model recovers very few middle/late-stage compounds from real-world drug discovery projects, highlighting the fundamental difference between purely algorithmic design and drug discovery as a real-world process. 
Evaluating de novo compound design approaches appears, based on the current study, difficult or even impossible to do retrospectively.

Scientific Contribution: This contribution hence illustrates aspects of evaluating the performance of generative models in a real-world setting which have not been extensively described previously and which hopefully contribute to their further future development.
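The two quantities reported in the abstract, the rediscovery rate among the top-N scored generated compounds and the average single nearest neighbour similarity, can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes that molecules are given as already-canonicalised SMILES strings (so exact string match stands in for structure identity), that fingerprints are represented as sets of on-bit indices, and that similarity is Tanimoto; the function names, the deduplication step, the choice of denominator, and the query/reference direction are all hypothetical choices made for illustration.

```python
from typing import Iterable, Sequence


def rediscovery_rate(scored: Sequence[tuple[str, float]],
                     held_out: Iterable[str],
                     top_n: int) -> float:
    """Percentage of the top_n best-scored generated compounds that exactly
    match a held-out (middle/late-stage) compound.

    Assumption: SMILES are canonicalised upstream, so string equality is used.
    """
    targets = set(held_out)
    # Assumption: duplicates are collapsed, keeping the best score per compound.
    best: dict[str, float] = {}
    for smiles, score in scored:
        if smiles not in best or score > best[smiles]:
            best[smiles] = score
    ranked = sorted(best, key=lambda s: best[s], reverse=True)[:top_n]
    if not ranked:
        return 0.0
    hits = sum(1 for smiles in ranked if smiles in targets)
    # Assumption: the denominator is the number of compounds actually ranked.
    return 100.0 * hits / len(ranked)


def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as on-bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_snn_similarity(queries: Sequence[frozenset[int]],
                        references: Sequence[frozenset[int]]) -> float:
    """Average, over the queries, of the highest Tanimoto similarity to any
    reference fingerprint (single nearest neighbour similarity)."""
    if not queries or not references:
        raise ValueError("both fingerprint lists must be non-empty")
    return sum(max(tanimoto(q, r) for r in references)
               for q in queries) / len(queries)


if __name__ == "__main__":
    # Toy data only; not taken from the study.
    generated = [("CCO", 0.9), ("CCN", 0.8), ("CCC", 0.7), ("CCO", 0.5)]
    late_stage = {"CCN", "c1ccccc1"}
    print(rediscovery_rate(generated, late_stage, top_n=2))  # 50.0

    early_fps = [frozenset({1, 2, 3})]
    late_fps = [frozenset({1, 2}), frozenset({2, 3, 4})]
    print(mean_snn_similarity(late_fps, early_fps))
```

In a real workflow the SMILES would be canonicalised and the fingerprints computed with a cheminformatics toolkit. The split into early-stage and middle/late-stage compounds would follow the elapsed-time information described in the abstract.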

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e227/10664602/2f9dd0964efe/13321_2023_781_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e227/10664602/69d8978d9c57/13321_2023_781_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e227/10664602/83d60156600a/13321_2023_781_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e227/10664602/6b0ce7c6b556/13321_2023_781_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e227/10664602/f133e3b97e97/13321_2023_781_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e227/10664602/3443c66dc897/13321_2023_781_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e227/10664602/e591f3a777e9/13321_2023_781_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e227/10664602/028f7b9d025f/13321_2023_781_Fig8a_HTML.jpg
