Ock Janghoon, Meda Radheesh Sharma, Badrinarayanan Srivathsan, Aluru Neha S, Chandrasekhar Achuth, Barati Farimani Amir
Department of Chemical Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, Pennsylvania 15213, United States.
Department of Chemical and Biomolecular Engineering, University of Nebraska–Lincoln, Lincoln, Nebraska 68588, United States.
J Chem Inf Model. 2026 Feb 23;66(4):2055-2068. doi: 10.1021/acs.jcim.5c02454. Epub 2026 Feb 9.
We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early-stage computational drug discovery pipeline. By combining LLM reasoning with domain-specific tools, the framework performs biomedical data retrieval, literature-grounded question answering via retrieval-augmented generation, molecular generation, multiproperty prediction, property-aware molecular refinement, and 3D protein-ligand structure generation. The agent autonomously retrieves relevant biomolecular information, including FASTA sequences, SMILES representations, and literature, and answers mechanistic questions with improved contextual accuracy compared to standard LLMs. It then generates chemically diverse seed molecules and predicts 75 properties, including ADMET-related and general physicochemical descriptors, which guide iterative molecular refinement. Across two refinement rounds, the number of molecules with QED > 0.6 increased from 34 to 55. The number of molecules satisfying empirical drug-likeness filters also rose; for example, compliance with the Ghose filter increased from 32 to 55 within a pool of 100 molecules. The framework also employed Boltz-2 to generate 3D protein-ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI-assisted therapeutic discovery.
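The screening counts reported above (molecules with QED > 0.6, molecules passing the Ghose filter) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the descriptors are computed, so the sketch assumes they are already available per molecule. The `Descriptors` class, the `passes_ghose` and `summarize` helpers, and all sample values are hypothetical. The Ghose ranges used are the commonly cited ones (molecular weight 160 to 480, logP -0.4 to 5.6, molar refractivity 40 to 130, atom count 20 to 70).

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties for one molecule (hypothetical container)."""
    mol_weight: float          # g/mol
    logp: float                # octanol-water partition coefficient
    molar_refractivity: float
    atom_count: int            # total number of atoms
    qed: float                 # quantitative estimate of drug-likeness, 0..1


def passes_ghose(d: Descriptors) -> bool:
    """Return True if all four commonly cited Ghose ranges are satisfied."""
    return (
        160 <= d.mol_weight <= 480
        and -0.4 <= d.logp <= 5.6
        and 40 <= d.molar_refractivity <= 130
        and 20 <= d.atom_count <= 70
    )


def summarize(pool: list[Descriptors], qed_cutoff: float = 0.6) -> dict[str, int]:
    """Count molecules above the QED cutoff and molecules passing Ghose."""
    return {
        "qed_above_cutoff": sum(d.qed > qed_cutoff for d in pool),
        "ghose_pass": sum(passes_ghose(d) for d in pool),
    }


# Invented descriptor values, for illustration only (not data from the paper).
pool = [
    Descriptors(320.4, 2.1, 85.0, 42, 0.71),   # passes Ghose, QED above cutoff
    Descriptors(512.6, 4.8, 140.2, 74, 0.35),  # too heavy, too many atoms
    Descriptors(150.2, 0.5, 38.0, 18, 0.62),   # too light, QED above cutoff
    Descriptors(410.5, 5.9, 110.0, 55, 0.48),  # logP out of range
]

print(summarize(pool))
```

Running `summarize` on the pool before and after each refinement round would give the kind of before/after counts quoted in the abstract.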