Deng Yifan, Ericksen Spencer S, Gitter Anthony
ArXiv. 2025 Jul 16:arXiv:2507.12574v1.
Scientific databases aggregate vast amounts of quantitative data alongside descriptive text. In biochemistry, molecule screening assays evaluate the functional responses of candidate molecules against disease targets. Unstructured text that describes the biological mechanisms through which these targets operate, experimental screening protocols, and other attributes of assays offers rich information for new drug discovery campaigns but has gone untapped because of its unstructured format. We present Assay2Mol, a large language model-based workflow that can capitalize on the vast body of existing biochemical screening assays for early-stage drug discovery. Assay2Mol retrieves existing assay records involving targets similar to the new target and generates candidate molecules using in-context learning with the retrieved assay screening data. Assay2Mol outperforms recent machine learning approaches that generate candidate ligand molecules for target protein structures, while also promoting more synthesizable molecule generation.
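The retrieve-then-prompt pattern described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the Assay2Mol implementation: the `AssayRecord` type, the `retrieve` and `build_prompt` helpers, and the word-overlap (Jaccard) similarity are hypothetical, and the abstract does not say how target similarity is actually computed. The call to a language model is omitted; the sketch stops at assembling the in-context prompt.

```python
from dataclasses import dataclass


@dataclass
class AssayRecord:
    """Hypothetical container for one assay: free text plus screening outcomes."""
    description: str
    results: list[tuple[str, bool]]  # (SMILES string, was the molecule active?)


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a simple stand-in for a real text retriever."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def retrieve(query: str, records: list[AssayRecord], k: int = 2) -> list[AssayRecord]:
    """Return the k records whose descriptions are most similar to the query."""
    ranked = sorted(records, key=lambda r: jaccard(query, r.description), reverse=True)
    return ranked[:k]


def build_prompt(query: str, retrieved: list[AssayRecord]) -> str:
    """Assemble an in-context prompt from the retrieved assay screening data."""
    lines = [f"Target description: {query}", "Related assay data:"]
    for record in retrieved:
        lines.append(f"Assay: {record.description}")
        for smiles, active in record.results:
            label = "active" if active else "inactive"
            lines.append(f"  {smiles} -> {label}")
    lines.append("Propose new candidate molecules as SMILES strings.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy placeholder records, not real screening results.
    records = [
        AssayRecord("kinase inhibition assay for EGFR", [("CCO", True)]),
        AssayRecord("bacterial growth assay", [("c1ccccc1", False)]),
    ]
    query = "EGFR kinase inhibitor screen"
    print(build_prompt(query, retrieve(query, records, k=1)))
```

The resulting prompt string would then be passed to a language model of the reader's choice. A production retriever would likely replace the word-overlap score with something stronger, such as text embeddings.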