Stratiichuk Roman, Melnychenko Mykola, Koleiev Ihor, Voitsitskyi Taras, Husak Vladyslav, Shevchuk Nazar, Ostrovsky Zakhar, Bdzhola Volodymyr, Yesylevskyy Semen, Starosyla Serhii, Nafiiev Alan
Receptor.AI Inc., London N1 7GU, United Kingdom.
Department of Biophysics and Medical Informatics, Educational and Scientific Centre "Institute of Biology and Medicine", Taras Shevchenko Kyiv National University, Kyiv 01601, Ukraine.
Bioinformatics. 2025 Aug 2;41(8). doi: 10.1093/bioinformatics/btaf449.
Accurately identifying and prioritizing protein binding pockets is a foundational element of small-molecule drug discovery. Defining these known pockets currently relies on a laborious manual process of extracting key residue data from selected publications, reconciling inconsistent terminology, and independently computing volumetric representations. This manual curation to ensure biological relevance is time-consuming, error-prone, and represents a major bottleneck for efficient, high-throughput drug discovery.
We present a novel approach for identifying and prioritizing protein binding pockets for small molecules by combining geometric pocket detection with large language models (LLMs). Our method uses Fpocket to generate candidate pockets, which are then validated against published experimental data. These data are extracted from research articles by an LLM using a series of prompts fine-tuned to identify and extract residue-level information associated with experimentally confirmed binding sites. We developed a curated benchmark dataset of diverse proteins and associated literature to train and evaluate the LLM on paper relevance assessment and pocket extraction.
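The validation step can be illustrated with a minimal sketch. It assumes that each candidate pocket and each literature-derived binding site is reduced to a set of (chain, residue number) pairs, and that a candidate is scored by the fraction of literature residues it contains. The `Pocket` class, the `coverage` and `rank_pockets` functions, the 0.5 threshold, and the example residues are all hypothetical and chosen for illustration; the abstract does not specify the matching criterion actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pocket:
    """A candidate pocket (hypothetical container, not Fpocket's own format)."""
    pocket_id: int
    residues: frozenset  # set of (chain, residue_number) pairs


def coverage(pocket, literature_residues):
    """Fraction of literature-reported residues that line the candidate pocket."""
    if not literature_residues:
        return 0.0
    return len(pocket.residues & literature_residues) / len(literature_residues)


def rank_pockets(pockets, literature_residues, min_coverage=0.5):
    """Keep candidates above an assumed coverage threshold, best first."""
    scored = [(coverage(p, literature_residues), p) for p in pockets]
    kept = [(score, p) for score, p in scored if score >= min_coverage]
    # Sort by descending coverage; break ties by pocket id for determinism.
    return sorted(kept, key=lambda sp: (-sp[0], sp[1].pocket_id))


# Illustrative data only: residues an LLM might extract from a paper,
# and three candidate pockets from a geometric detector.
literature = frozenset({("A", 45), ("A", 83), ("A", 145), ("A", 146)})
pockets = [
    Pocket(1, frozenset({("A", 45), ("A", 83), ("A", 145), ("A", 200)})),
    Pocket(2, frozenset({("A", 10), ("A", 11)})),
    Pocket(3, frozenset({("A", 45), ("A", 83), ("A", 145), ("A", 146)})),
]

ranked = rank_pockets(pockets, literature)
for score, pocket in ranked:
    print(pocket.pocket_id, score)
```

In this toy case pocket 3 contains all four reported residues, pocket 1 contains three of them, and pocket 2 shares none and is dropped. Coverage of the literature residues is used here rather than a symmetric overlap measure because geometric detectors often return pockets larger than the residue list reported in a paper; this is a design assumption of the sketch, not a claim about the published method.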
The developed benchmark dataset and methodology are freely available at the GitHub repository (https://github.com/receptor-ai/LLM-benchmark-dataset) and Zenodo (DOI: 10.5281/zenodo.15798647).