Oruç Tuğçe, Kadukova Maria, Davies Thomas G, Verdonk Marcel, Poelking Carl
Astex Pharmaceuticals, Cambridge, United Kingdom.
Bioinformatics. 2025 Jun 27;41(6). doi: 10.1093/bioinformatics/btaf284.
Binding sites are the key interfaces that determine a protein's biological activity, and are therefore common targets for therapeutic intervention. Techniques that help us detect, compare and contextualise binding sites are hence of immense interest to drug discovery.
Here we present an approach that integrates protein language models with a 3D tessellation technique to derive rich and versatile representations of binding sites that combine functional, structural and evolutionary information with unprecedented detail. We demonstrate that the associated similarity metrics induce meaningful pocket clusterings by balancing local structure against global sequence effects. The resulting embeddings are shown to simplify a variety of downstream tasks: they help organise the "pocketome" in a way that efficiently contextualises new binding sites, construct performant druggability models, and define challenging train-test splits for believable benchmarking of pocket-centric machine-learning models.
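As a rough illustration of the ideas in this paragraph, the following minimal sketch pools per-residue embeddings into one vector per pocket, compares pockets by cosine similarity, and builds a train-test split in which no test pocket resembles any training pocket above a chosen threshold. It is not the EPoCS implementation or its API. The function names, the optional `weights` argument (a hypothetical stand-in for geometry-derived residue weights), the single-linkage clustering, and the threshold are all assumptions made for illustration, and random arrays stand in for real protein language model output.

```python
import numpy as np


def pocket_embedding(residue_emb, pocket_idx, weights=None):
    """Pool per-residue embeddings of the pocket residues into one unit vector.

    residue_emb: (n_residues, dim) array, e.g. from a protein language model.
    pocket_idx:  indices of the residues lining the pocket.
    weights:     optional per-residue weights (hypothetical, e.g. contact-based).
    """
    sel = residue_emb[np.asarray(pocket_idx)]
    w = np.ones(len(sel)) if weights is None else np.asarray(weights, dtype=float)
    pooled = (w[:, None] * sel).sum(axis=0) / w.sum()
    return pooled / np.linalg.norm(pooled)


def similarity_matrix(pocket_embs):
    """Cosine similarity between unit-normalised pocket embeddings."""
    E = np.vstack(pocket_embs)
    return E @ E.T


def cluster_by_threshold(sim, threshold):
    """Single-linkage clusters: connected components of the graph sim >= threshold."""
    n = sim.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(sim[i] >= threshold):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def cluster_split(labels, test_fraction=0.25):
    """Move whole clusters (smallest first) into the test set until it holds
    at least test_fraction of all pockets; the rest form the training set."""
    ids, counts = np.unique(labels, return_counts=True)
    test_clusters, n_test = [], 0
    for k in np.argsort(counts):
        if n_test >= test_fraction * len(labels):
            break
        test_clusters.append(ids[k])
        n_test += counts[k]
    test_mask = np.isin(labels, test_clusters)
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Toy usage: random arrays stand in for language-model embeddings of 8 proteins.
rng = np.random.default_rng(0)
pockets = [
    pocket_embedding(rng.normal(size=(50, 16)), pocket_idx=[3, 7, 12, 20])
    for _ in range(8)
]
sim = similarity_matrix(pockets)
labels = cluster_by_threshold(sim, threshold=0.8)
train_idx, test_idx = cluster_split(labels, test_fraction=0.25)
```

Because whole single-linkage clusters are assigned to one side, every train-test pair has similarity below the threshold, which is one simple way to make a split harder than a random one.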
A Python package that implements the EPoCS method is freely available at https://github.com/tugceoruc/epocs.
Supplementary data (extended figures and method details) are available at Bioinformatics online.