CryptoBench: cryptic protein-ligand binding sites dataset and benchmark.

Authors

Škrhák Vít, Novotný Marian, Feidakis Christos P, Krivák Radoslav, Hoksza David

Affiliations

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, 118 00 Prague, Czech Republic.

Department of Cell Biology, Faculty of Science, Charles University, 128 43 Prague, Czech Republic.

Publication information

Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae745.

Abstract

MOTIVATION

Structure-based methods for detecting protein-ligand binding sites play a crucial role in various domains, from fundamental research to biomedical applications. However, current prediction methodologies often rely on holo (ligand-bound) protein conformations for training and evaluation, overlooking the significance of the apo (ligand-free) states. This oversight is particularly problematic in the case of cryptic binding sites (CBSs) where holo-based assessment yields unrealistic performance expectations.

RESULTS

To advance development in this domain, we introduce CryptoBench, a benchmark dataset tailored for training and evaluating novel CBS prediction methodologies. CryptoBench is constructed upon a large collection of apo-holo protein pairs, grouped by UniProt ID, clustered by sequence identity, and filtered to contain only structures with substantial structural change in the binding site. CryptoBench comprises 1107 structures with predefined cross-validation splits, making it the most extensive CBS dataset to date. To establish a performance baseline, we measured the predictive power of sequence- and structure-based CBS residue prediction methods using the benchmark. As structure-based methods, we selected PocketMiner, the state-of-the-art representative for CBS detection, and P2Rank, a widely used method for general binding site prediction that is not specifically tailored for cryptic sites. For sequence-based approaches, we trained a neural network to classify binding residues using protein language model embeddings. Our sequence-based approach outperformed PocketMiner and P2Rank across key metrics, including area under the curve, area under the precision-recall curve, Matthews correlation coefficient, and F1 score. These results provide a baseline for future CBS and potentially also non-CBS prediction endeavors, with CryptoBench serving as the foundational platform for further advancements in the field.
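As an illustration of two of the residue-level metrics named above, the following minimal sketch computes the F1 score and the Matthews correlation coefficient from binary labels and thresholded prediction scores. The toy labels, scores, the 0.5 threshold, and all function names are assumptions made for this example; they are not taken from CryptoBench or its code.

```python
import math


def confusion_counts(labels, predictions):
    """Count TP, FP, TN, FN for binary residue labels (1 = binding, 0 = non-binding)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, fp, tn, fn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(tp, fp, tn, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical per-residue data: true labels and predicted binding scores.
labels = [1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.2, 0.7, 0.1, 0.8, 0.3, 0.05]

# Assumed decision threshold; a real evaluation would choose this deliberately.
threshold = 0.5
predictions = [1 if s >= threshold else 0 for s in scores]

tp, fp, tn, fn = confusion_counts(labels, predictions)
f1 = f1_score(tp, fp, fn)
mcc = matthews_corrcoef(tp, fp, tn, fn)
print(f"TP={tp} FP={fp} TN={tn} FN={fn} F1={f1:.3f} MCC={mcc:.3f}")
```

Binding residues are typically a small minority of all residues in a protein, which is one reason to report MCC and the precision-recall curve alongside threshold-free ranking metrics.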

AVAILABILITY AND IMPLEMENTATION

The CryptoBench dataset, including the benchmark model, is available on the Open Science Framework at https://osf.io/pz4a9/. The code and tutorial are available in the GitHub repository at https://github.com/skrhakv/CryptoBench/.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/05ee/11725321/8d08a241c19e/btae745f1.jpg
