Department of Systems Biology, Columbia University Irving Medical Center, New York, New York, USA.
Integrated Graduate Program in Cellular, Molecular and Biomedical Studies (CMBS), Columbia University Irving Medical Center, New York, New York, USA.
Protein Sci. 2023 Apr;32(4):e4594. doi: 10.1002/pro.4594.
We describe the Predicting Protein-Compound Interactions (PrePCI) database, which comprises over 5 billion predicted interactions between 6.8 million chemical compounds and 19,797 human proteins. PrePCI relies on a proteome-wide database of structural models based on both traditional modeling techniques and the AlphaFold Protein Structure Database. Sequence- and structural similarity-based metrics are established between template proteins, T, in the Protein Data Bank that bind compounds, C, and query proteins in the model database, Q. When the metrics exceed threshold values, it is assumed that C also binds to Q, with a likelihood ratio (LR) derived from machine learning. If the relationship is based on structural similarity, the LR is based on a scoring function, described in the LT-scanner algorithm, that measures the extent to which C is compatible with the binding site of Q. For every predicted complex derived in this way, chemical similarity based on the Tanimoto coefficient identifies other small molecules that may bind to Q. An overall LR for the binding of C to Q is obtained from naive Bayesian statistics. The PrePCI database can be queried by entering a UniProt ID or gene name for a protein to obtain a list of compounds predicted to bind to it, along with the associated LRs. Alternatively, entering an identifier for a compound outputs a list of proteins it is predicted to bind. Specific applications of the database to lead discovery, elucidation of drug mechanism of action, and biological function annotation are described.
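As an illustration of two quantities named in the abstract, the sketch below computes a Tanimoto coefficient between two compound fingerprints and combines per-evidence likelihood ratios under the naive Bayes independence assumption. It is a minimal sketch, not the PrePCI implementation: the fingerprint representation (sets of on-bit indices), the function names `tanimoto`, `combined_lr`, and `posterior_probability`, the example LR values, and the prior odds are all assumptions chosen for illustration; the abstract does not specify the fingerprints, thresholds, or prior that PrePCI uses.

```python
from __future__ import annotations

from math import prod


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices.

    Defined as |A & B| / |A | B|. Two empty fingerprints are scored 0.0 here,
    which is a convention chosen for this sketch.
    """
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def combined_lr(evidence_lrs: list[float]) -> float:
    """Naive Bayes combination: LRs of conditionally independent evidence multiply."""
    return prod(evidence_lrs)


def posterior_probability(prior_odds: float, lr: float) -> float:
    """Convert prior odds and a likelihood ratio into a posterior probability."""
    posterior_odds = prior_odds * lr
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical fingerprints for a template-bound compound and a candidate compound.
fp_template = {1, 2, 3, 4}
fp_candidate = {3, 4, 5, 6}
similarity = tanimoto(fp_template, fp_candidate)

# Hypothetical LRs from sequence, structural, and chemical-similarity evidence.
overall_lr = combined_lr([4.0, 2.5, 2.0])

# Hypothetical prior odds that an arbitrary compound binds the query protein.
probability = posterior_probability(prior_odds=0.01, lr=overall_lr)

print(f"Tanimoto: {similarity:.3f}, overall LR: {overall_lr:.1f}, "
      f"posterior: {probability:.3f}")
```

Multiplying LRs is only valid to the extent that the evidence sources are conditionally independent, which is the simplifying assumption naive Bayes makes. Working in odds rather than probabilities keeps the update a single multiplication.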