Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): modeling performance of 13 amino acid descriptor sets.

Affiliation

Division of Medicinal Chemistry, Leiden / Amsterdam Center for Drug Research, Einsteinweg 55, 2333 CC Leiden, The Netherlands.

Publication information

J Cheminform. 2013 Sep 24;5(1):42. doi: 10.1186/1758-2946-5-42.

DOI:10.1186/1758-2946-5-42

PMID:24059743

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4015169/

Abstract

BACKGROUND

While a large body of work exists on comparing and benchmarking descriptors of molecular structures, a similar comparison of protein descriptor sets is lacking. Hence, in the current work a total of 13 amino acid descriptor sets have been benchmarked with respect to their ability to establish bioactivity models. The descriptor sets included in the study are Z-scales (3 variants), VHSE, T-scales, ST-scales, MS-WHIM, FASGAI, BLOSUM, and a novel protein descriptor set termed ProtFP (4 variants). In addition, we created and benchmarked three pairs of descriptor combinations. Prediction performance was evaluated in seven structure-activity benchmarks, which comprise angiotensin-converting enzyme (ACE) dipeptidic inhibitor data and three proteochemometric data sets, namely (1) GPCR ligands modeled against a GPCR panel, (2) enzyme inhibitors (NNRTIs) with associated bioactivities against a set of HIV enzyme mutants, and (3) enzyme inhibitors (PIs) with associated bioactivities on a large set of HIV enzyme mutants.

RESULTS

The amino acid descriptor sets compared here show similar performance (<0.1 log units RMSE difference and <0.1 difference in MCC), while errors for individual proteins were in some cases found to be larger than those resulting from descriptor set differences (>0.3 log units RMSE difference and >0.7 difference in MCC). Combining different descriptor sets generally leads to better modeling performance than utilizing individual sets. The best performers were Z-scales (3) combined with ProtFP (Feature), or Z-scales (3) combined with an average Z-scale value for each target, while ProtFP (PCA8), ST-scales, and ProtFP (Feature) rank last.
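The two performance measures quoted above can be illustrated with a minimal sketch. It assumes that RMSE is computed on log-unit activity values and that MCC is computed on binary active/inactive labels; the abstract does not state the activity threshold or the validation protocol, so the example data and the function names `rmse` and `mcc` are purely illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted activities."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up example values, not data from the study.
observed = [6.0, 7.0, 8.0]    # e.g. activities in log units
predicted = [6.5, 7.0, 7.5]
print(round(rmse(observed, predicted), 3))       # 0.408

labels = [1, 1, 0, 0]
calls = [1, 0, 0, 0]
print(round(mcc(labels, calls), 3))              # 0.577
```

On this scale, a difference of 0.1 log units in RMSE or 0.1 in MCC between two descriptor sets is small compared with the per-protein spread reported above.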

CONCLUSIONS

While amino acid descriptor sets capture different aspects of amino acids, their ability to be used for bioactivity modeling is still, on average, surprisingly similar. Nevertheless, combining sets that describe complementary information leads to a small but consistent improvement in modeling performance (average MCC 0.01 better, average RMSE 0.01 log units lower). Finally, performance differences exist between the targets compared, underlining that choosing an appropriate descriptor set is of fundamental importance for bioactivity modeling, both from the ligand side and from the protein side.
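One plausible way to combine amino acid descriptor sets is sketched below. The abstract does not specify how the combinations were built, so everything here is an assumption: the lookup tables `SET_A` and `SET_B` hold made-up placeholder numbers (not published Z-scale or ProtFP values), the helper names are hypothetical, and the "average value for each target" is read as the per-component mean over the residues of one sequence.

```python
# Toy per-residue descriptor tables. The numbers are placeholders chosen for
# illustration only; they are NOT published Z-scale or ProtFP values.
SET_A = {"A": [0.1, -0.2, 0.3], "G": [0.5, 0.0, -0.4], "L": [-0.6, 0.2, 0.1]}
SET_B = {"A": [1.0, 0.0], "G": [0.0, 1.0], "L": [1.0, 1.0]}


def encode(sequence, table):
    """Concatenate the descriptor vector of every residue, in sequence order."""
    features = []
    for residue in sequence:
        features.extend(table[residue])
    return features


def encode_combined(sequence, table_a, table_b):
    """Describe each residue with both sets, then concatenate over residues."""
    features = []
    for residue in sequence:
        features.extend(table_a[residue])
        features.extend(table_b[residue])
    return features


def encode_with_average(sequence, table):
    """Per-residue descriptors followed by the per-component mean over the sequence."""
    vectors = [table[residue] for residue in sequence]
    n_components = len(vectors[0])
    average = [sum(v[i] for v in vectors) / len(vectors) for i in range(n_components)]
    return encode(sequence, table) + average


# Sequences are assumed to be aligned and of equal length, so that every
# target yields a feature vector of the same size.
print(len(encode("AGL", SET_A)))                   # 9  (3 residues x 3 components)
print(len(encode_combined("AGL", SET_A, SET_B)))   # 15 (3 residues x 5 components)
print(len(encode_with_average("AGL", SET_A)))      # 12 (9 + 3 averaged components)
```

Concatenation keeps each set's information intact and leaves it to the model to weigh the components, which is consistent with the idea that complementary sets add a small amount of signal.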

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/75f8/4015169/b4fb3393c632/1758-2946-5-42-1.jpg
