The influence of negative training set size on machine learning-based virtual screening.

Affiliations

Department of Medicinal Chemistry, Institute of Pharmacology, Polish Academy of Sciences, Smętna 12, 31-343 Kraków, Poland.

Department of Medicinal Chemistry, Institute of Pharmacology, Polish Academy of Sciences, Smętna 12, 31-343 Kraków, Poland ; Faculty of Chemistry, Jagiellonian University, R. Ingardena 3, 30-060 Kraków, Poland.

Publication information

J Cheminform. 2014 Jun 11;6:32. doi: 10.1186/1758-2946-6-32. eCollection 2014.

DOI:10.1186/1758-2946-6-32
PMID:24976867
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4061540/
Abstract

BACKGROUND

The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods.

RESULTS

The impact of this rather neglected aspect of applying machine learning methods was examined for sets containing a fixed number of positive and a varying number of negative examples randomly selected from the ZINC database. An increase in the ratio of positive to negative training instances was found to greatly influence most of the investigated evaluation parameters of ML methods in simulated virtual screening experiments. In a majority of cases, substantial increases in precision and MCC were observed in conjunction with some decreases in hit recall. The analysis of the dynamics of those variations allowed us to recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Naïve Bayes, IBk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with the SMO or Random Forest algorithms. The Naïve Bayes models appeared to be hardly sensitive to changes in the number of negative instances in the training set.
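
The three evaluation parameters named above (precision, hit recall and MCC) can all be computed from one confusion matrix. The following is a minimal sketch of those standard definitions, not the authors' evaluation code. The function name `screening_metrics` and the two sets of counts are hypothetical and are not taken from the paper; they only illustrate the reported pattern of precision and MCC rising while recall falls.

```python
import math


def screening_metrics(tp, fp, tn, fn):
    """Return (precision, recall, MCC) for one confusion matrix.

    tp/fp/tn/fn are counts of true/false positives/negatives.
    Undefined ratios (zero denominator) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Matthews correlation coefficient: balanced measure in [-1, 1]
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


# Hypothetical counts for the same 1000-compound test set (100 actives),
# scored by a model trained with few negatives and one trained with many.
few_negatives = screening_metrics(tp=90, fp=60, tn=840, fn=10)
many_negatives = screening_metrics(tp=80, fp=10, tn=890, fn=20)

for label, (p, r, m) in [("few", few_negatives), ("many", many_negatives)]:
    print(f"{label:>4} negatives: precision={p:.3f} recall={r:.3f} MCC={m:.3f}")
```

In this made-up example the second model raises fewer false alarms at the cost of missing more actives, which is the trade-off the abstract describes.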

CONCLUSIONS

In conclusion, the ratio of positive to negative training instances should be taken into account during the preparation of machine learning experiments, as it might significantly influence the performance of a particular classifier. What is more, the optimization of negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
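
One way to treat the negative set size as a tunable parameter is to build a series of training sets that share the same positives and differ only in how many negatives are drawn at random from a large pool. The sketch below assumes that setup. The function `build_training_sets`, the compound identifiers and the chosen sizes are all hypothetical and do not reproduce the paper's protocol; each resulting set would then be passed to whichever classifier and fingerprint are being evaluated.

```python
import random


def build_training_sets(positives, negative_pool, negative_sizes, seed=0):
    """Build one labeled training set per requested negative-set size.

    positives      -- list of active compounds (kept fixed in every set)
    negative_pool  -- list of presumed inactives to sample from
    negative_sizes -- iterable of how many negatives each set should hold
    Returns {size: [(compound, label), ...]} with label 1 = active, 0 = inactive.
    """
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    training_sets = {}
    for size in negative_sizes:
        negatives = rng.sample(negative_pool, size)  # sample without replacement
        training_sets[size] = (
            [(compound, 1) for compound in positives]
            + [(compound, 0) for compound in negatives]
        )
    return training_sets


# Hypothetical identifiers standing in for real compounds.
actives = [f"active_{i}" for i in range(50)]
pool = [f"zinc_like_{i}" for i in range(5000)]

sets_by_size = build_training_sets(actives, pool, negative_sizes=[50, 250, 1000])
for size, examples in sets_by_size.items():
    n_pos = sum(label for _, label in examples)
    print(f"{size} negatives -> {len(examples)} examples, {n_pos} positives")
```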


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7971/4061540/f56fb97049f0/1758-2946-6-32-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7971/4061540/b9cc872a55f5/1758-2946-6-32-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7971/4061540/d054ce7b2423/1758-2946-6-32-2.jpg
