Virtual screening of bioassay data.

Affiliation

Smart Technology Research Centre, Bournemouth University, Poole House, Talbot Campus, Poole, Dorset, BH12 5BB, UK.

Publication information

J Cheminform. 2009 Dec 22;1:21. doi: 10.1186/1758-2946-1-21.

DOI:10.1186/1758-2946-1-21

PMID:20150999

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2820499/

Abstract

BACKGROUND

There are three main problems associated with the virtual screening of bioassay data. The first is access to freely available curated data, the second is the number of false positives that occur in the physical primary screening process, and the third is that the data is highly imbalanced, with a low ratio of Active compounds to Inactive compounds. This paper first discusses these three problems and then applies a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5 and Random Forest) to a variety of bioassay datasets.
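The class imbalance described above is what motivates cost-sensitive learning. As a minimal sketch (not Weka's implementation, and not necessarily the exact scheme used in the paper), the following Python shows the standard minimum-expected-cost decision rule for a two-class cost matrix. The function names and the cost values in the usage note are illustrative assumptions.

```python
def expected_costs(p_active, cost_fn, cost_fp):
    """Expected cost of each possible prediction for one compound.

    p_active: estimated probability that the compound is Active.
    cost_fn:  cost of labelling a true Active as Inactive (false negative).
    cost_fp:  cost of labelling a true Inactive as Active (false positive).
    Correct predictions are assumed to cost nothing.
    """
    cost_if_predict_active = (1.0 - p_active) * cost_fp
    cost_if_predict_inactive = p_active * cost_fn
    return cost_if_predict_active, cost_if_predict_inactive


def predict(p_active, cost_fn, cost_fp):
    """Pick the label with the lower expected cost."""
    cost_active, cost_inactive = expected_costs(p_active, cost_fn, cost_fp)
    return "Active" if cost_active < cost_inactive else "Inactive"


def decision_threshold(cost_fn, cost_fp):
    """Probability above which 'Active' becomes the cheaper prediction."""
    return cost_fp / (cost_fp + cost_fn)
```

With equal costs, a compound with `p_active = 0.2` is labelled Inactive. Raising `cost_fn` to 10 while keeping `cost_fp` at 1 makes Active the cheaper prediction, which is how a cost matrix can pull a classifier towards the rare class.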

RESULTS

Pharmaceutical bioassay data is not readily available to the academic community. The data held at PubChem is not curated, and there is a lack of detailed cross-referencing between Primary and Confirmatory screening assays. With regard to the number of false positives that occur in the primary screening process, the analysis carried out has been shallow because of this lack of cross-referencing. In the six cases found, the average percentage of false positives from the High-Throughput Primary screen is quite high, at 64%. For the cost-sensitive classification, Weka's implementations of the Support Vector Machine and the C4.5 decision tree learner performed relatively well. It was also found that the setting of the Weka cost matrix depends on the base classifier used and not solely on the ratio of class imbalance.
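The false-positive percentage mentioned above can be illustrated with a short Python sketch. It assumes that every compound flagged active in a primary screen was retested in the matching confirmatory screen, and that a false positive is a primary active that was not confirmed; the paper may define the figure differently. The function names and the compound identifiers are made up for illustration.

```python
def false_positive_percentage(primary_actives, confirmed_actives):
    """Percentage of primary-screen actives that were not confirmed.

    Both arguments are sets of compound identifiers. Only compounds
    flagged active in the primary screen are considered.
    """
    if not primary_actives:
        raise ValueError("primary screen has no active compounds")
    false_positives = primary_actives - confirmed_actives
    return 100.0 * len(false_positives) / len(primary_actives)


def mean_false_positive_percentage(assay_pairs):
    """Unweighted mean over (primary_actives, confirmed_actives) pairs."""
    percentages = [false_positive_percentage(p, c) for p, c in assay_pairs]
    return sum(percentages) / len(percentages)


# Invented compound IDs, for illustration only (not data from the paper).
pairs = [
    ({1, 2, 3, 4}, {1}),           # 3 of 4 primary actives not confirmed
    ({10, 11, 12, 13}, {10, 11}),  # 2 of 4 primary actives not confirmed
]
average = mean_false_positive_percentage(pairs)
```

Averaging per-assay percentages, as here, weights each assay pair equally regardless of how many compounds it contains; pooling all compounds before dividing would give a different figure.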

CONCLUSIONS

Understandably, pharmaceutical data is hard to obtain. However, both the pharmaceutical industry and academics would benefit if curated primary screening data and the corresponding confirmatory data were provided. Applying virtual screening techniques to bioassay data could bring two benefits: first, it reduces the search space of compounds to be screened; second, analysing the false positives that occur in the primary screening process may help improve the technology. The number of false positives arising from primary screening raises the question of whether this type of data should be used for virtual screening. Care is needed when using Weka's cost-sensitive classifiers: across-the-board misclassification costs based on class ratios should not be used when comparing different classifiers on the same dataset.
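The warning about across-the-board misclassification costs can be illustrated with a small Python sketch. It assumes a minimum-expected-cost decision rule (predict Active when the estimated probability exceeds `cost_fp / (cost_fp + cost_fn)`) and searches, separately for each classifier, for the smallest false-negative cost that reaches a target recall. The probability lists, the candidate costs and the function names are invented for illustration; they are not results or procedures from the paper.

```python
def recall_at_cost(probs, labels, cost_fn, cost_fp=1.0):
    """True positive rate when predicting Active iff p > cost_fp / (cost_fp + cost_fn)."""
    threshold = cost_fp / (cost_fp + cost_fn)
    actives = [p for p, y in zip(probs, labels) if y == 1]
    hits = sum(1 for p in actives if p > threshold)
    return hits / len(actives)


def smallest_cost_for_recall(probs, labels, target_recall, candidate_costs):
    """Return the first candidate false-negative cost that reaches the target recall."""
    for cost_fn in candidate_costs:
        if recall_at_cost(probs, labels, cost_fn) >= target_recall:
            return cost_fn
    return None  # no candidate was large enough


# Hypothetical outputs of two classifiers on the same six compounds
# (1 = Active, 0 = Inactive). Both see the same 2:1 class ratio.
labels = [1, 1, 0, 0, 0, 0]
probs_a = [0.40, 0.30, 0.20, 0.10, 0.05, 0.05]  # moderately spread scores
probs_b = [0.08, 0.06, 0.04, 0.02, 0.01, 0.01]  # compressed scores

costs = [1, 2, 4, 8, 16, 32]
cost_a = smallest_cost_for_recall(probs_a, labels, 1.0, costs)
cost_b = smallest_cost_for_recall(probs_b, labels, 1.0, costs)
```

In this toy setting the class ratio alone would suggest a false-negative cost of 2 for both classifiers, yet the first needs a cost of 4 and the second a cost of 16 to recover both Active compounds. The point is only that the appropriate cost depends on how each base classifier scores compounds, not just on the class ratio.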
