Developing and validating predictive decision tree models from mining chemical structural fingerprints and high-throughput screening data in PubChem.

Authors

Han Lianyi, Wang Yanli, Bryant Stephen H

Affiliation

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Publication

BMC Bioinformatics. 2008 Sep 25;9:401. doi: 10.1186/1471-2105-9-401.

DOI:10.1186/1471-2105-9-401

PMID:18817552

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2572623/

Abstract

BACKGROUND

Recent advances in high-throughput screening (HTS) techniques, together with readily available compound libraries generated by combinatorial chemistry or derived from natural products, enable the testing of millions of compounds in a matter of days. Because HTS assays produce so much information, mining HTS data for findings of potential interest to drug development research is very challenging. Computational approaches to analyzing HTS results must cope with both the large quantity of information and the significant amount of erroneous data produced.

RESULTS

In this study, decision tree (DT) models were developed to discriminate compound bioactivities using the chemical structure fingerprints provided in the PubChem system (http://pubchem.ncbi.nlm.nih.gov). The DT models were examined for filtering biological activity data contained in four assays deposited in the PubChem BioAssay database, including assays testing for 5HT1a agonists, antagonists, and HIV-1 RT-RNase H inhibitors. The 10-fold cross-validation (CV) sensitivity, specificity, and Matthews correlation coefficient (MCC) of the models were 57.2% to 80.5%, 97.3% to 99.0%, and 0.4 to 0.5, respectively. A further evaluation was performed for DT models built for two independent bioassays in which inhibitors of the same HIV RNase target were screened using different compound libraries; this experiment yielded enrichment factors of 4.4 and 9.7.
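As a minimal sketch of how the reported metrics relate to a confusion matrix, the Python function below computes sensitivity, specificity, MCC, and an enrichment factor from raw counts. The function name `screening_metrics` and the example counts are hypothetical, and the enrichment factor uses a commonly used definition (hit rate among predicted actives divided by the hit rate of the whole library), which is an assumption here because the abstract does not state the formula.

```python
import math


def screening_metrics(tp, fn, tn, fp):
    """Summarize a binary active/inactive classifier from confusion counts.

    tp: actives predicted active      fn: actives predicted inactive
    tn: inactives predicted inactive  fp: inactives predicted active
    """
    total = tp + fn + tn + fp
    if total == 0:
        raise ValueError("at least one compound is required")

    # Fraction of true actives that the model recovers.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Fraction of true inactives that the model rejects.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Enrichment factor (assumed definition): hit rate among the compounds
    # predicted active, relative to the hit rate of the full library.
    selected_hit_rate = tp / (tp + fp) if (tp + fp) else 0.0
    library_hit_rate = (tp + fn) / total
    enrichment = selected_hit_rate / library_hit_rate if library_hit_rate else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "enrichment_factor": enrichment,
    }


# Hypothetical counts for illustration only; these are not data from the study.
example = screening_metrics(tp=80, fn=20, tn=9800, fp=100)
```

With strongly imbalanced screening data, specificity can stay high while MCC remains moderate, which is one reason to report the measures together.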

CONCLUSION

Our results suggest that the designed DT models can be used as a virtual screening technique and as a complement to traditional approaches for hit selection.
