Han Lianyi, Wang Yanli, Bryant Stephen H
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
BMC Bioinformatics. 2008 Sep 25;9:401. doi: 10.1186/1471-2105-9-401.
Recent advances in high-throughput screening (HTS) techniques, together with readily available compound libraries generated by combinatorial chemistry or derived from natural products, enable the testing of millions of compounds in a matter of days. Because HTS assays produce so much information, mining the data for results of potential interest to drug development research is very challenging. Computational approaches to analyzing HTS results must cope with both the large quantity of information and the significant amount of erroneous data produced.
In this study, Decision Tree (DT) based models were developed to discriminate compound bioactivities using the chemical structure fingerprints provided in the PubChem system (http://pubchem.ncbi.nlm.nih.gov). The DT models were examined for filtering biological activity data contained in four assays deposited in the PubChem Bioassay Database, including assays tested for 5HT1a agonists and antagonists and for HIV-1 RT-RNase H inhibitors. Under 10-fold cross validation (CV), the models achieved a sensitivity of 57.2-80.5%, a specificity of 97.3-99.0%, and a Matthews Correlation Coefficient (MCC) of 0.4-0.5. A further evaluation was performed on DT models built for two independent bioassays in which inhibitors of the same HIV RNase target were screened using different compound libraries; this experiment yielded enrichment factors of 4.4 and 9.7.
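The evaluation measures named above can be illustrated with a minimal sketch. It assumes the standard textbook definitions of sensitivity, specificity and MCC, and the common definition of the enrichment factor as the hit rate among compounds predicted active divided by the hit rate in the whole library; the abstract does not state the exact formulas used. The function names and the toy labels are hypothetical and are not data from the study.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = active, 0 = inactive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def sensitivity(tp, fn):
    """Fraction of true actives that the model predicts as active."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of true inactives that the model predicts as inactive."""
    return tn / (tn + fp)

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; returns 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def enrichment_factor(tp, fp, n_actives, n_total):
    """Hit rate among predicted actives relative to the library-wide hit rate
    (assumed definition, not confirmed by the abstract)."""
    selected = tp + fp
    return (tp / selected) / (n_actives / n_total)

# Hypothetical toy library: 10 actives and 90 inactives.
y_true = [1] * 10 + [0] * 90
# Hypothetical predictions: 6 actives recovered, 2 inactives flagged as active.
y_pred = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 88

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("sensitivity:", round(sensitivity(tp, fn), 3))
print("specificity:", round(specificity(tn, fp), 3))
print("MCC:", round(mcc(tp, tn, fp, fn), 3))
print("enrichment factor:", round(enrichment_factor(tp, fp, sum(y_true), len(y_true)), 2))
```

In a cross-validation setting, these counts would typically be accumulated over the held-out folds before the measures are computed. With highly imbalanced HTS data, MCC is informative because it stays low when a model simply predicts the majority (inactive) class.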
Our results suggest that the designed DT models can be used as a virtual screening technique and as a complement to traditional approaches for hit selection.