Effect of training data size and noise level on support vector machines virtual screening of genotoxic compounds from large compound libraries.

Affiliation

Bioinformatics and Drug Design Group, Centre for Computational Science and Engineering, Department of Pharmacy, National University of Singapore.

Publication information

J Comput Aided Mol Des. 2011 May;25(5):455-67. doi: 10.1007/s10822-011-9431-3. Epub 2011 May 10.

DOI:10.1007/s10822-011-9431-3

PMID:21556903

Abstract

Various in vitro and in-silico methods have been used for drug genotoxicity tests, which show limited genotoxicity (GT+) and non-genotoxicity (GT-) identification rates. New methods and combinatorial approaches have been explored for enhanced collective identification capability. The identification rates of in-silico methods may be further improved by significantly diversified training data enriched by the large number of recently reported GT+ and GT- compounds, but a major concern is the increased noise level arising from the high false-positive rates of in vitro data. In this work, we evaluated the effect of training data size and noise level on the performance of the support vector machine (SVM) method, which is known to tolerate high noise levels in training data. Two SVMs of different diversity/noise levels were developed and tested. H-SVM, trained by higher-diversity, higher-noise data (GT+ in any in vivo or in vitro test), outperforms L-SVM, trained by lower-noise, lower-diversity data (GT+ in in vivo or Ames test only). H-SVM trained by 4,763 GT+ compounds reported before 2008 and 8,232 GT- compounds excluding clinical trial drugs correctly identified 81.6% of the 38 GT+ compounds reported since 2008, predicted 83.1% of the 2,008 clinical trial drugs as GT-, and predicted 23.96% of 168K MDDR and 27.23% of 17.86M PubChem compounds as GT+. These are comparable to the 43.1-51.9% GT+ and 75-93% GT- rates of existing in-silico methods, the 58.8% GT+ and 79% GT- rates of the Ames method, and the estimated percentages of 23% in vivo and 31-33% in vitro GT+ compounds in the "universe of chemicals". There is a substantial level of agreement between the H-SVM and L-SVM predicted GT+ and GT- MDDR compounds and the predictions from TOPKAT. SVM showed good potential in identifying GT+ compounds from large compound libraries based on higher-diversity, higher-noise training data.
