Neuroscience Research Institute, National Institute of Advanced Industrial Science and Technology, Umezono 1-1-1, Tsukuba, 305-8568, Japan.
Mol Divers. 2010 Nov;14(4):789-802. doi: 10.1007/s11030-010-9232-y. Epub 2010 Feb 26.
The Carcinogenicity Reliability Database (CRDB) was constructed by collecting experimental carcinogenicity data on about 1,500 chemicals from six sources, including the IARC and NTP databases, and then ranking the reliability of the data in six unified categories. A wide variety of 911 organic chemicals were selected from the database for QSAR modeling, and 1,504 different molecular descriptors were calculated from their 3D molecular structures using the Dragon software. Positive (carcinogenic) and negative (non-carcinogenic) chemicals containing various substructures were counted using atom and functional group count descriptors, and the statistical significance of the ratio of positives to negatives was tested for each substructure. Among the substructures that biomedical studies have identified as responsible for carcinogenicity, very few were judged to be strongly related to carcinogenicity. To develop QSAR models that predict the carcinogenicity of a wide variety of chemicals with satisfactory performance, the relationship between the carcinogenicity data with improved reliability and a subset of significant descriptors selected from the 1,504 Dragon descriptors was analyzed with a support vector machine (SVM) method. The classification function (SVC) for weighted data in the LIBSVM program was used to classify chemicals into two carcinogenic categories (positive or negative), with weights set according to the reliability of the carcinogenicity data. The quality and stability of the models were tested with a dual cross-validation procedure. As a first step, a single SVM model was developed for all 911 chemicals using 250 selected descriptors, achieving an overall accuracy (correct classification of both positives and negatives) of about 70%.
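The substructure screening described above (counting positives and negatives per substructure, then testing the ratio for significance) can be illustrated with a minimal sketch. The abstract does not say which statistical test was used; the exact two-sided binomial test against the overall positive rate shown here is an assumption, and the function names, substructure labels, and toy records are hypothetical.

```python
from math import comb

def binomial_two_sided_p(k: int, n: int, p0: float) -> float:
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k under success probability p0.
    """
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= threshold))

def substructure_significance(records, substructure):
    """Count positives/negatives containing a substructure and test the ratio.

    records: list of (set_of_substructure_labels, is_positive) pairs.
    Returns (n_positive, n_negative, p_value), where the null hypothesis
    (an assumption of this sketch) is the positive rate of the whole set.
    """
    base_rate = sum(1 for _, pos in records if pos) / len(records)
    labels = [pos for subs, pos in records if substructure in subs]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_pos, n_neg, binomial_two_sided_p(n_pos, len(labels), base_rate)

# Toy data (invented for illustration, not taken from the CRDB).
records = (
    [({"nitroso"}, True)] * 8
    + [({"ester"}, True)] * 2
    + [({"ester"}, False)] * 10
)

for name in ("nitroso", "ester"):
    n_pos, n_neg, p = substructure_significance(records, name)
    print(f"{name}: positives={n_pos}, negatives={n_neg}, p={p:.4f}")
```

With real data, a correction for multiple comparisons would likely be needed, since many substructures are tested at once.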
To improve the accuracy of the final model, the 911 chemicals were classified into 20 mutually overlapping subgroups according to the substructures they contain, a specific SVM model was optimized for each subgroup, and the predicted carcinogenicity of each of the 911 chemicals was determined by majority vote over the outputs of the corresponding SVM models. The model based on grouping the chemicals into 20 substructure-defined subgroups predicts the carcinogenicities of a wide variety of chemicals with a satisfactory overall accuracy of approximately 80%.
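The majority-vote step over overlapping subgroups can be sketched as follows. This is only an illustration under stated assumptions: the subgroup names, the `logP` descriptor, and the stand-in classifiers are hypothetical (the real models were SVMs trained per subgroup), and the tie-breaking rule and the handling of chemicals that match no subgroup are not specified in the abstract.

```python
from typing import Any, Callable, Dict, Optional

Chemical = Dict[str, Any]          # hypothetical record layout, see below
Model = Callable[[Chemical], int]  # returns +1 (positive) or -1 (negative)

def majority_vote(chem: Chemical, subgroup_models: Dict[str, Model]) -> Optional[int]:
    """Combine the outputs of all subgroup models that apply to a chemical.

    A model applies when the chemical contains the subgroup's substructure,
    so a chemical in several overlapping subgroups receives several votes.
    """
    votes = [
        model(chem)
        for substructure, model in subgroup_models.items()
        if substructure in chem["substructures"]
    ]
    if not votes:
        return None  # assumption: no applicable subgroup means no prediction
    # Assumption: ties are resolved toward the positive (carcinogenic) class.
    return 1 if sum(votes) >= 0 else -1

# Stand-in classifiers; in practice each would be a trained SVM.
subgroup_models: Dict[str, Model] = {
    "aromatic_amine": lambda c: 1,
    "halogen": lambda c: 1 if c["descriptors"]["logP"] > 3.0 else -1,
    "ester": lambda c: -1,
}

chem_a = {"substructures": {"aromatic_amine", "halogen", "ester"},
          "descriptors": {"logP": 4.2}}
chem_b = {"substructures": {"halogen", "ester"},
          "descriptors": {"logP": 1.0}}
chem_c = {"substructures": {"nitro"}, "descriptors": {"logP": 2.0}}

for name, chem in [("a", chem_a), ("b", chem_b), ("c", chem_c)]:
    print(name, majority_vote(chem, subgroup_models))
```

Here `chem_a` collects three votes (two positive, one negative), `chem_b` collects two negative votes, and `chem_c` matches no subgroup.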