Using fragment chemistry data mining and probabilistic neural networks in screening chemicals for acute toxicity to the fathead minnow.

Author information

Niculescu S P, Atkinson A, Hammond G, Lewis M

Publication information

SAR QSAR Environ Res. 2004 Aug;15(4):293-309. doi: 10.1080/10629360410001724941.

Abstract

This paper illustrates how general data mining methodology can be adapted to the problem of high-throughput virtual screening of organic chemicals for possible acute toxicity to the fathead minnow. The approach mines fragment information from chemical structures and uses probabilistic neural networks (PNNs) to model the relationship between structure and toxicity. Probabilistic neural networks implement a special class of multivariate non-linear Bayesian statistical models. The mathematical principles supporting their use for value prediction are clarified and their peculiarities discussed. As part of the research phase of the data mining process, a dataset of 800 structures with associated fathead minnow (Pimephales promelas) 96-h LC50 acute toxicity endpoint information is used both to identify an advantageous combination of fragment descriptors and to train the neural networks. Two powerful models result. Model 1 implements the basic PNN with Gaussian kernel (statistical corrections included), while Model 2 implements the PNN with Gaussian kernel and separated variables. External validation is performed on a separate dataset of 86 structures with associated toxicity information. Both the learning and the generalization capabilities of the two models are investigated and their limitations discussed.
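To make the idea of using a Gaussian-kernel PNN for value prediction more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the prediction is a kernel-weighted average of the training targets, and it reads "separated variables" as one smoothing width per descriptor, which is only one plausible interpretation of the abstract. The function name `pnn_predict`, the `sigmas` parameter, the zero-weight fallback, and the toy data are all hypothetical; the statistical corrections mentioned for Model 1 are not modeled.

```python
import math


def pnn_predict(query, train_x, train_y, sigmas):
    """Gaussian-kernel weighted average of the training targets.

    `sigmas` holds one smoothing width per descriptor. Passing the same
    value for every descriptor gives a single shared width; passing
    different values gives each variable its own width (assumed reading
    of "separated variables").
    """
    weights = []
    for x in train_x:
        # Scaled squared distance between the query and one training pattern.
        d2 = sum(((q - xi) / s) ** 2 for q, xi, s in zip(query, x, sigmas))
        weights.append(math.exp(-0.5 * d2))
    total = sum(weights)
    if total == 0.0:
        # Assumed fallback: the query is too far from every training
        # pattern for the kernel to resolve, so return the plain mean.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Toy, made-up data for illustration only (not from the paper):
# rows are hypothetical fragment counts, targets are hypothetical
# toxicity values on an arbitrary scale.
train_x = [[0, 1], [1, 0], [2, 1], [3, 2]]
train_y = [1.0, 2.0, 3.0, 4.0]

shared = pnn_predict([1, 1], train_x, train_y, sigmas=[1.0, 1.0])
separate = pnn_predict([1, 1], train_x, train_y, sigmas=[0.5, 2.0])
print(shared, separate)
```

Because the prediction is a weighted average, it always lies between the smallest and largest training targets, and shrinking the smoothing widths makes the model reproduce the training values more closely at the cost of smoother generalization.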
