Fuzzy ARTMAP prediction of biological activities for potential HIV-1 protease inhibitors using a small molecular data set.

Affiliation

Computer Science Department, Central Washington University, 400 E. University Way, Ellensburg, WA 98926, USA.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2011 Jan-Mar;8(1):80-93. doi: 10.1109/TCBB.2009.50.

Abstract

Obtaining satisfactory results with neural networks depends on the availability of large data samples. The use of small training sets generally reduces performance. Most classical Quantitative Structure-Activity Relationship (QSAR) studies for a specific enzyme system have been performed on small data sets. We focus on the neuro-fuzzy prediction of biological activities of HIV-1 protease inhibitory compounds when inferring from small training sets. We propose two computational intelligence prediction techniques that are suitable for small training sets, at the expense of some computational overhead. Both techniques are based on the FAMR model, a Fuzzy ARTMAP (FAM) incremental learning system used for classification and probability estimation. During the learning phase, each sample pair is assigned a relevance factor proportional to the importance of that pair. The two algorithms proposed in this paper are: 1) The GA-FAMR algorithm, which is new and consists of two stages: a) in the first stage, we use a genetic algorithm (GA) to optimize the relevances assigned to the training data, which improves the generalization capability of the FAMR; b) in the second stage, we use the optimized relevances to train the FAMR. 2) The Ordered FAMR, which is derived from a known algorithm: instead of optimizing relevances, it optimizes the order of data presentation using the algorithm of Dagher et al. In our experiments, we compare these two algorithms with an algorithm not based on the FAM, the FS-GA-FNN introduced in [4], [5]. We conclude that, when inferring from small training sets, both techniques are efficient in terms of generalization capability and execution time, and that the computational overhead introduced is compensated by better accuracy. Finally, the proposed techniques are used to predict the biological activities of newly designed potential HIV-1 protease inhibitors.
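
The two-stage idea behind GA-FAMR (first optimize per-sample relevances with a genetic algorithm, then train with the optimized relevances) can be illustrated with a minimal sketch. This is not the authors' implementation. The learner below is a relevance-weighted nearest-centroid classifier standing in for the FAMR. The fitness function (accuracy on a held-out validation split), the relevance range [0, 1], the GA operators, the toy data, and all function names are assumptions made for illustration only.

```python
import random

def train_weighted_centroids(X, y, relevances):
    """Stand-in learner (NOT the FAMR): relevance-weighted class centroids."""
    sums, totals = {}, {}
    for xi, yi, q in zip(X, y, relevances):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += q * v
        totals[yi] = totals.get(yi, 0.0) + q
    # Skip classes whose total relevance is zero to avoid division by zero.
    return {c: [s / totals[c] for s in sums[c]] for c in sums if totals[c] > 0}

def predict(model, x):
    """Return the class whose centroid is closest to x (squared Euclidean)."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(model[c], x)))

def accuracy(model, X, y):
    if not model:
        return 0.0
    return sum(predict(model, xi) == yi for xi, yi in zip(X, y)) / len(y)

def optimize_relevances(X_tr, y_tr, X_val, y_val, pop_size=20,
                        generations=30, mutation_rate=0.1, seed=0):
    """Stage 1 (assumed GA design): evolve one relevance in [0, 1] per sample."""
    rng = random.Random(seed)
    n = len(X_tr)

    def fitness(q):
        # Assumed fitness: validation accuracy of the stand-in learner.
        return accuracy(train_weighted_centroids(X_tr, y_tr, q), X_val, y_val)

    population = [[rng.random() for _ in range(n)] for _ in range(pop_size)]
    population[0] = [1.0] * n  # keep the uniform-relevance baseline as a candidate
    for _ in range(generations):
        elite = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: resample a gene with probability mutation_rate.
            child = [rng.random() if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = elite + children  # elitism keeps the best candidates
    return max(population, key=fitness)

# Toy data (hypothetical): two features, two classes, one ambiguous sample.
X_tr = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1], [0.6, 0.4]]
y_tr = [0, 0, 1, 1, 0]
X_val = [[0.1, 0.1], [0.9, 0.9], [0.7, 0.8]]
y_val = [0, 1, 1]

# Stage 1: optimize relevances. Stage 2: train with the optimized relevances.
best_q = optimize_relevances(X_tr, y_tr, X_val, y_val)
model = train_weighted_centroids(X_tr, y_tr, best_q)
baseline = train_weighted_centroids(X_tr, y_tr, [1.0] * len(X_tr))
```

Because the uniform-relevance vector is seeded into the initial population and the best candidates are carried over unchanged each generation, the optimized relevances never score below the uniform baseline on the validation split in this sketch. That guarantee applies to the validation data only and says nothing about performance on unseen compounds.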
