College of Physics Science and Technology, Yangzhou University, Jiangsu 225009, China.
J Chem Inf Model. 2024 Aug 12;64(15):5853-5866. doi: 10.1021/acs.jcim.4c00586. Epub 2024 Jul 25.
Machine learning helps accelerate drug discovery, and the design of effective machine learning models is crucial for accurately predicting molecular properties. Molecules are typically characterized by molecular fingerprints and molecular graphs, which are input into a multilayer perceptron (MLP) and variants of graph neural networks, such as graph attention networks (GATs). Because fingerprints are diverse in type and high in dimension, models may contain many features that are irrelevant or redundant. Meanwhile, although the GAT excels at heterogeneous graph tasks, it lacks the ability to extract collaborative information from neighboring nodes and therefore cannot capture the joint influence of adjacent groups on atoms. To overcome these challenges, we introduce a hybrid model that combines an improved GAT with an MLP. In the GAT, a recurrent neural network is employed to capture collaborative information. To address the dimensionality issue, we propose a feature selection algorithm based on the principle of maximizing relevance while minimizing redundancy. In experiments on 13 public data sets and 14 breast cell lines, our model outperforms state-of-the-art deep learning and traditional machine learning algorithms. Additionally, a series of ablation experiments demonstrates the advantages of the improved version, as well as its robustness to noise and its interpretability. These results indicate that our model holds promise for practical applications.
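The principle of maximizing relevance while minimizing redundancy can be illustrated with a generic greedy selection. The sketch below is not the algorithm proposed in the paper, whose details the abstract does not give. It assumes absolute Pearson correlation as a stand-in for both relevance (feature versus target) and redundancy (feature versus already selected features). The names `abs_corr` and `mrmr_select` are hypothetical.

```python
import numpy as np


def abs_corr(a, b):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return 0.0 if denom == 0 else float(abs(a @ b) / denom)


def mrmr_select(X, y, k):
    """Greedily pick k column indices of X (illustrative sketch only).

    Assumption: relevance and redundancy are both measured by absolute
    Pearson correlation; the paper's actual criterion may differ.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    k = min(k, n_features)

    # Relevance: how strongly each feature tracks the target.
    relevance = np.array([abs_corr(X[:, j], y) for j in range(n_features)])

    # Start with the single most relevant feature.
    selected = [int(np.argmax(relevance))]
    redundancy_sum = np.zeros(n_features)

    while len(selected) < k:
        last = selected[-1]
        # Accumulate each feature's correlation with the newest selection.
        redundancy_sum += np.array(
            [abs_corr(X[:, j], X[:, last]) for j in range(n_features)]
        )
        # Score = relevance minus mean redundancy with the selected set.
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf  # never pick the same feature twice
        selected.append(int(np.argmax(score)))

    return selected
```

With a fingerprint matrix `X` (one row per molecule, one column per bit) and a property vector `y`, `mrmr_select(X, y, k)` returns the indices of `k` columns to keep. A column that duplicates an already selected one is penalized by its redundancy term, so a less relevant but independent column is preferred over it.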