School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China.
Key Laboratory of China's Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou 730030, China.
Bioinformatics. 2022 Sep 30;38(19):4573-4580. doi: 10.1093/bioinformatics/btac550.
Extracting useful molecular features is essential for molecular property prediction. Atom-level representation, a common way to represent molecules, ignores sub-structure or branch information to some extent, whereas substring-level representation has the opposite limitation, capturing sub-structures at the expense of atom-level detail. Both atom-level and substring-level representations may lose the neighborhood or spatial information of molecules. Molecular graph representation aggregates the neighborhood information of a molecule but is weak at expressing chiral molecules or symmetrical structures. In this article, we aim to exploit the advantages of representations at different granularities simultaneously for molecular property prediction. To this end, we propose a fusion model named MultiGran-SMILES, which integrates the molecular features of atoms, sub-structures and graphs from the input. Compared with a single-granularity representation of molecules, our method combines the strengths of the different granularities and adaptively adjusts the contribution of each type of representation for molecular property prediction.
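To make the idea of adaptive multi-granularity fusion concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each granularity (atom, sub-structure, graph) has already been encoded into a feature vector of the same length, and that a gating score per granularity is available (in a trained model these scores would be learned). The names `atom_tokens`, `softmax` and `fuse`, the tokenizing regular expression and the softmax gating scheme are all illustrative assumptions; the abstract does not specify how MultiGran-SMILES tokenizes SMILES strings or weights its representations.

```python
import math
import re

# Assumption: a commonly used SMILES tokenizing regex, shown only to
# illustrate what "atom-level" granularity looks like. Not from the paper.
ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|[=#$:/\\\-+()%.@]|\d)"
)


def atom_tokens(smiles):
    """Split a SMILES string into atom-level tokens (atoms, bonds, brackets)."""
    return ATOM_PATTERN.findall(smiles)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, gate_scores):
    """Fuse equal-length feature vectors with softmax-normalized weights.

    features    -- one vector per granularity (hypothetical encoder outputs)
    gate_scores -- one score per granularity (hypothetical learned gates)
    Returns the fused vector and the weights that were applied.
    """
    weights = softmax(gate_scores)
    dim = len(features[0])
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, features))
        for i in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    print(atom_tokens("CC(=O)Cl"))
    # Toy vectors standing in for atom, sub-structure and graph features.
    toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    fused, weights = fuse(toy_features, [0.0, 0.0, 0.0])
    print(fused, weights)
```

With equal gate scores every granularity contributes equally; raising one score shifts weight toward that representation, which is one simple way the contribution of each granularity could be adjusted per molecule.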
The experimental results show that our MultiGran-SMILES method achieves state-of-the-art performance on the BBBP, LogP, HIV and ClinTox datasets. On the BACE, FDA and Tox21 datasets, the results are comparable with those of state-of-the-art models. Moreover, the gains of our proposed method are larger for molecules with obvious functional groups or branches.
The code and data underlying this work are available on GitHub at https://github.com/Jiangjing0122/MultiGran.
Supplementary data are available at Bioinformatics online.