Department of Computer Engineering, Bogazici University, Istanbul, Turkey.
Department of Chemical Engineering, Bogazici University, Istanbul, Turkey.
Bioinformatics. 2018 Jul 1;34(13):i295-i303. doi: 10.1093/bioinformatics/bty287.
The effective representation of proteins is a crucial task that directly affects performance on many bioinformatics problems. Related proteins usually bind to similar ligands. Chemical characteristics of ligands are known to capture the functional and mechanistic properties of proteins, suggesting that a ligand-based approach can be used for protein representation. In this study, we propose SMILESVec, a Simplified Molecular Input Line Entry System (SMILES)-based method to represent ligands, and a novel method to compute the similarity of proteins by describing them in terms of their ligands. Proteins are defined using the word embeddings of the SMILES strings of their ligands. The performance of the proposed protein description method is evaluated on a protein clustering task using the TransClust and MCL algorithms. It is compared with two protein representation methods that use protein sequence, the Basic Local Alignment Search Tool (BLAST) and ProtVec, and with two compound fingerprint-based protein representation methods.
We showed that ligand-based protein representation, which uses only the SMILES strings of the ligands that proteins bind to, performs as well as protein sequence-based representation methods in protein clustering. The results suggest that ligand-based protein description can be an alternative to the traditional sequence- or structure-based representation of proteins, and that this novel approach can be applied to other bioinformatics problems such as the prediction of new protein-ligand interactions and protein function annotation.
https://github.com/hkmztrk/SMILESVecProteinRepresentation.
Supplementary data are available at Bioinformatics online.
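The ligand-based representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not state: SMILES "words" are taken to be overlapping character k-mers, a ligand vector is the mean of its word vectors, a protein vector is the mean of its ligand vectors, and proteins are compared by cosine similarity. The function `toy_embedding` is a hypothetical stand-in for a trained word-embedding model; it returns deterministic pseudo-random vectors so that the example runs without any training data.

```python
import math
import random
from typing import List, Sequence


def smiles_words(smiles: str, k: int = 3) -> List[str]:
    """Split a SMILES string into overlapping character k-mers.

    Assumption: this tokenization is illustrative only.
    """
    if len(smiles) <= k:
        return [smiles]
    return [smiles[i:i + k] for i in range(len(smiles) - k + 1)]


def toy_embedding(word: str, dim: int = 8) -> List[float]:
    """Hypothetical placeholder for a learned word embedding.

    Seeding the generator with the word makes the vector deterministic.
    """
    rng = random.Random(word)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def ligand_vector(smiles: str, k: int = 3, dim: int = 8) -> List[float]:
    """Assumed ligand representation: mean of its word vectors."""
    return mean_vector([toy_embedding(w, dim) for w in smiles_words(smiles, k)])


def protein_vector(ligands: Sequence[str], k: int = 3, dim: int = 8) -> List[float]:
    """Assumed protein representation: mean of its ligand vectors."""
    return mean_vector([ligand_vector(s, k, dim) for s in ligands])


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical proteins, each described only by the SMILES of its ligands.
protein_a = ["CC(=O)Oc1ccccc1C(=O)O", "OC(=O)c1ccccc1O"]
protein_b = ["OC(=O)c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]

similarity = cosine(protein_vector(protein_a), protein_vector(protein_b))
print(f"ligand-based similarity: {similarity:.3f}")
```

With a trained embedding model in place of `toy_embedding`, a pairwise similarity matrix built this way could serve as input to a clustering algorithm; with the placeholder vectors used here, the printed value carries no chemical meaning.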