Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan.
School of Humanities, Nanyang Technological University, 48 Nanyang Ave, Singapore, 639798.
Anal Biochem. 2019 Jul 15;577:73-81. doi: 10.1016/j.ab.2019.04.011. Epub 2019 Apr 22.
Membrane transport proteins and their substrate specificities play crucial roles in various cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis, which makes it an important problem for bioinformatics researchers. In this study, we applied the word embedding approach, a main driver of the recent breakthroughs in natural language processing, to the protein sequences of transporters. We represented each protein sequence by the word embeddings and frequencies of its biological words. The resulting protein features were then fed into machine learning models for prediction. We also varied the length of the biological words that make up a protein sequence to find the length that generated the most discriminative feature set. Compared with four other feature types created from protein sequences, our proposed features help prediction models yield superior performance. Our best models reach an average area under the curve (AUC) of 0.96 on 5-fold cross-validation and 0.99 on the independent test. With this result, our study can help biologists identify transporters based on their substrate specificities and provides a basis for further research on applying natural language processing techniques in bioinformatics.
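As a rough illustration of the idea of describing a sequence by the embeddings and frequencies of its biological words, the following minimal sketch may help. It rests on several assumptions that the abstract does not confirm: biological words are taken to be overlapping k-mers, the two signals are combined as a frequency-weighted sum, and random vectors stand in for trained word embeddings. The function names `biological_words`, `toy_embeddings`, and `sequence_features` are hypothetical.

```python
from collections import Counter

import numpy as np


def biological_words(sequence, k=3):
    """Split a protein sequence into overlapping k-mers (assumed word definition)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_embeddings(words, dim=8, seed=0):
    """Random stand-in vectors; a real pipeline would train embeddings on a corpus."""
    rng = np.random.default_rng(seed)
    return {word: rng.normal(size=dim) for word in sorted(set(words))}


def sequence_features(sequence, embeddings, k=3):
    """Frequency-weighted sum of word embeddings (assumed combination rule)."""
    counts = Counter(biological_words(sequence, k))
    total = sum(counts.values())
    dim = len(next(iter(embeddings.values())))
    features = np.zeros(dim)
    for word, count in counts.items():
        if word in embeddings:  # skip words without an embedding
            features += (count / total) * embeddings[word]
    return features


# Hypothetical short sequence, used only to exercise the functions.
seq = "MKTLLVAGMKTLL"
words = biological_words(seq, k=3)
emb = toy_embeddings(words)
vec = sequence_features(seq, emb, k=3)
```

Changing `k` changes the word length, which corresponds to the search over word lengths mentioned above; the resulting fixed-length vector is the kind of input a standard classifier could consume.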