Ibtehaz Nabil, Sourav S M Shakhawat Hossain, Bayzid Md Shamsuzzoha, Rahman M Sohel
Department of CSE, BUET, ECE Building, West Palasi, Dhaka, 1205, Bangladesh.
Institute of Information Technology, University of Dhaka, Dhaka, Bangladesh.
Protein J. 2023 Apr;42(2):135-146. doi: 10.1007/s10930-023-10096-7. Epub 2023 Mar 28.
The inception of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the 'language of life', have been analyzed for a multitude of applications and inferences. Owing to the rapid development of deep learning, recent years have seen a number of breakthroughs in the domain of Natural Language Processing. Since these methods are capable of performing different tasks when trained with a sufficient amount of data, off-the-shelf models are used to perform various biological applications. In this study, we investigated the applicability of the popular Skip-gram model for protein sequence analysis and attempted to incorporate some biological insights into it. We propose a novel k-mer embedding scheme, Align-gram, which is capable of mapping similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram aid the modeling and training of deep learning models better. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram in performing different types of deep learning applications for protein sequence analysis.
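As a rough illustration of the standard Skip-gram setup that the abstract takes as its starting point, the sketch below splits a protein sequence into overlapping k-mers and lists the (target, context) pairs a Skip-gram model would be trained on. It is not the Align-gram method, whose details the abstract does not give. The function names `kmers` and `skipgram_pairs`, the choice of k = 3, the stride of 1, the window size, and the example sequence are all assumptions made for illustration.

```python
from typing import List, Tuple


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers with stride 1 (assumed tokenization)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """List (target, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a token is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical toy sequence, used only to show the shape of the data.
tokens = kmers("MKTAYIAK", k=3)
pairs = skipgram_pairs(tokens, window=1)
```

In plain Skip-gram, k-mers end up close in the vector space when they share contexts. Per the abstract, Align-gram aims instead to place similar k-mers close to each other by incorporating biological insight.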