Ranjan Ashish, Fahad Md Shah, Fernandez-Baca David, Tripathi Sudhakar, Deepak Akshay
IEEE/ACM Trans Comput Biol Bioinform. 2023 Mar-Apr;20(2):1188-1199. doi: 10.1109/TCBB.2022.3173789. Epub 2023 Apr 3.
This paper advances the self-attention mechanism of the standard transformer network specifically for modeling protein sequences. We introduce a novel context-window-based scaled self-attention mechanism for processing protein sequences that builds on the notions of (i) local context and (ii) larger contextual patterns, both of which are essential to building a good representation of protein sequences. The proposed context-window-based scaled self-attention mechanism is then used to build the multi-context-window-based scaled (MCWS) transformer network for protein function prediction at the protein sub-sequence level. Overall, the proposed MCWS transformer network produced improved predictive performance, outperforming existing state-of-the-art approaches by substantial margins. With respect to the standard transformer network, the proposed network improved the F1-score by +2.30% and +2.08% on the biological process (BP) and molecular function (MF) datasets, respectively. The corresponding improvements over the state-of-the-art ProtVecGen-Plus+ProtVecGen-Ensemble approach are +3.38% (BP) and +2.86% (MF). Equally important, robust performance was obtained across protein sequences of different lengths.
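The abstract does not give the exact formulation of the context-window-based scaled self-attention. As a rough illustration only, the following is a minimal NumPy sketch of scaled dot-product self-attention restricted to a fixed context window around each residue position; the window half-width, the projection matrices, and the masking scheme are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed, not the authors' code) of scaled self-attention
# restricted to a local context window around each sequence position.
import numpy as np

def context_window_attention(X, Wq, Wk, Wv, window):
    """X: (L, d) embedded protein sub-sequence; window: assumed half-width
    of the local context each position is allowed to attend to."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # linear projections
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # scaled dot-product scores
    L = X.shape[0]
    # Mask out positions outside the +/- window context of each residue.
    idx = np.arange(L)
    mask = np.abs(idx[:, None] - idx[None, :]) > window
    scores = np.where(mask, -1e9, scores)
    # Softmax over the allowed (local-context) positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # (L, d_v) local-context representation

# Example: a 10-residue sub-sequence with 8-dimensional embeddings, window 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = context_window_attention(X, Wq, Wk, Wv, window=2)
print(out.shape)  # (10, 8)
```

The MCWS network presumably combines several such attention branches with different window sizes to capture both local context and larger contextual patterns; how their outputs are fused is not specified in the abstract.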