Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China.
University of Chinese Academy of Sciences, Beijing 100049, China.
Bioinformatics. 2022 Jan 27;38(4):990-996. doi: 10.1093/bioinformatics/btab777.
Accurate prediction of protein structure relies heavily on exploiting multiple sequence alignments (MSAs) for residue mutations and correlations, as this information specifies protein tertiary structure. Widely used prediction approaches usually transform an MSA into an intermediate model, such as a position-specific scoring matrix or a profile hidden Markov model. These intermediate models, however, cannot fully represent the residue mutations and correlations carried by the MSA; hence, an effective way to exploit MSAs directly is highly desirable.
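The information loss described above can be illustrated with a minimal sketch. It assumes a simplified per-column frequency table as a stand-in for a position-specific scoring matrix; the function names and the two toy alignments are hypothetical and are not taken from the paper. Both alignments yield the same column profile even though only one of them has correlated columns.

```python
from collections import Counter


def column_profile(msa):
    """Per-column residue frequencies: a simplified stand-in for a PSSM."""
    n_seqs = len(msa)
    length = len(msa[0])
    profile = []
    for j in range(length):
        counts = Counter(seq[j] for seq in msa)
        profile.append({res: c / n_seqs for res, c in counts.items()})
    return profile


def pair_counts(msa, i, j):
    """Joint counts of residue pairs at columns i and j (the correlation signal)."""
    return Counter((seq[i], seq[j]) for seq in msa)


# Toy alignments (hypothetical data): columns 0 and 1 co-vary in the first
# alignment and vary independently in the second.
msa_coupled = ["AK", "AK", "DE", "DE"]
msa_uncoupled = ["AK", "AE", "DK", "DE"]

# The column-wise profile cannot tell the two alignments apart ...
same_profile = column_profile(msa_coupled) == column_profile(msa_uncoupled)
# ... although their pairwise statistics differ.
same_pairs = pair_counts(msa_coupled, 0, 1) == pair_counts(msa_uncoupled, 0, 1)

print(same_profile, same_pairs)
```

A column-wise summary of this kind discards which residues occur together in the same homologue, which is the kind of correlation a method that reads the MSA directly can still access.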
Here, we report a novel sequence set network (called Seq-SetNet) to directly and effectively exploit MSAs for protein structure prediction. Seq-SetNet uses an 'encoding and aggregation' strategy that consists of two key elements: (i) an encoding module that takes a component homologue in the MSA as input and encodes residue mutations and correlations into context-specific features for each residue; and (ii) an aggregation module that aggregates the features extracted from all component homologues, which are then transformed into structural properties for the residues of the query protein. Because Seq-SetNet encodes each homologous protein individually, it can account for both insertions and deletions, as well as long-distance correlations among residues, and thus represents more information than the intermediate models. Moreover, the encoding module automatically learns effective features and thus avoids manual feature engineering. By using symmetric aggregation functions, Seq-SetNet processes the homologous proteins as a sequence set, making its predictions invariant to the order of these proteins. On popular benchmark sets, we demonstrate the successful application of Seq-SetNet to predicting the secondary structure and torsion angles of residues with improved accuracy and efficiency.
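The 'encoding and aggregation' strategy and its order invariance can be sketched as follows. This is an illustrative toy, not the authors' network: the hand-written encoder `encode_homologue` stands in for the learned encoding module, the three features are arbitrary, and element-wise max is assumed here as one example of a symmetric aggregation function. All names and sequences are hypothetical.

```python
import random


def encode_homologue(query, homologue, window=1):
    """Toy per-residue encoder (assumption): one feature vector per aligned position.

    Features: match with the query, gap indicator, and the fraction of
    matching positions in a small window (a crude 'context' feature).
    """
    assert len(query) == len(homologue), "homologue must be aligned to the query"
    length = len(query)
    features = []
    for i in range(length):
        match = 1.0 if homologue[i] == query[i] else 0.0
        gap = 1.0 if homologue[i] == "-" else 0.0
        lo, hi = max(0, i - window), min(length, i + window + 1)
        context = sum(homologue[k] == query[k] for k in range(lo, hi)) / (hi - lo)
        features.append((match, gap, context))
    return features


def aggregate(encoded):
    """Symmetric aggregation: element-wise max over all encoded homologues."""
    length = len(encoded[0])
    dim = len(encoded[0][0])
    return [
        tuple(max(h[i][d] for h in encoded) for d in range(dim))
        for i in range(length)
    ]


def predict_features(query, homologues):
    """Encode each homologue independently, then aggregate over the set."""
    return aggregate([encode_homologue(query, h) for h in homologues])


# Hypothetical query and aligned homologues ('-' marks a gap).
query = "MKVLA"
homologues = ["MKVLA", "MRV-A", "-KILA", "MKVLG"]

shuffled = homologues[:]
random.Random(0).shuffle(shuffled)

out_a = predict_features(query, homologues)
out_b = predict_features(query, shuffled)
print(out_a == out_b)  # aggregated features do not depend on homologue order
```

Because each homologue is encoded on its own and the max is taken over the whole set, reordering the homologues cannot change the aggregated per-residue features. In a real model these features would then be passed to further layers that predict structural properties.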
The code and datasets are available through https://github.com/fusong-ju/Seq-SetNet.
Supplementary data are available at Bioinformatics online.