Department of Applied Informatics, Silesian University of Technology, Gliwice, Poland.
Department of Bioinformatics and Telemedicine, Jagiellonian University Medical College, Kraków, Poland.
J Comput Chem. 2021 Jan 5;42(1):50-59. doi: 10.1002/jcc.26432. Epub 2020 Oct 15.
Predicting protein function and structure from sequence remains an unsolved problem in bioinformatics. The best-performing methods rely heavily on evolutionary information from multiple sequence alignments, so their accuracy deteriorates for sequences with few homologs, and, as sequence databases grow, they require long computation times. Here, a single-sequence-based prediction method called ProteinUnet is presented, which leverages a U-Net convolutional network architecture. It is compared to the SPIDER3-Single model, which is based on a long short-term memory bidirectional recurrent neural network architecture. Both methods achieve similar results for the prediction of secondary structure (both three- and eight-state), half-sphere exposure, and contact number, but ProteinUnet has half as many parameters, runs inference 17 times faster, and can be trained 11 times faster. Moreover, ProteinUnet tends to perform better for short sequences and for residues with a low number of local contacts. Additionally, loss weighting is presented as an effective way of increasing accuracy for rare secondary structures.
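The abstract does not state how the loss weights are computed, so the following is only a minimal sketch of one common form of loss weighting, assuming inverse-frequency class weights applied to a per-residue cross-entropy. The function names `inverse_frequency_weights` and `weighted_cross_entropy`, the normalization, and the toy labels are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def inverse_frequency_weights(labels, num_classes):
    """Assumed scheme: weight each class by the inverse of its frequency.

    A class that makes up exactly 1/num_classes of the residues gets weight 1;
    rarer classes get weights above 1, more common ones below 1.
    """
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    total = counts.sum()
    # Avoid division by zero for classes absent from this batch.
    counts = np.maximum(counts, 1.0)
    return total / (num_classes * counts)


def weighted_cross_entropy(probs, labels, class_weights, eps=1e-12):
    """Class-weighted cross-entropy averaged over residues.

    probs:  (n_residues, num_classes) predicted class probabilities
    labels: (n_residues,) integer secondary-structure class indices
    """
    # Probability assigned to the true class of each residue.
    picked = probs[np.arange(len(labels)), labels]
    residue_weights = class_weights[labels]
    per_residue = -residue_weights * np.log(picked + eps)
    # Normalize by the total weight so the loss scale stays comparable.
    return per_residue.sum() / residue_weights.sum()


# Toy example: class 1 stands in for a rare secondary-structure state.
labels = np.array([0, 0, 0, 1])
weights = inverse_frequency_weights(labels, num_classes=2)
probs = np.full((4, 2), 0.5)  # uninformative predictions
loss = weighted_cross_entropy(probs, labels, weights)
```

With this weighting, a misclassified residue from the rare class contributes more to the loss than one from the common class, which is the general mechanism by which loss weighting can raise accuracy on rare states.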