Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China.
National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Qixia District, 210023, Nanjing, Jiangsu Province, China.
Brief Bioinform. 2022 Nov 19;23(6). doi: 10.1093/bib/bbac401.
A thorough understanding of protein function and structure is a central goal of computational biology and contributes to our understanding of human beings. Because only a limited number of proteins have been annotated structurally and functionally, the scientific community has embraced self-supervised pre-training on large amounts of unlabeled protein sequences to learn protein embeddings. However, a protein is usually represented as a sequence of individual amino acids drawn from a small vocabulary (e.g. 20 amino acid types), which ignores the strong local semantics present in protein sequences. In this work, we propose a novel pre-training approach, SPRoBERTa. We first present an unsupervised protein tokenizer that learns protein representations from local fragment patterns. We then introduce a novel framework for a deep pre-training model that learns protein embeddings. After pre-training, our method can be easily fine-tuned for different protein tasks, including amino acid-level prediction (e.g. secondary structure prediction), amino acid pair-level prediction (e.g. contact prediction) and protein-level prediction (e.g. remote homology prediction and protein function prediction). Experiments show that our approach achieves significant improvements on all tasks and outperforms previous methods. We also provide detailed ablation studies and analysis of our protein tokenizer and training framework.
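To make the idea of tokenizing a protein into local fragments rather than single residues concrete, here is a minimal sketch. It is not the SPRoBERTa tokenizer: the abstract does not specify the algorithm, so a toy byte-pair-encoding (BPE) style merge procedure is used as a stand-in, and the function names `learn_merges`, `apply_merge` and `tokenize` as well as the example sequences are hypothetical.

```python
from collections import Counter


def apply_merge(tokens, left, right):
    """Replace each adjacent (left, right) pair in `tokens` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn up to `num_merges` fragment merges from unlabeled sequences (toy BPE).

    Assumption: BPE stands in for whatever unsupervised tokenizer the paper uses.
    """
    corpus = [list(seq) for seq in sequences]  # start from single amino acids
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:  # stop when no adjacent pair repeats
            break
        merges.append((left, right))
        corpus = [apply_merge(tokens, left, right) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Split a sequence into fragments by replaying the learned merges in order."""
    tokens = list(sequence)
    for left, right in merges:
        tokens = apply_merge(tokens, left, right)
    return tokens


if __name__ == "__main__":
    # Hypothetical toy sequences, not real training data.
    training = ["MKVLAAGIVG", "MKVLSAGIVA", "GGMKVLAAGI"]
    merges = learn_merges(training, num_merges=10)
    print(tokenize("MKVLAAGIV", merges))
```

Under this scheme, frequently co-occurring residues become single vocabulary entries, so the model sees fragment-level tokens while the concatenation of the tokens still reproduces the original sequence exactly.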