IFM Bioinformatics, Linköping University, Linköping, Sweden.
PLoS One. 2019 Aug 15;14(8):e0220182. doi: 10.1371/journal.pone.0220182. eCollection 2019.
In recent decades, huge efforts have been made in the bioinformatics community to develop machine learning-based methods for the prediction of structural features of proteins, in the hope of answering fundamental questions about the way proteins function and their involvement in several illnesses. The recent advent of Deep Learning has renewed the interest in neural networks, with dozens of methods being developed to take advantage of these new architectures. However, most methods are still heavily based on pre-processing of the input data, as well as on the extraction and integration of multiple hand-picked, manually designed features. Multiple Sequence Alignments (MSA) are the most common source of information in de novo prediction methods. Deep Networks that automatically refine the MSA and extract useful features from it would be immensely powerful. In this work, we propose a new paradigm for the prediction of protein structural features called rawMSA. The core idea behind rawMSA is borrowed from the field of natural language processing: amino acid sequences are mapped into an adaptively learned continuous space. This allows the whole MSA to be input into a Deep Network, thus rendering sequence profiles and other features pre-calculated from the MSA obsolete. We showcase the rawMSA methodology on three different prediction problems: secondary structure, relative solvent accessibility and inter-residue contact maps. We have rigorously trained and benchmarked rawMSA on a large set of proteins and have determined that it outperforms classical methods based on position-specific scoring matrices (PSSM) when predicting secondary structure and solvent accessibility, while performing on par with methods using more pre-calculated features in the inter-residue contact map prediction category in CASP12 and CASP13.
These results clearly demonstrate that rawMSA represents a promising development that can pave the way for improved methods that use rawMSA instead of sequence profiles to represent evolutionary information in the coming years. Availability: datasets, dataset generation code, evaluation code and models are available at: https://bitbucket.org/clami66/rawmsa.
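The core idea described above, mapping each residue of the MSA to an integer token and then to a learned continuous vector, can be illustrated with a minimal sketch. This is not the authors' implementation: the 25-symbol alphabet, the padding convention, the embedding dimension, and the helper names `encode_msa`, `init_embedding` and `embed` are all assumptions made for illustration, and the embedding table is randomly initialized here rather than learned.

```python
import random

# Hypothetical alphabet (an assumption, not taken from the paper):
# 20 standard amino acids, four ambiguity/rare codes, and the gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYXBZU-"
# Index 0 is reserved for padding, so real symbols start at 1.
TOKEN = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
UNKNOWN = TOKEN["X"]


def encode_msa(msa, depth):
    """Turn aligned sequences into a depth x length integer matrix (0 = padding)."""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all MSA rows must have the same length")
    # Keep at most `depth` sequences; unknown characters fall back to 'X'.
    rows = [[TOKEN.get(ch.upper(), UNKNOWN) for ch in row] for row in msa[:depth]]
    # Pad shallow alignments with all-zero rows so every input has the same depth.
    rows += [[0] * length for _ in range(depth - len(rows))]
    return rows


def init_embedding(vocab_size, dim, seed=0):
    """Random embedding table; in a real network these weights would be learned."""
    rng = random.Random(seed)
    table = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]
    table[0] = [0.0] * dim  # padding maps to the zero vector
    return table


def embed(tokens, table):
    """Replace each integer token with its embedding vector."""
    return [[table[t] for t in row] for row in tokens]


# Toy alignment: three aligned sequences of length six.
msa = ["MKV-LA", "MRVQLA", "MKI-LS"]
tokens = encode_msa(msa, depth=4)
table = init_embedding(vocab_size=len(ALPHABET) + 1, dim=4)
embedded = embed(tokens, table)
# Shape of the network input: depth, alignment length, embedding dimension.
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```

In a trained model the embedding table would be updated by backpropagation together with the rest of the network, which is what lets the raw alignment stand in for pre-computed profiles such as PSSMs.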