IEEE/ACM Trans Comput Biol Bioinform. 2020 Nov-Dec;17(6):1918-1931. doi: 10.1109/TCBB.2019.2911677. Epub 2020 Dec 8.
As the first step of machine-learning-based protein structure and function prediction, amino acid encoding plays a fundamental role in the final success of those methods. Unlike protein sequence encoding, amino acid encoding can be used for both residue-level and sequence-level prediction of protein properties when combined with different algorithms. However, it has not attracted enough attention in the past decades, and there have been no comprehensive reviews or assessments of encoding methods so far. In this article, we systematically classify amino acid encoding methods and present a comprehensive review and assessment of them. The methods are grouped into five categories according to their information sources and information extraction methodologies: binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding, and machine-learning encoding. Then, 16 representative methods from the five categories are selected and compared on protein secondary structure prediction and protein fold recognition tasks using large-scale benchmark datasets. The results show that the evolution-based, position-dependent encoding method PSSM achieved the best performance. The structure-based and machine-learning encoding methods also show some potential for further application; in particular, the neural-network-based distributed representation of amino acids may bring new light to this area. We hope that this review and assessment will be useful for future studies of amino acid encoding.
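To make the first of the five categories concrete, the following is a minimal sketch of binary (one-hot) encoding, in which each residue becomes a 20-dimensional vector with a single 1. It is an illustration only, not the implementation assessed in the article. The function name `one_hot_encode`, the alphabetical ordering of the 20 standard residues, and the choice to map unknown residues to an all-zero vector are all assumptions made here.

```python
# Assumption: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(sequence):
    """Encode a protein sequence as a list of 20-dimensional binary vectors.

    Hypothetical helper for illustration. Residues outside AMINO_ACIDS
    (for example 'X') are mapped to an all-zero vector, which is an
    assumption rather than a convention taken from the article.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence.upper():
        vector = [0] * len(AMINO_ACIDS)
        position = index.get(residue)
        if position is not None:
            vector[position] = 1  # exactly one position is set per known residue
        encoded.append(vector)
    return encoded


if __name__ == "__main__":
    for residue, vector in zip("ACX", one_hot_encode("ACX")):
        print(residue, vector)
```

Because the output keeps one vector per residue, it can feed residue-level predictors directly, or be pooled or concatenated for sequence-level tasks.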