Department of Chemistry, The University of Texas at Austin, 105 E 24th St., Austin, 78712, Texas, USA; Department of Molecular Biosciences, The University of Texas at Austin, 100 East 24th St., Stop A5000, Austin, 78712, Texas, USA. Electronic address: https://twitter.com/aiproteins.
Department of Integrative Biology, The University of Texas at Austin, 2415 Speedway, Stop C0930, Austin, 78712, Texas, USA.
Curr Opin Struct Biol. 2023 Feb;78:102518. doi: 10.1016/j.sbi.2022.102518. Epub 2023 Jan 3.
Machine and deep learning approaches can leverage the increasingly available massive datasets of protein sequences, structures, and mutational effects to predict variants with improved fitness. Many different approaches are being developed, but systematic benchmarking studies indicate that, although the specifics of the machine learning algorithm matter, the more important constraint is the availability and quality of the data used for training. Where little experimental data are available, models pre-trained on generic protein datasets by unsupervised or self-supervised learning can still perform well once refined via hybrid or transfer learning approaches. Overall, recent progress in this field has been staggering, and machine learning approaches will likely play a major role in future breakthroughs in protein biochemistry and engineering.