Mardikoraem Mehrsa, Woldring Daniel
Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI, USA.
Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI, USA.
Methods Mol Biol. 2022;2491:87-104. doi: 10.1007/978-1-0716-2285-8_5.
Proteins are small yet valuable biomolecules that play versatile roles in therapeutics and diagnostics. The sequence-structure-function paradigm raises the possibility of mapping amino acid sequence directly to function. However, the rugged nature of the protein fitness landscape and the astronomical number of possible mutations, even for small proteins, make navigating this system a daunting task. Moreover, functional proteins are scarce and, because of complex epistatic relationships, deleterious mutations are easily introduced, which compounds these challenges. This highlights the need for tools that complement current techniques such as rational design and directed evolution. To that end, state-of-the-art machine learning can save time and cost in finding high-fitness proteins by circumventing unnecessary wet-lab experiments. In the context of library design, machine learning provides valuable insights through features such as high adaptability to complex systems, multitasking, parallelism, and the ability to capture hidden trends in input data. Finally, advances in computational resources and the rapidly growing number of sequences in protein databases will allow machine learning to deliver more promising and detailed insights for protein library design. This chapter discusses fundamental concepts and a method for machine learning-driven library design that leverages deep sequencing datasets. We elaborate on (1) basic knowledge of machine learning algorithms, (2) the benefits of machine learning in library design, and (3) a methodology for implementing machine learning in library design.