Department of Mathematics, Michigan State University, MI 48824, USA.
Phys Chem Chem Phys. 2020 Feb 26;22(8):4343-4367. doi: 10.1039/c9cp06554g.
Recently, machine learning (ML) has established itself in various worldwide benchmarking competitions in computational biology, including Critical Assessment of Structure Prediction (CASP) and Drug Design Data Resource (D3R) Grand Challenges. However, the intricate structural complexity and high ML dimensionality of biomolecular datasets obstruct the efficient application of ML algorithms in the field. In addition to data and algorithms, an efficient ML machinery for biomolecular predictions must include structural representation as an indispensable component. Mathematical representations that simplify biomolecular structural complexity and reduce ML dimensionality have emerged as top performers in D3R Grand Challenges. This review is devoted to the recent advances in developing low-dimensional and scalable mathematical representations of biomolecules in our laboratory. We discuss three classes of mathematical approaches: algebraic topology, differential geometry, and graph theory. We elucidate how physical and biological challenges have guided the evolution and development of these mathematical apparatuses for massive and diverse biomolecular data. In this review, we focus the performance analysis on protein-ligand binding predictions, although these methods have had tremendous success in many other applications, such as protein classification, virtual screening, and the prediction of solubility, solvation free energies, toxicity, partition coefficients, and protein folding stability changes upon mutation.