Computational & Systems Biology Branch, Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, 9609 Medical Center Drive, Rockville, MD 20850, USA.
Brief Bioinform. 2020 Jul 15;21(4):1285-1292. doi: 10.1093/bib/bbz071.
A number of machine learning (ML)-based algorithms have been proposed for predicting mutation-induced stability changes in proteins. In this critical review, we used hypothetical reverse mutations to evaluate the performance of five representative algorithms and found that all of them suffer from overfitting. This approach is based on the fact that if a mutation from the wild-type protein to a mutant is destabilizing, then the hypothetical reverse mutation, from that mutant back to the wild type, must be stabilizing by the same amount. We analyzed the underlying issues and suggest that the main causes of the overfitting include training sets that were too small and model features that were not sufficiently informative for the task. We make recommendations on how to avoid overfitting in this important research area and how to improve the reliability and robustness of ML-based algorithms in general.
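The reverse-mutation test described above can be illustrated with a minimal sketch. It assumes each predictor returns a signed stability change (ddG) for a mutation and for its hypothetical reverse, so that a self-consistent predictor satisfies ddG(reverse) = -ddG(forward) under either sign convention. The function name `antisymmetry_report` and the two summary statistics (mean bias and the Pearson correlation between forward and reverse predictions) are illustrative choices, not necessarily the metrics used in the review.

```python
from statistics import mean


def antisymmetry_report(forward, reverse):
    """Summarize how well paired predictions obey ddG(reverse) = -ddG(forward).

    forward[i] is the predicted ddG for mutation i (wild type -> mutant);
    reverse[i] is the predicted ddG for the hypothetical reverse mutation.
    Both names are hypothetical and used only for this illustration.
    """
    if len(forward) != len(reverse) or len(forward) < 2:
        raise ValueError("need two equally sized lists with at least 2 pairs")

    # Bias: for a perfectly antisymmetric predictor, forward + reverse is 0
    # for every pair, so the mean of (forward + reverse) / 2 is 0.
    bias = mean((f + r) / 2 for f, r in zip(forward, reverse))

    # Pearson correlation between forward and reverse predictions;
    # the ideal value is -1.
    mf, mr = mean(forward), mean(reverse)
    cov = sum((f - mf) * (r - mr) for f, r in zip(forward, reverse))
    var_f = sum((f - mf) ** 2 for f in forward)
    var_r = sum((r - mr) ** 2 for r in reverse)
    if var_f == 0 or var_r == 0:
        raise ValueError("predictions must not be constant")
    pearson_r = cov / (var_f * var_r) ** 0.5

    return {"bias": bias, "pearson_r": pearson_r}


if __name__ == "__main__":
    # Made-up numbers for illustration only, not data from any study.
    consistent = antisymmetry_report([1.0, -0.5, 2.0], [-1.0, 0.5, -2.0])
    print(consistent)
```

A bias far from zero, or a correlation far from -1, would suggest that a predictor has learned the skew of its training data (for example, a preponderance of destabilizing mutations) rather than the underlying physics.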