A systematic exploration of [Formula: see text] cutoff ranges in machine learning models for protein mutation stability prediction.

Author information

Olney Richard, Tuor Aaron, Jagodzinski Filip, Hutchinson Brian

Affiliations

* Western Washington University, Bellingham, WA, USA.

† Pacific Northwest National Laboratory, Seattle, WA, USA.

Publication information

J Bioinform Comput Biol. 2018 Oct;16(5):1840022. doi: 10.1142/S021972001840022X.

PMID:30419784

Abstract

Discerning how a mutation affects the stability of a protein is central to the study of a wide range of diseases. Mutagenesis experiments on physical proteins provide precise insights into the effects of amino acid substitutions, but such studies are time- and cost-prohibitive. Computational approaches, including a variety of machine learning models, are available to inform experimentalists where to allocate wet-lab resources. Assessing the accuracy of machine learning models for predicting the effects of mutations depends on amino acid substitution experiments performed in vitro. When similar experiments on physical proteins have been performed by multiple laboratories, the use of data near the juncture of stabilizing and destabilizing mutations is questionable. In this work, we explore a systematic and principled alternative to discarding experimental data close to that juncture. We model the inconclusive range of experimental [Formula: see text] values via 3- and 5-way classifiers, and systematically explore potential boundaries for this range. We demonstrate the effectiveness of potential boundaries through confusion matrices and heat map visualizations. We explore two novel metrics for assessing viable cutoff ranges, and find that under these metrics, a lower cutoff near [Formula: see text] and an upper cutoff near [Formula: see text] are optimal across multiple machine learning models.
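
The idea of a 3-way labeling with a lower and an upper cutoff, and of sweeping candidate cutoff pairs, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`label_three_way`, `class_counts`, `explore_cutoffs`), the neutral class names, the candidate cutoffs, and the toy values are all assumptions made for illustration. The abstract does not state the sign convention, the classifiers, or the two metrics, so none of those are modeled here.

```python
from itertools import product


def label_three_way(value, lower, upper):
    """Assign one of three classes to a single experimental value.

    Assumption: the class names are neutral placeholders, because the
    abstract does not say which sign corresponds to stabilizing mutations.
    """
    if lower > upper:
        raise ValueError("lower cutoff must not exceed upper cutoff")
    if value < lower:
        return "below"
    if value > upper:
        return "above"
    return "inconclusive"


def class_counts(values, lower, upper):
    """Count how many values fall into each of the three classes."""
    counts = {"below": 0, "inconclusive": 0, "above": 0}
    for v in values:
        counts[label_three_way(v, lower, upper)] += 1
    return counts


def explore_cutoffs(values, lowers, uppers):
    """Return class counts for every candidate pair with lower <= upper."""
    grid = {}
    for lower, upper in product(lowers, uppers):
        if lower <= upper:
            grid[(lower, upper)] = class_counts(values, lower, upper)
    return grid


if __name__ == "__main__":
    # Made-up values purely for illustration; not data from the paper.
    toy_values = [-2.1, -0.8, -0.2, 0.0, 0.1, 0.6, 1.5]
    grid = explore_cutoffs(toy_values, [-1.0, -0.5], [0.5, 1.0])
    for (lower, upper), counts in grid.items():
        print(lower, upper, counts)
```

In a fuller pipeline, each cutoff pair would relabel the training data, a classifier would be trained per pair, and the resulting confusion matrices would be compared; the sketch above covers only the relabeling and enumeration steps.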
