Wang Boshen, Perez-Rathke Alan, Li Renhao, Liang Jie
Bioinformatics Program, Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA.
Aflac Cancer and Blood Disorders Center, Department of Pediatrics, Emory University School of Medicine, Atlanta, GA 30322, USA.
IEEE EMBS Int Conf Biomed Health Inform. 2018 Mar;2018:341-344. doi: 10.1109/BHI.2018.8333438. Epub 2018 Apr 9.
Information on protein hydrogen exchange can help delineate key regions involved in protein-protein interactions and provides important insight into the functional roles of genetic variants and their possible mechanisms in disease processes. Previous studies have shown that the degree of hydrogen exchange is affected by hydrogen bond formation, solvent accessibility, proximity to other residues, and experimental conditions. However, a general method for predicting which residues are capable of hydrogen exchange, transferable to a broad set of proteins, is lacking. We have developed a machine learning method based on random forest that predicts whether a residue experiences hydrogen exchange. We evaluate it on data from the Start2Fold database, which contains information on 13,306 residues (3,790 of which experience hydrogen exchange and 9,516 of which do not). The method achieves an overall out-of-bag (OOB) error, an unbiased estimate of the test set error, of 20.3 percent. On a randomly selected test data set consisting of 500 residues that experience hydrogen exchange and 500 that do not, it achieves an accuracy of 0.79, a recall of 0.74, a precision of 0.82, and an F1 score of 0.78.
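As a reading aid for the reported test-set figures, the following minimal Python sketch shows how accuracy, recall, precision, and F1 follow from a binary confusion matrix. The abstract does not report a confusion matrix; the counts below are hypothetical values chosen only because they are consistent with a balanced 500/500 test set and with the rounded metrics quoted above. The function name `classification_metrics` is likewise an illustrative assumption, not part of the authors' method.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute standard binary classification metrics from confusion counts.

    tp: exchanging residues predicted as exchanging
    fn: exchanging residues predicted as non-exchanging
    fp: non-exchanging residues predicted as exchanging
    tn: non-exchanging residues predicted as non-exchanging
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts (NOT reported in the abstract): 500 exchanging and
# 500 non-exchanging residues, picked to match the rounded metrics.
metrics = classification_metrics(tp=370, fn=130, fp=81, tn=419)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under these assumed counts the four values round to 0.79, 0.74, 0.82, and 0.78, which suggests the reported metrics are mutually consistent for a balanced test set; other counts close to these would round to the same figures.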