Advanced Technology Development Centre, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal 721302, India.
Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal 721302, India.
J Chem Inf Model. 2020 Dec 28;60(12):6679-6690. doi: 10.1021/acs.jcim.0c00802. Epub 2020 Nov 22.
Insertions and deletions of amino acids in the protein backbone can alter a protein's structural and functional properties. They can either contribute positively to evolution or lead to disease. Although they are the second most prevalent form of protein modification, no database or computational framework distinguishes harmful multipoint deletions (MPDs) from beneficial ones. We introduce PROFOUND, a positive-unlabeled learning-based prediction framework that uses fold-level attributes, environment-specific properties, and deletion site-specific properties to predict the change in foldability arising from MPDs in both the loop and non-loop regions of protein structures. Because no protein structure dataset exists for studying MPDs, we also introduce a dataset with 153 MPD instances that lead to native-like folded structures and 7650 unlabeled MPD instances whose effect on the foldability of the corresponding proteins is unknown. In 10-fold cross-validation on this dataset, PROFOUND achieves a recall of 82.2% (86.6%) and a fallout rate (FR) of 14.2% (20.6%) for MPDs in the protein loop (non-loop) region. The low FR suggests that foldability in proteins subject to MPDs is not random and depends on specific properties of the deleted region. We also find that adding evolutionary attributes further improves recall and lowers FR. This first-of-its-kind system for predicting foldability after MPDs, together with the newly introduced dataset, could aid novel protein engineering efforts.
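As a reading aid for the two reported metrics, the following is a minimal sketch of how recall and fallout rate are conventionally computed from binary predictions. It uses the standard textbook definitions (recall = TP / (TP + FN), fallout = FP / (FP + TN)). The abstract does not state how PROFOUND derives negatives from the unlabeled instances, so the labeled reference set here, the names `ConfusionCounts`, `count_outcomes`, `recall`, and `fall_out`, and the toy labels are all illustrative assumptions, not the authors' evaluation code or data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of the four outcomes of a binary classifier."""
    tp: int  # foldable MPDs predicted foldable
    fn: int  # foldable MPDs predicted non-foldable
    fp: int  # non-foldable MPDs predicted foldable
    tn: int  # non-foldable MPDs predicted non-foldable


def count_outcomes(y_true: Sequence[int], y_pred: Sequence[int]) -> ConfusionCounts:
    """Tally outcomes; labels are 1 (foldable) or 0 (non-foldable)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return ConfusionCounts(tp=tp, fn=fn, fp=fp, tn=tn)


def recall(c: ConfusionCounts) -> float:
    """Fraction of true positives that were recovered: TP / (TP + FN)."""
    return c.tp / (c.tp + c.fn)


def fall_out(c: ConfusionCounts) -> float:
    """Fraction of true negatives wrongly flagged positive: FP / (FP + TN)."""
    return c.fp / (c.fp + c.tn)


# Hypothetical toy labels, not taken from the PROFOUND dataset.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0]
counts = count_outcomes(y_true, y_pred)
print(f"recall={recall(counts):.2f} fall_out={fall_out(counts):.2f}")
```

A high recall with a low fallout rate means the classifier recovers most foldable deletions while rarely labeling a non-foldable deletion as foldable, which is the pattern the reported figures describe.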