Pan Yuliang, Liu Diwei, Deng Lei
School of Software, Central South University, Changsha, China.
Shanghai Key Laboratory of Intelligent Information Processing, Shanghai, China.
PLoS One. 2017 Jun 14;12(6):e0179314. doi: 10.1371/journal.pone.0179314. eCollection 2017.
Single amino acid variations (SAVs) can alter biological function, either causing disease or underlying natural differences between individuals. Identifying the relationship between a SAV and a particular disease is the starting point for understanding the mechanisms behind specific associations, and can aid the prevention and diagnosis of inherited disease. We propose PredSAV, a computational method that predicts how likely a SAV is to be associated with disease by combining the gradient tree boosting (GTB) algorithm with optimally selected neighborhood features. A two-step feature selection approach searches a wide range of sequence and structural features, including some novel structural neighborhood features, for the most relevant and informative neighborhood properties for predicting the disease association of SAVs. In cross-validation experiments on the benchmark dataset, PredSAV achieves an AUC of 0.908 and a specificity of 0.838, both significantly better than those of existing methods. An independent test further validates the method, with PredSAV again comparing favorably with existing methods. By combining gradient tree boosting with optimally selected neighborhood features, PredSAV returns reliable predictions when distinguishing disease-associated from neutral variants, with improved specificity and overall performance relative to existing methods.
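The abstract reports two evaluation metrics, AUC and specificity. As a minimal sketch of how these quantities are defined, the following computes both from scratch on a small toy example. The labels, scores, and the 0.5 decision threshold are made up for illustration and are not taken from the PredSAV benchmark; the function names are likewise hypothetical and do not reflect the authors' code.

```python
from typing import Sequence


def specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of neutral variants (label 0) that are predicted neutral: TN / (TN + FP)."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tn + fp == 0:
        raise ValueError("specificity is undefined without negative examples")
    return tn / (tn + fp)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive scores higher than a randomly chosen negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only): 1 = disease-associated, 0 = neutral.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

# Assumed decision threshold of 0.5 to turn scores into class predictions.
threshold = 0.5
predictions = [1 if s >= threshold else 0 for s in scores]

print("specificity:", specificity(labels, predictions))
print("AUC:", auc(labels, scores))
```

Specificity depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two figures are reported separately.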