Statistical geometry based prediction of nonsynonymous SNP functional effects using random forest and neuro-fuzzy classifiers

Barenboim Maxim, Masso Majid, Vaisman Iosif I, Jamison D Curtis

Department of Bioinformatics and Computational Biology, George Mason University, Manassas, Virginia 20110, USA.

Proteins. 2008 Jun;71(4):1930-9. doi: 10.1002/prot.21838.

There is substantial interest in methods designed to predict the effect of nonsynonymous single nucleotide polymorphisms (nsSNPs) on protein function, given their potential relationship to heritable diseases. Current state-of-the-art supervised machine learning algorithms, such as random forest (RF), train models that classify single amino acid mutations in proteins as either neutral or deleterious to function. However, the functional effect of a polymorphism on a protein frequently lies between these two extremes. Classifiers that incorporate fuzzy logic provide a natural extension that accounts for the spectrum of possible functional consequences. We generated a dataset of single amino acid substitutions in human proteins having known three-dimensional structures. Each variant was uniquely represented as a feature vector that included computational geometry and knowledge-based statistical potential predictors obtained through application of Delaunay tessellation of protein structures. Additional attributes consisted of physicochemical properties of the native and replacement amino acids as well as the topological location of the mutated residue position in the solved structure. Classification performance of the RF algorithm was evaluated on a training set consisting of the disease-associated and neutral nsSNPs taken from our dataset, and attributes were ranked according to their relative importance. Similarly, we evaluated the performance of an adaptive neuro-fuzzy inference system (ANFIS). The utility of statistical geometry predictors was compared with that of traditional structural and evolutionary attributes employed by other researchers, revealing an equally effective yet complementary methodology. Among all attributes in our feature set, the statistical geometry predictors were found to be the most highly ranked. On the basis of the AUC (area under the ROC curve) measure of performance, the ANFIS and RF models were equally effective when only statistical geometry features were utilized. Tenfold cross-validation studies evaluating AUC, balanced error rate (BER), and Matthews correlation coefficient (MCC) showed that our RF model was at least comparable with the well-established methods of SIFT and PolyPhen. The trained RF and ANFIS models were each subsequently used to predict the disease potential of human nsSNPs in our dataset that are currently unclassified (http://rna.gmu.edu/FuzzySnps/).
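The three performance measures named in the abstract (AUC, BER, and MCC) can be illustrated with a minimal sketch. This is not the authors' code; it uses the standard textbook definitions, and the function names and the convention that label 1 means "deleterious" and 0 means "neutral" are assumptions made for illustration. Both classes are assumed to be present in the labels.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels; 1 = deleterious (assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def balanced_error_rate(tp, tn, fp, fn):
    """BER: mean of the false negative rate and the false positive rate."""
    false_negative_rate = fn / (tp + fn)
    false_positive_rate = fp / (tn + fp)
    return 0.5 * (false_negative_rate + false_positive_rate)

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC in [-1, 1]; returns 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half (rank-based definition)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    # Toy labels and classifier scores, invented for illustration only.
    y_true = [1, 1, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    counts = confusion_counts(y_true, y_pred)
    print("BER:", balanced_error_rate(*counts))
    print("MCC:", matthews_corrcoef(*counts))
    print("AUC:", auc(y_true, scores))
```

BER and MCC depend on a hard decision threshold (0.5 here, an arbitrary choice), whereas AUC is computed from the continuous scores, which is why it suits comparing a fuzzy classifier such as ANFIS with RF.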


Similar articles

Statistical geometry based prediction of nonsynonymous SNP functional effects using random forest and neuro-fuzzy classifiers.

Proteins. 2008 Jun;71(4):1930-9. doi: 10.1002/prot.21838.

Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information.

Bioinformatics. 2005 May 15;21(10):2185-90. doi: 10.1093/bioinformatics/bti365. Epub 2005 Mar 3.

Knowledge-based computational mutagenesis for predicting the disease potential of human non-synonymous single nucleotide polymorphisms.

J Theor Biol. 2010 Oct 21;266(4):560-8. doi: 10.1016/j.jtbi.2010.07.026. Epub 2010 Jul 23.

Accurate prediction of stability changes in protein mutants by combining machine learning with structure based computational mutagenesis.

Bioinformatics. 2008 Sep 15;24(18):2002-9. doi: 10.1093/bioinformatics/btn353. Epub 2008 Jul 16.

Prediction of RNA-binding residues in proteins from primary sequence using an enriched random forest model with a novel hybrid feature.

Proteins. 2011 Apr;79(4):1230-9. doi: 10.1002/prot.22958. Epub 2011 Jan 25.

A bioinformatics approach for the phenotype prediction of nonsynonymous single nucleotide polymorphisms in human cytochromes P450.

Drug Metab Dispos. 2009 May;37(5):977-91. doi: 10.1124/dmd.108.026047. Epub 2009 Feb 9.

Statistical geometry approach to the study of functional effects of human nonsynonymous SNPs.

Hum Mutat. 2005 Nov;26(5):471-6. doi: 10.1002/humu.20238.

Accurate prediction of enzyme mutant activity based on a multibody statistical potential.

Bioinformatics. 2007 Dec 1;23(23):3155-61. doi: 10.1093/bioinformatics/btm509. Epub 2007 Oct 31.

Finding new structural and sequence attributes to predict possible disease association of single amino acid polymorphism (SAP).

Bioinformatics. 2007 Jun 15;23(12):1444-50. doi: 10.1093/bioinformatics/btm119. Epub 2007 Mar 24.

Prediction of deleterious functional effects of amino acid mutations using a library of structure-based function descriptors.

Proteins. 2003 Dec 1;53(4):806-16. doi: 10.1002/prot.10458.

Cited by

Assigning function to natural allelic variation via dynamic modeling of gene network induction.

Mol Syst Biol. 2018 Jan 15;14(1):e7803. doi: 10.15252/msb.20177803.

Revealing the Effects of Missense Mutations Causing Snyder-Robinson Syndrome on the Stability and Dimerization of Spermine Synthase.

Int J Mol Sci. 2016 Jan 8;17(1):77. doi: 10.3390/ijms17010077.

GESPA: classifying nsSNPs to predict disease association.

BMC Bioinformatics. 2015 Jul 25;16:228. doi: 10.1186/s12859-015-0673-2.

Analysis of genetic variation and potential applications in genome-scale metabolic modeling.

Front Bioeng Biotechnol. 2015 Feb 16;3:13. doi: 10.3389/fbioe.2015.00013. eCollection 2015.

Determining effects of non-synonymous SNPs on protein-protein interactions using supervised and semi-supervised learning.

PLoS Comput Biol. 2014 May 1;10(5):e1003592. doi: 10.1371/journal.pcbi.1003592. eCollection 2014 May.

Functional hot spots in human ATP-binding cassette transporter nucleotide binding domains.

Protein Sci. 2010 Nov;19(11):2110-21. doi: 10.1002/pro.491.
