Statistical geometry based prediction of nonsynonymous SNP functional effects using random forest and neuro-fuzzy classifiers.

Author information

Barenboim Maxim, Masso Majid, Vaisman Iosif I, Jamison D Curtis

Affiliation

Department of Bioinformatics and Computational Biology, George Mason University, Manassas, Virginia 20110, USA.

Publication information

Proteins. 2008 Jun;71(4):1930-9. doi: 10.1002/prot.21838.

Abstract

There is substantial interest in methods designed to predict the effect of nonsynonymous single nucleotide polymorphisms (nsSNPs) on protein function, given their potential relationship to heritable diseases. Current state-of-the-art supervised machine learning algorithms, such as random forest (RF), train models that classify single amino acid mutations in proteins as either neutral or deleterious to function. However, it is frequently the case that the functional effect of a polymorphism on a protein resides between these two extremes. The utilization of classifiers that incorporate fuzzy logic provides a natural extension in order to account for the spectrum of possible functional consequences. We generated a dataset of single amino acid substitutions in human proteins having known three-dimensional structures. Each variant was uniquely represented as a feature vector that included computational geometry and knowledge-based statistical potential predictors obtained through application of Delaunay tessellation of protein structures. Additional attributes consisted of physicochemical properties of the native and replacement amino acids as well as topological location of the mutated residue position in the solved structure. Classification performance of the RF algorithm was evaluated on a training set consisting of the disease-associated and neutral nsSNPs taken from our dataset, and attributes were ranked according to their relative importance. Similarly, we evaluated the performance of an adaptive neuro-fuzzy inference system (ANFIS). The utility of statistical geometry predictors was compared with that of traditional structural and evolutionary attributes employed by other researchers, revealing an equally effective yet complementary methodology. Among all attributes in our feature set, the statistical geometry predictors were found to be the most highly ranked. On the basis of the AUC (area under the ROC curve) measure of performance, the ANFIS and RF models were equally effective when only statistical geometry features were utilized. Tenfold cross-validation studies evaluating AUC, balanced error rate (BER), and Matthews correlation coefficient (MCC) showed that our RF model was at least comparable with the well-established methods of SIFT and PolyPhen. The trained RF and ANFIS models were each subsequently used to predict the disease potential of human nsSNPs in our dataset that are currently unclassified (http://rna.gmu.edu/FuzzySnps/).
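For readers unfamiliar with the three performance measures named in the abstract, the following is a minimal sketch of how AUC, BER, and MCC can be computed for a binary neutral/deleterious classifier. It is not the authors' code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, and the label convention (1 = deleterious, 0 = neutral) is likewise assumed.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives (1 = deleterious, 0 = neutral; assumed convention)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_error_rate(tp, tn, fp, fn):
    """BER: mean of the false negative rate and the false positive rate."""
    fnr = fn / (tp + fn)
    fpr = fp / (tn + fp)
    return 0.5 * (fnr + fpr)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC: correlation between observed and predicted classes, in [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(y_true, scores):
    """AUC via the rank (Mann-Whitney) formulation: the fraction of
    positive/negative pairs in which the positive scores higher; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the paper).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
ber = balanced_error_rate(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
area = auc(y_true, scores)
```

BER and MCC depend on the chosen threshold, whereas AUC is computed from the raw scores and summarizes performance across all thresholds. This is one reason studies of this kind commonly report all three.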
