Bao Lei, Cui Yan
Department of Molecular Sciences, Center of Genomics and Bioinformatics, University of Tennessee Health Science Center, 858 Madison Avenue, Memphis, TN 38163, USA.
Bioinformatics. 2005 May 15;21(10):2185-90. doi: 10.1093/bioinformatics/bti365. Epub 2005 Mar 3.
There has been great expectation that the knowledge of an individual's genotype will provide a basis for assessing susceptibility to diseases and designing individualized therapy. Non-synonymous single nucleotide polymorphisms (nsSNPs) that lead to an amino acid change in the protein product are of particular interest because they account for nearly half of the known genetic variations related to human inherited diseases. To facilitate the identification of disease-associated nsSNPs from a large number of neutral nsSNPs, it is important to develop computational tools to predict the phenotypic effects of nsSNPs.
We prepared a training set based on the variant phenotypic annotation of the Swiss-Prot database and focused our analysis on nsSNPs having homologous 3D structures. Structural environment parameters derived from the 3D homologous structure as well as evolutionary information derived from the multiple sequence alignment were used as predictors. Two machine learning methods, support vector machine and random forest, were trained and evaluated. We compared the performance of our method with that of the SIFT algorithm, which is one of the best predictive methods to date. An unbiased evaluation study shows that for nsSNPs with sufficient evolutionary information (at least 10 homologous sequences), the performance of our method is comparable with that of the SIFT algorithm, while for nsSNPs with insufficient evolutionary information (fewer than 10 homologous sequences), our method significantly outperforms the SIFT algorithm. These findings indicate that incorporating structural information is critical to achieving good prediction accuracy when sufficient evolutionary information is not available.
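The stratified evaluation described above can be illustrated with a minimal sketch. It assumes each test case records the number of homologous sequences, the annotated label, and a classifier's prediction, and it reports accuracy separately for the two groups split at the cutoff of 10 stated in the abstract. The `Prediction` record, its field names, and the `accuracy_by_evidence` function are hypothetical and chosen for illustration; they are not taken from the authors' code, and accuracy is only one of several measures such a study might report.

```python
from dataclasses import dataclass

# Cutoff from the abstract: at least 10 homologous sequences counts as
# "sufficient" evolutionary information.
HOMOLOG_THRESHOLD = 10


@dataclass
class Prediction:
    """One evaluated nsSNP (hypothetical record layout)."""
    n_homologs: int          # homologous sequences in the alignment
    is_disease: bool         # annotated label
    predicted_disease: bool  # classifier output


def accuracy_by_evidence(predictions, threshold=HOMOLOG_THRESHOLD):
    """Return accuracy for the 'sufficient' and 'insufficient' groups.

    A group with no cases maps to None rather than a misleading 0.0.
    """
    counts = {"sufficient": [0, 0], "insufficient": [0, 0]}  # [correct, total]
    for p in predictions:
        key = "sufficient" if p.n_homologs >= threshold else "insufficient"
        counts[key][1] += 1
        if p.is_disease == p.predicted_disease:
            counts[key][0] += 1
    return {
        key: (correct / total if total else None)
        for key, (correct, total) in counts.items()
    }


if __name__ == "__main__":
    # Toy data for illustration only; not results from the study.
    toy = [
        Prediction(12, True, True),
        Prediction(15, False, True),
        Prediction(10, True, True),
        Prediction(3, True, True),
        Prediction(9, False, False),
    ]
    print(accuracy_by_evidence(toy))
```

Running the same function on the predictions of two methods over the same test cases would give the kind of side-by-side comparison the abstract describes, with the groups kept separate so that a difference in the low-information group is not averaged away.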
The code and curated dataset are available at http://compbio.utmem.edu/snp/dataset/