School of Biotechnology, East China University of Science and Technology, Shanghai 200237, China.
Biochem Biophys Res Commun. 2012 Mar 2;419(1):99-103. doi: 10.1016/j.bbrc.2012.01.138. Epub 2012 Feb 4.
Many non-synonymous SNPs (nsSNPs) are associated with diseases, and numerous machine learning methods have been used to train classifiers that separate disease-associated nsSNPs from neutral ones. The steadily accumulating nsSNP data make it possible to explore better prediction approaches. In this work, we partitioned the training data into 20 subsets according to either the original or the substituted amino acid type at the nsSNP site. Training a support vector machine (SVM) classification model on each subset gave an overall accuracy of 76.3% or 74.9%, depending on which of the two partition criteria was used, whereas training on the whole dataset gave an accuracy of only 72.6%. When the dataset was instead divided randomly into 20 subsets, the corresponding accuracy was only 73.2%. These results demonstrate that partitioning the training dataset appropriately, i.e., according to the residue type at the nsSNP site, significantly improves the performance of the trained classifiers, which should be valuable in developing better tools for predicting the disease association of nsSNPs.
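The partition-then-train scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the record layout, the function names (`partition`, `train_per_subset`, `overall_accuracy`) and the fallback model for residues unseen in training are all hypothetical, and a trivial majority-label trainer stands in for the SVM so that the sketch runs with the standard library alone. Pooling correct predictions over all test records is likewise an assumption about how the "overall accuracy" is computed.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical record layout for one nsSNP:
# (original residue, substituted residue, feature vector, label)
# with label 1 = disease-associated and 0 = neutral.
Record = Tuple[str, str, Sequence[float], int]
Model = Callable[[Sequence[float]], int]
Trainer = Callable[[List[Record]], Model]


def _key_index(by: str) -> int:
    """Map a partition criterion to the position of the residue in a record."""
    if by not in ("original", "substituted"):
        raise ValueError("by must be 'original' or 'substituted'")
    return 0 if by == "original" else 1


def partition(records: List[Record], by: str = "original") -> Dict[str, List[Record]]:
    """Group records by residue type at the nsSNP site (up to 20 subsets)."""
    index = _key_index(by)
    subsets: Dict[str, List[Record]] = defaultdict(list)
    for rec in records:
        subsets[rec[index]].append(rec)
    return dict(subsets)


def train_majority(train: List[Record]) -> Model:
    """Placeholder trainer (NOT an SVM): always predicts the majority label.

    Ties are resolved in favour of label 1. Swap in a real SVM trainer
    with the same signature to approximate the scheme in the abstract.
    """
    positives = sum(rec[3] for rec in train)
    label = 1 if 2 * positives >= len(train) else 0
    return lambda features: label


def train_per_subset(train: List[Record], by: str, trainer: Trainer) -> Dict[str, Model]:
    """Train one model per residue-specific subset."""
    return {res: trainer(subset) for res, subset in partition(train, by).items()}


def overall_accuracy(models: Dict[str, Model], fallback: Model,
                     test: List[Record], by: str) -> float:
    """Pooled accuracy: each test record is scored by its own subset's model.

    Records whose residue has no trained model use `fallback`
    (an assumption of this sketch, e.g. a model trained on all data).
    """
    if not test:
        raise ValueError("test set is empty")
    index = _key_index(by)
    correct = 0
    for rec in test:
        model = models.get(rec[index], fallback)
        correct += int(model(rec[2]) == rec[3])
    return correct / len(test)
```

To compare the two criteria, one would call `train_per_subset(train, "original", trainer)` and `train_per_subset(train, "substituted", trainer)` on the same training records and pass each result to `overall_accuracy` with a held-out test set. Passing the trainer as a parameter keeps the partition logic independent of the classifier, so the same code can score the whole-dataset baseline by using the fallback model alone.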