Matukumalli Lakshmi K, Grefenstette John J, Hyten David L, Choi Ik-Young, Cregan Perry B, Van Tassell Curtis P
Beltsville Agricultural Research Center, Bovine Functional Genomics Laboratory, US Department of Agriculture, ARS, Beltsville, MD 20705, USA.
BMC Bioinformatics. 2006 Jan 6;7:4. doi: 10.1186/1471-2105-7-4.
Single nucleotide polymorphisms (SNP) constitute more than 90% of genetic variation, and hence can account for most trait differences among individuals in a given species. The polymorphism detection programs PolyBayes and PolyPhred produce many false-positive SNP predictions, even with stringent parameter values. We developed a machine learning (ML) method to augment PolyBayes and improve its prediction accuracy. ML methods have also been applied successfully to other bioinformatics problems, including the prediction of genes, promoters, transcription factor binding sites and protein structures.
The ML program C4.5 was applied to a set of features to build a SNP classifier from training data labeled by human experts (True/False). The training data were 27,275 candidate SNP, generated by sequencing 1973 sequence-tagged sites (STS) (12 Mb) in both directions from 6 diverse homozygous soybean cultivars and analyzing the reads with PolyBayes. Test data of 18,390 candidate SNP were generated in the same way from 1359 additional STS (8 Mb). SNP from both sets were classified by experts. After training, the ML classifier agreed with the experts on 97.3% of the test data, compared with 7.8% agreement between PolyBayes and the experts. The PolyBayes positive predictive values (PPV) (i.e., the fraction of candidate SNP that are real) were 7.8% for all predictions and 16.7% for those with a 100% posterior probability of being real. Using ML improved the PPV to 84.8%, a 5- to 10-fold increase. While ML and PolyBayes produced a similar number of true positives, the ML program generated only 249 false positives, compared with 16,955 for PolyBayes. The complexity of the soybean genome may have contributed to the high rate of false SNP predictions by PolyBayes, and hence results may differ for other genomes.
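The PPV figures above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative and not part of the published software (which is written in Perl). It assumes that every test candidate counts as a PolyBayes positive call, so that PolyBayes true positives equal candidates minus false positives, and it back-calculates the ML true-positive count from the reported PPV and false-positive count. Both derived counts are inferences from the reported figures, not numbers stated in the abstract.

```python
def ppv(true_pos: int, false_pos: int) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)


# Figures reported in the abstract for the test set.
TEST_CANDIDATES = 18_390      # candidate SNP in the test data
POLYBAYES_FP = 16_955         # PolyBayes false positives
ML_FP = 249                   # ML classifier false positives
ML_PPV_REPORTED = 0.848       # PPV after ML filtering
POLYBAYES_PPV_STRICT = 0.167  # PolyBayes PPV at 100% posterior probability

# Assumption: every candidate is a PolyBayes positive call,
# so its true positives are the candidates that are not false positives.
polybayes_tp = TEST_CANDIDATES - POLYBAYES_FP
polybayes_ppv = ppv(polybayes_tp, POLYBAYES_FP)

# Inferred, not reported: solve PPV = TP / (TP + FP) for TP.
ml_tp_implied = round(ML_PPV_REPORTED * ML_FP / (1 - ML_PPV_REPORTED))

# Fold improvement in PPV relative to each PolyBayes baseline.
fold_vs_all = ML_PPV_REPORTED / polybayes_ppv
fold_vs_strict = ML_PPV_REPORTED / POLYBAYES_PPV_STRICT

if __name__ == "__main__":
    print(f"PolyBayes TP (derived): {polybayes_tp}")
    print(f"PolyBayes PPV: {polybayes_ppv:.3f}")
    print(f"ML TP (implied): {ml_tp_implied}")
    print(f"Fold vs all predictions: {fold_vs_all:.1f}")
    print(f"Fold vs 100% posterior: {fold_vs_strict:.1f}")
```

Under these assumptions the derived PolyBayes PPV matches the reported 7.8%, the two true-positive counts are close to each other, and the two ratios bracket the reported 5- to 10-fold range.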
A machine learning (ML) method was developed as a supplement to polymorphism detection software to improve prediction accuracy. The results of this study indicate that a trained ML classifier can significantly reduce human intervention, in this case yielding a 5- to 10-fold gain in productivity. The optimized feature set and ML framework can also be applied to all polymorphism discovery software. The ML support software is written in Perl and can be easily integrated into an existing SNP discovery pipeline.