Department of Bioengineering, University of Pennsylvania, USA.
Brief Bioinform. 2011 Jan;12(1):1-9. doi: 10.1093/bib/bbq008. Epub 2010 Mar 4.
Classifying mitochondrial DNA (mtDNA) into haplogroups helps address a range of anthropological and forensic questions. mtDNA is distinctive for its abundance and its non-recombining, uniparental mode of inheritance; consequently, mutations are the only changes observed in the genetic material. These mutations are classified into cladistic haplogroups, which allows different genetic branch points in the evolution of humans (and other organisms) to be traced. Because of the large number of samples, the classification process needs to be automated. Using 5-fold cross-validation, we investigated two classification techniques on the consented database of 21 141 samples published by the Genographic project. The support vector machine (SVM) algorithm achieved a macro-accuracy of 88.06% and a micro-accuracy of 96.59%, while the random forest (RF) algorithm achieved a macro-accuracy of 87.35% and a micro-accuracy of 96.19%. In addition to being faster and more memory-efficient in making predictions, SVM and RF are better than or comparable to the nearest-neighbor method employed by the Genographic project in terms of prediction accuracy.
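The abstract reports both macro- and micro-accuracy but does not define them. The sketch below assumes the common reading: micro-accuracy is the fraction of all samples classified correctly, and macro-accuracy is the unweighted mean of the per-haplogroup accuracies, so rare haplogroups count as much as common ones. The function names, the toy labels, and the contiguous fold splitter are hypothetical illustrations, not the authors' code or data.

```python
from collections import defaultdict


def micro_accuracy(y_true, y_pred):
    """Fraction of all samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_accuracy(y_true, y_pred):
    """Unweighted mean of per-class accuracy (assumed definition)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous folds.

    A real experiment would shuffle or stratify the samples first;
    this only shows how each sample is held out exactly once.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Toy, made-up labels: a common haplogroup "H" and a rare one "L3".
y_true = ["H"] * 8 + ["L3"] * 2
y_pred = ["H"] * 8 + ["L3", "H"]

print(micro_accuracy(y_true, y_pred))  # 0.9
print(macro_accuracy(y_true, y_pred))  # 0.75
```

With imbalanced classes, one error on a rare haplogroup barely moves the micro figure but pulls the macro figure down sharply. Under the assumed definitions, that is one plausible reason the reported macro-accuracies sit below the micro-accuracies.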