Huang Hsin-Hsiung, Yu Chenglong
Department of Statistics, University of Central Florida, Orlando, FL 32816, USA.
Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia; School of Medicine, Flinders University, Adelaide, SA 5001, Australia.
J Theor Biol. 2016 Oct 7;406:61-72. doi: 10.1016/j.jtbi.2016.06.029. Epub 2016 Jun 29.
The alignment-free n-gram method, which uses the out-of-place measure as its distance, has been applied successfully to real-time automatic categorization of text and natural languages. However, its performance on genome sequence comparison, and how n should be chosen for that task, remain unclear. Here we propose a symmetric version of the out-of-place measure and a new approach for finding the optimal range of n for constructing a phylogenetic tree with the symmetric measure. We then apply the method to real genome sequence datasets, and the resulting phylogenetic trees agree with the standard biological classification. These results indicate that the proposed method is a powerful tool for phylogenetic analysis in terms of both classification accuracy and computational efficiency.
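As an illustration of the idea in the abstract, the following is a minimal Python sketch of an out-of-place distance between ranked n-gram profiles, in the style used for n-gram text categorization. The abstract does not define the symmetric version, so the symmetrization here (adding the two one-directional distances) is an assumption, as are the function names, the alphabetical tie-breaking rule, and the penalty assigned to n-grams missing from the other profile. It is not the authors' implementation.

```python
from collections import Counter
from typing import Dict, Optional


def ngram_profile(seq: str, n: int, top: Optional[int] = None) -> Dict[str, int]:
    """Map each n-gram of seq to its frequency rank (0 = most frequent)."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Sort by descending count, then alphabetically so ties are deterministic
    # (the tie-breaking rule is an assumption of this sketch).
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    if top is not None:
        ranked = ranked[:top]  # optionally keep only the top-ranked n-grams
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(profile_a: Dict[str, int], profile_b: Dict[str, int]) -> int:
    """One-directional distance: total rank displacement of A's n-grams in B."""
    # Assumed penalty for an n-gram of A that does not occur in B.
    penalty = max(len(profile_a), len(profile_b))
    return sum(
        abs(rank - profile_b[gram]) if gram in profile_b else penalty
        for gram, rank in profile_a.items()
    )


def symmetric_out_of_place(profile_a: Dict[str, int], profile_b: Dict[str, int]) -> int:
    """Assumed symmetrization: sum of the two one-directional distances."""
    return out_of_place(profile_a, profile_b) + out_of_place(profile_b, profile_a)


if __name__ == "__main__":
    a = ngram_profile("ACGTACGT", n=2)
    b = ngram_profile("ACGTTGCA", n=2)
    # The one-directional measure generally differs by direction;
    # the symmetrized value does not depend on argument order.
    print(out_of_place(a, b), out_of_place(b, a), symmetric_out_of_place(a, b))
```

Pairwise values computed this way for a set of genomes could be collected into a distance matrix and passed to a standard tree-building method. The choice of n is left as a free parameter here, since the abstract does not describe how its optimal range is found.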