Huang Hsin-Hsiung, Yu Chenglong, Zheng Hui, Hernandez Troy, Yau Shek-Chung, He Rong Lucy, Yang Jie, Yau Stephen S-T
Department of Statistics, University of Central Florida, Orlando, FL 32816, USA.
Mind-Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, South Australia 5000, Australia.
Mol Phylogenet Evol. 2014 Dec;81:29-36. doi: 10.1016/j.ympev.2014.08.003. Epub 2014 Aug 27.
We recently developed an alignment-free computational approach in a vector space for genome-based virus classification. This approach, called the "Natural Vector (NV) representation", allows us to classify single-segmented viruses with high speed and accuracy. For multiple-segmented viruses, a phylogenetic tree is typically reconstructed for each segment in order to infer viral phylogeny, and consensus tree methods may then be used to combine the trees obtained from different segments. However, consensus tree methods were not developed for cases where the viruses have different numbers of segments or where their segments do not match well. We propose a novel approach for comparing multiple-segmented viruses globally, even when the viruses contain different numbers of segments. In our method, each virus is represented by a set of vectors in R^12. The Hausdorff distance is then used to compare different sets of vectors, and phylogenetic trees can be reconstructed based on this distance. The proposed method is used to predict classification labels of viruses with n segments (n ≥ 1). The correctness rates of our predictions based on cross-validation are as high as 96.5%, 95.4%, 99.7%, and 95.6% for Baltimore class, family, subfamily, and genus, respectively, which are comparable to the rates obtained for single-segmented viruses only. Our method is not affected by the number or order of segments. We also demonstrate that, for a recent public health threat, the influenza A (H7N9) viruses, the natural graphical representation based on the Hausdorff distance is more reasonable than the consensus tree.
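The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the components of the natural vector, so the code below assumes the commonly cited 12-dimensional form: for each nucleotide, its count, its mean position, and its normalized second central moment. It also assumes one vector per segment and Euclidean distance between vectors. The function names (`natural_vector`, `virus_vectors`, `hausdorff`) are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
import math

NUCLEOTIDES = "ACGT"


def natural_vector(seq):
    """Assumed 12-dimensional natural vector of one segment:
    (n_A, n_C, n_G, n_T, mu_A, ..., mu_T, D2_A, ..., D2_T),
    using 1-based positions. A nucleotide that never occurs gets zeros."""
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in NUCLEOTIDES:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu = sum(positions) / n_k
        # Second central moment, normalized by n_k * n (assumed definition).
        d2 = sum((p - mu) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu)
        moments.append(d2)
    return counts + means + moments


def virus_vectors(segments):
    """Represent a virus as a set of vectors, one per segment (assumption)."""
    return [natural_vector(s) for s in segments]


def directed_hausdorff(A, B):
    """Largest distance from a point of A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite, non-empty point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


# Toy sequences (made up for illustration, not real viral genomes).
virus_a = virus_vectors(["ACGTACGT", "GGCCAATT"])
virus_b = virus_vectors(["ACGTACGA", "GGCCAATT", "TTTTAAAA"])
print(hausdorff(virus_a, virus_b))
```

Because the Hausdorff distance is defined on sets, it accepts viruses with different numbers of segments and gives the same value when the segments are listed in a different order, which matches the invariance stated in the abstract.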