SAMM (Statistique, Analyse et Modélisation Multidisciplinaire), EA 4543, Université Paris 1 Panthéon Sorbonne, 90 rue de Tolbiac, Paris, 75013, France.
Mol Ecol Resour. 2013 Nov;13(6):976-90. doi: 10.1111/1755-0998.12047. Epub 2013 Jan 3.
Developing tools for visualizing DNA sequences is an important issue in the Barcoding context. Visualizing Barcode data can be framed as a purely statistical problem of unsupervised learning. Clustering methods combined with projection methods have two closely linked objectives: visualizing the data and finding structure in it. Multidimensional scaling (MDS) and self-organizing maps (SOM) are unsupervised statistical tools for data visualization. Both algorithms map data onto a lower-dimensional manifold: MDS looks for a projection that best preserves pairwise distances, while SOM preserves the topology of the data. Both algorithms were initially developed for Euclidean data, and the conditions they require are not satisfied by Barcode data. We developed a workflow consisting of four steps: collapse the data into distinct sequences; compute a dissimilarity matrix; run a modified version of SOM for dissimilarity matrices to structure the data and reduce dimensionality; project the results using MDS. This methodology was applied to Astraptes fulgerator and to Hylomyscus, a genus of African rodents with debated taxonomy. We obtained very good results for both data sets. The results were robust to unbalanced representation of species. All the species in Astraptes were displayed as clearly distinct groups in the various visualizations, except for LOHAMP and FABOV, which were mixed together. For Hylomyscus, our findings were consistent with known species, confirmed the existence of four unnamed taxa and suggested the existence of potentially new species.
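As a rough illustration of the workflow described in the abstract, the following Python sketch covers steps 1, 2 and 4 only. It is not the authors' implementation. The function names (`collapse`, `p_distance`, `dissimilarity_matrix`, `classical_mds`) are hypothetical, the dissimilarity is assumed to be the proportion of differing sites between aligned sequences (the abstract does not say which measure was used), the MDS variant is assumed to be classical (Torgerson) scaling, and step 3 (the modified SOM for dissimilarity matrices) is left out entirely. Eigenvectors are obtained by plain power iteration to keep the example dependency-free.

```python
from collections import Counter
from math import sqrt


def collapse(sequences):
    """Step 1: keep each distinct sequence once, with its multiplicity."""
    counts = Counter(sequences)  # preserves first-seen order
    return list(counts), list(counts.values())


def p_distance(a, b):
    """Assumed dissimilarity: proportion of differing sites (aligned input)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def dissimilarity_matrix(haplotypes):
    """Step 2: symmetric matrix of pairwise dissimilarities."""
    n = len(haplotypes)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = p_distance(haplotypes[i], haplotypes[j])
    return d


def classical_mds(d, dims=2, iters=500):
    """Step 4 (assumed variant): classical MDS via power iteration."""
    n = len(d)
    sq = [[v * v for v in row] for row in d]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double-centred matrix B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [float(i + 1) for i in range(n)]  # deterministic start vector
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        if lam <= 1e-12:
            break  # no further positive eigenvalue to use
        for i in range(n):
            coords[i][k] = v[i] * sqrt(lam)
        # Deflate so the next pass finds the next eigenvector
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


if __name__ == "__main__":
    haplotypes, counts = collapse(["AAAA", "AAAA", "AAAT", "TTTT"])
    d = dissimilarity_matrix(haplotypes)
    for hap, n_copies, xy in zip(haplotypes, counts, classical_mds(d)):
        print(hap, n_copies, xy)
```

Power iteration converges to the eigenvalue of largest magnitude, so for strongly non-Euclidean dissimilarities this sketch may stop early with fewer than `dims` usable axes; a production version would more likely rely on a full eigendecomposition from a numerical library.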