Section for Parasitology (SWEPAR), Swedish University of Agricultural Sciences, Uppsala, Sweden.
Mol Biol Evol. 2010 May;27(5):1044-57. doi: 10.1093/molbev/msp309. Epub 2009 Dec 24.
Exploratory data analysis (EDA) is a frequently undervalued part of data analysis in biology. It involves evaluating the characteristics of the data before proceeding to the definitive analysis in relation to the scientific question at hand. For phylogenetic analyses, a useful tool for EDA is a data-display network. This type of network is designed to display any character (or tree) conflict in a data set, without prior assumptions about the causes of those conflicts. The conflicts might be caused by 1) methodological issues in data collection or analysis, 2) homoplasy, or 3) horizontal gene flow of some sort. Here, I explore 13 published data sets using splits networks, as examples of using data-display networks for EDA. In each case, I performed an original EDA on the data provided, to highlight the aspects of the resulting network that will be important for an interpretation of the phylogeny. In each case, there is at least one important point (possibly missed by the original authors) that might affect the phylogenetic analysis. I conclude that EDA should play a greater role in phylogenetic analyses than it has to date.
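To make the notion of character conflict concrete, here is a minimal sketch of the standard pairwise compatibility check for binary characters: two splits of the taxon set are compatible if at least one of the four intersections of their sides is empty. This is not the splits-network method used in the paper, only an illustration of the kind of conflict such a network displays. The function names and the toy matrix are hypothetical and chosen for illustration.

```python
from itertools import combinations


def split_of(character):
    """Return the set of taxon indices scored '1' for one binary character."""
    return frozenset(i for i, state in enumerate(character) if state == "1")


def compatible(a, b, n_taxa):
    """Two splits are compatible if any of the four side intersections is empty."""
    all_taxa = frozenset(range(n_taxa))
    a_c, b_c = all_taxa - a, all_taxa - b
    return any(not (x & y) for x in (a, a_c) for y in (b, b_c))


def conflicting_pairs(characters):
    """List index pairs of characters whose splits cannot fit on a single tree."""
    n_taxa = len(characters[0])
    splits = [split_of(c) for c in characters]
    return [
        (i, j)
        for i, j in combinations(range(len(splits)), 2)
        if not compatible(splits[i], splits[j], n_taxa)
    ]


# Toy data (hypothetical): each string is one character scored across 4 taxa.
toy = ["1100", "1010", "1110"]
print(conflicting_pairs(toy))
```

In the toy matrix, the first two characters group the taxa in mutually exclusive ways, so they are reported as a conflicting pair. A data-display network would draw such a pair as a box rather than forcing one grouping to win.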