Adams Dean C, Collyer Michael L
Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA, USA.
Department of Statistics, Iowa State University, Ames, IA, USA.
Syst Biol. 2018 Jan 1;67(1):14-31. doi: 10.1093/sysbio/syx055.
Recent years have seen increased interest in phylogenetic comparative analyses of multivariate data sets, but to date the varied proposed approaches have not been extensively examined. Here we review the mathematical properties required of any multivariate method, and specifically evaluate existing multivariate phylogenetic comparative methods in this context. Phylogenetic comparative methods based on the full multivariate likelihood are robust to levels of covariation among trait dimensions and are insensitive to the orientation of the data set, but display increasing model misspecification as the number of trait dimensions increases. This is because the expected evolutionary covariance matrix (V) used in the likelihood calculations becomes more ill-conditioned as trait dimensionality increases, and as evolutionary models become more complex. Thus, these approaches are only appropriate for data sets with few traits and many species. Methods that summarize patterns across trait dimensions treated separately (e.g., SURFACE) incorrectly assume independence among trait dimensions, resulting in nearly a 100% model misspecification rate. Methods using pairwise composite likelihood are highly sensitive to levels of trait covariation, the orientation of the data set, and the number of trait dimensions. The consequences of these debilitating deficiencies are that a user can arrive at differing statistical conclusions, and therefore biological inferences, simply from a dataspace rotation, like principal component analysis. By contrast, algebraic generalizations of the standard phylogenetic comparative toolkit that use the trace of covariance matrices are insensitive to levels of trait covariation, the number of trait dimensions, and the orientation of the data set. Further, when appropriate permutation tests are used, these approaches display acceptable Type I error and statistical power. 
We conclude that methods summarizing information across trait dimensions, as well as pairwise composite likelihood methods, should be avoided, whereas algebraic generalizations of the phylogenetic comparative toolkit provide a useful means of assessing macroevolutionary patterns in multivariate data. Finally, we discuss areas in which multivariate phylogenetic comparative methods still need development, namely highly multivariate Ornstein-Uhlenbeck models and approaches for multivariate evolutionary model comparisons.
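The rotation argument in the abstract can be illustrated numerically. The sketch below is not the authors' code: the matrix `C` is a hypothetical stand-in for a phylogenetic covariance matrix (any symmetric positive-definite matrix serves the algebra), the trait data are simulated, and the function names are invented for illustration. It assumes Brownian motion, where the maximized full multivariate likelihood depends on the data through log det(R), while likelihoods summed over trait dimensions treated separately depend only on the diagonal of R. The trace and log-determinant of the estimated rate matrix R should be unchanged by an orthogonal transformation of the dataspace, whereas the per-dimension summary should not.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 5  # number of species and trait dimensions (illustrative values)

# Hypothetical stand-in for the phylogenetic covariance matrix C:
# any symmetric positive-definite matrix is enough for the algebra.
A = rng.normal(size=(n, n))
C = A @ A.T / n + np.eye(n)
C_inv = np.linalg.inv(C)

# Simulated trait matrix with covariation among trait dimensions.
mix = rng.normal(size=(p, p))
X = rng.normal(size=(n, p)) @ mix


def rate_matrix(X, C_inv):
    """GLS estimate of the p x p evolutionary rate matrix R."""
    n_species = X.shape[0]
    ones = np.ones((n_species, 1))
    # GLS estimate of the root state, one value per trait dimension (1 x p).
    root = (ones.T @ C_inv @ X) / (ones.T @ C_inv @ ones)
    E = X - ones @ root
    return E.T @ C_inv @ E / (n_species - 1)


def summaries(R):
    """Three scalar summaries of R used by different families of methods."""
    trace = np.trace(R)                         # trace-based generalizations
    logdet = np.linalg.slogdet(R)[1]            # full multivariate likelihood
    sum_log_diag = np.sum(np.log(np.diag(R)))   # dimensions treated separately
    return trace, logdet, sum_log_diag


# Random orthogonal matrix (a rotation, possibly combined with a reflection).
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))

before = summaries(rate_matrix(X, C_inv))
after = summaries(rate_matrix(X @ Q, C_inv))

labels = ("trace(R)", "log det(R)", "sum log diag(R)")
for name, b, a in zip(labels, before, after):
    print(f"{name:>16}: original={b:.4f}  rotated={a:.4f}")
```

Because the rotated rate matrix equals Q'RQ, its trace and determinant match those of R exactly, while its diagonal entries generally do not. That is the sense in which a summary built from separately treated dimensions can change under a dataspace rotation such as principal component analysis.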