Hassler Gabriel, Tolkoff Max R, Allen William L, Ho Lam Si Tung, Lemey Philippe, Suchard Marc A
Department of Biomathematics, David Geffen School of Medicine at UCLA, University of California, Los Angeles, United States.
Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California, Los Angeles, United States.
J Am Stat Assoc. 2022;117(538):678-692. doi: 10.1080/01621459.2020.1799812. Epub 2020 Sep 16.
Comparative biologists are often interested in inferring covariation between multiple biological traits sampled across numerous related taxa. To properly study these relationships, we must control for the shared evolutionary history of the taxa to avoid spurious inference. An additional challenge arises because obtaining a full suite of measurements becomes increasingly difficult as the number of taxa grows. This generally necessitates data imputation or integration, and existing control techniques typically scale poorly as the number of taxa increases. We propose an inference technique that integrates out missing measurements analytically and scales linearly with the number of taxa by using a post-order traversal algorithm under a multivariate Brownian diffusion (MBD) model to characterize trait evolution. We further exploit this technique to extend the MBD model to account for sampling error or non-heritable residual variance. We apply these methods to examine mammalian life history traits, prokaryotic genomic and phenotypic traits, and HIV infection traits. We find computational efficiency gains exceeding two orders of magnitude over current best practices. While we focus on the utility of this algorithm in phylogenetic comparative methods, our approach generalizes to solve long-standing challenges in computing the likelihood for matrix-normal and multivariate normal distributions with missing data at scale.
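To make the post-order idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single (univariate) trait rather than the full MBD model, a fixed known trait value at the root, strictly positive branch lengths, and tips that are either fully observed or fully missing. The names `Node`, `prune`, and `log_likelihood` are hypothetical. Each node is visited once, so the cost is linear in the number of taxa, and a missing tip simply contributes zero precision, which integrates it out analytically.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical tree node; `length` is the branch leading to its parent."""
    length: float = 0.0
    value: Optional[float] = None  # observed tip trait; None means missing
    children: List["Node"] = field(default_factory=list)


def log_normal_pdf(x: float, mean: float, var: float) -> float:
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def prune(node: Node, sigma2: float) -> Tuple[float, float, float]:
    """Post-order pass over the subtree rooted at `node`.

    Returns (mean, var, log_const) such that the likelihood of all data
    below `node`, viewed as a function of the node's own trait x, equals
    exp(log_const) * Normal(mean; x, var). var = inf means "no data below".
    """
    if not node.children:
        if node.value is None:
            return 0.0, math.inf, 0.0  # missing tip: flat in x
        return node.value, 0.0, 0.0    # observed tip: point mass
    mean, prec, log_const = 0.0, 0.0, 0.0
    for child in node.children:
        c_mean, c_var, c_log = prune(child, sigma2)
        log_const += c_log
        # Brownian diffusion along the branch adds variance sigma2 * length.
        # When c_var is inf the precision is exactly 0.0.
        c_prec = 1.0 / (c_var + sigma2 * child.length)
        if c_prec == 0.0:
            continue  # nothing observed below this child
        if prec > 0.0:
            # Multiplying two Gaussians in x leaves a normalising constant.
            log_const += log_normal_pdf(mean, c_mean, 1.0 / prec + 1.0 / c_prec)
        mean = (prec * mean + c_prec * c_mean) / (prec + c_prec)
        prec += c_prec
    var = 1.0 / prec if prec > 0.0 else math.inf
    return mean, var, log_const


def log_likelihood(root: Node, sigma2: float, root_value: float) -> float:
    """Log-likelihood of the observed tips given a fixed root trait value."""
    mean, var, log_const = prune(root, sigma2)
    if math.isinf(var):
        return log_const  # no observed data anywhere in the tree
    return log_const + log_normal_pdf(mean, root_value, var)


# Usage: tree ((A:1, B:1):1, C:2) with tip B missing.
tree = Node(children=[
    Node(length=1.0, children=[Node(length=1.0, value=1.0),
                               Node(length=1.0, value=None)]),
    Node(length=2.0, value=-0.5),
])
print(log_likelihood(tree, sigma2=1.0, root_value=0.0))
```

In the multivariate setting described in the abstract, the scalar precisions in this sketch would presumably be replaced by precision matrices so that individual missing trait entries can be handled; that extension is not shown here.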