Legried Brandon, Terhorst Jonathan
School of Mathematics, Georgia Institute of Technology, 686 Cherry Street, Atlanta, 30332, GA, USA.
Department of Statistics, University of Michigan, 1085 S. University Ave, Ann Arbor, 48109, MI, USA.
J Theor Biol. 2023 Jul 7;568:111520. doi: 10.1016/j.jtbi.2023.111520. Epub 2023 May 4.
Recent theoretical work on phylogenetic birth-death models offers differing viewpoints on whether they can be estimated using lineage-through-time data. Louca and Pennell (2020) showed that the class of models with continuously differentiable rate functions is nonidentifiable: any such model is consistent with an infinite collection of alternative models, which are statistically indistinguishable regardless of how much data are collected. Legried and Terhorst (2022) qualified this grave result by showing that identifiability is restored if only piecewise constant rate functions are considered. Here, we contribute new theoretical results to this discussion, in both the positive and negative directions. Our main result is to prove that models based on piecewise polynomial rate functions of any order and with any (finite) number of pieces are statistically identifiable. In particular, this implies that spline-based models with an arbitrary number of knots are identifiable. The proof is simple and self-contained, relying mainly on basic algebra. We complement this positive result with a negative one, which shows that even when identifiability holds, rate function estimation is still a difficult problem. To illustrate this, we prove some rates-of-convergence results for hypothesis testing using birth-death models. These results are information-theoretic lower bounds which apply to all potential estimators.
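To make the model class in the main result concrete, the following is a minimal sketch of how a piecewise polynomial rate function might be represented and evaluated. It is not taken from the paper: the function name `piecewise_poly_rate`, the breakpoint/coefficient layout, and the example values are all assumptions made for illustration. Piecewise constant rates correspond to degree-zero pieces, and spline-based models are the special case in which adjacent pieces are constrained to join smoothly at the knots.

```python
from bisect import bisect_right

def piecewise_poly_rate(t, breakpoints, coeffs):
    """Evaluate a piecewise polynomial rate function at time t.

    Hypothetical layout (an assumption of this sketch, not the paper's notation):
      breakpoints: increasing list [t_0, ..., t_K] delimiting K pieces
      coeffs:      list of K coefficient lists, highest degree first,
                   each in the local variable s = t - t_k
    """
    # Locate the piece containing t; clamp so that t outside
    # [t_0, t_K] (or exactly at t_K) uses the first or last piece.
    k = min(max(bisect_right(breakpoints, t) - 1, 0), len(coeffs) - 1)
    s = t - breakpoints[k]
    # Horner's rule for the polynomial on piece k.
    value = 0.0
    for c in coeffs[k]:
        value = value * s + c
    return value

# Illustrative values only: constant rate 1 on [0, 1),
# then a linear piece 2*(t - 1) + 1 on [1, 2].
breakpoints = [0.0, 1.0, 2.0]
coeffs = [[1.0], [2.0, 1.0]]
rates = [piecewise_poly_rate(t, breakpoints, coeffs) for t in (0.5, 1.0, 1.5, 2.0)]
```

A birth or death rate must stay positive, which this sketch does not enforce; any real parameterization would need to check or constrain that separately.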