Prillo Sebastian, Ravoor Akshay, Yosef Nir, Song Yun S
Computer Science Division, University of California, Berkeley, USA.
Department of Systems Immunology, Weizmann Institute of Science, Israel.
Syst Biol. 2025 Aug 8. doi: 10.1093/sysbio/syaf054.
Branch length estimation is a fundamental problem in statistical phylogenetics and a core component of tree reconstruction algorithms. Traditionally, general time-reversible mutation models are employed, and many software tools exist for this scenario. With the advent of CRISPR/Cas9 lineage tracing technologies, there has been significant interest in the study of branch length estimation under irreversible mutation models. Under the CRISPR/Cas9 mutation model, irreversible mutations, in the form of DNA insertions or deletions, accrue during the experiment and are then read out at the single-cell level to reconstruct the cell lineage tree. However, most analyses of CRISPR/Cas9 lineage tracing data have so far been limited to the reconstruction of single-cell tree topologies, which depict lineage relationships between cells, but not the amount of time that has passed between ancestral cell states and the present. Time-resolved trees, known as chronograms, would allow one to study the evolutionary dynamics of cell populations at an unprecedented level of resolution. Indeed, time-resolved trees would reveal the timing of events on the tree, the relative fitness of subclones, and the dynamics underlying phenotypic changes in the cell population, among other important applications. In this work, we introduce the first scalable and accurate method to refine any given single-cell tree topology into a single-cell chronogram by estimating its branch lengths. To do this, we perform regularized maximum likelihood estimation under a general irreversible mutation model, paired with a conservative version of maximum parsimony that reconstructs only the ancestral states that we are confident about. To deal with the particularities of CRISPR/Cas9 lineage tracing data, such as double-resection events, which affect runs of consecutive sites, we avoid making our model more complex and instead opt for a simple but effective data encoding scheme.
Similarly, we avoid explicitly modeling the missing data mechanisms, such as heritable missing data, and instead assume that the data are missing completely at random. We stabilize estimates in low-information regimes with a simple penalized version of maximum likelihood estimation (MLE) that uses a minimum branch length constraint and pseudocounts. All this leads to a convex MLE problem that can be readily solved in seconds with off-the-shelf convex optimization solvers. We benchmark our method using both simulations and real lineage tracing data, and show that it performs well on several tasks, matching or outperforming competing methods such as TiDeTree and LAML in terms of accuracy, while being 10 to 100× faster. Notably, our statistical model is simpler and more general, as we do not explicitly model the intricacies of CRISPR/Cas9 lineage tracing data. In this sense, our contribution is twofold: (1) a fast and robust method for branch length estimation under a general irreversible mutation model, and (2) a data encoding scheme specific to CRISPR/Cas9 lineage tracing data that makes such data amenable to the general model. Our branch length estimation method, which we call 'ConvexML', should be broadly applicable to any evolutionary model with irreversible mutations (ideally, with high diversity) and an approximately ignorable missing data mechanism. 'ConvexML' is available through the convexml open-source Python package.
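To give a feel for why pseudocounts and a minimum branch length keep the estimate stable, the following is a minimal toy sketch for a single branch. It is not the ConvexML implementation and does not use the convexml package. It assumes each site mutates irreversibly at unit rate, so a site survives a branch of length t unmutated with probability exp(-t). The function names, the default pseudocount, and the bounds are hypothetical choices made for illustration. Under these assumptions the penalized log-likelihood is concave in t, so a simple bracketing search finds its maximum.

```python
import math


def branch_log_likelihood(t, n_mutated, n_unmutated, pseudocount=0.0):
    """Toy penalized log-likelihood of one branch of length t.

    Assumption (not from the paper): every site mutates irreversibly at
    rate 1, so P(no mutation) = exp(-t) and P(mutation) = 1 - exp(-t).
    The pseudocount adds the same fictitious amount to both outcomes.
    """
    k = n_mutated + pseudocount
    u = n_unmutated + pseudocount
    return -u * t + k * math.log1p(-math.exp(-t))


def estimate_branch_length(n_mutated, n_unmutated, pseudocount=0.5,
                           t_min=0.01, t_max=20.0, iters=100):
    """Maximize the concave toy objective on [t_min, t_max] by ternary search."""
    lo, hi = t_min, t_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        f1 = branch_log_likelihood(m1, n_mutated, n_unmutated, pseudocount)
        f2 = branch_log_likelihood(m2, n_mutated, n_unmutated, pseudocount)
        if f1 < f2:
            lo = m1  # the maximum lies to the right of m1
        else:
            hi = m2  # the maximum lies to the left of m2
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # 3 of 10 sites mutated along the branch.
    print(estimate_branch_length(3, 7))
    # No mutations and no pseudocount: the estimate falls back to t_min.
    print(estimate_branch_length(0, 10, pseudocount=0.0))
```

In this toy setting the unconstrained maximizer has the closed form t = log(1 + k/u), where k and u are the pseudocount-adjusted numbers of mutated and unmutated sites, so the search can be checked against it. The method in the abstract estimates all branch lengths of a tree jointly with a convex solver, which this single-branch sketch does not attempt to reproduce.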