Maximum Likelihood Inference of Time-scaled Cell Lineage Trees with Mixed-type Missing Data.

Authors

Mai Uyen, Chu Gillian, Raphael Benjamin J

Affiliation

Department of Computer Science, Princeton University, Princeton, NJ 08544, USA.

Publication information

bioRxiv. 2024 Mar 23:2024.03.05.583638. doi: 10.1101/2024.03.05.583638.

Abstract

Recent dynamic lineage tracing technologies combine CRISPR-based genome editing with single-cell sequencing to track cell divisions during development. A key computational problem in dynamic lineage tracing is to infer a cell lineage tree from the measured CRISPR-induced mutations. Three features of dynamic lineage tracing data distinguish this problem from standard phylogenetic tree inference. First, the CRISPR-editing process modifies a genomic location exactly once. This property is not well described by the time-reversible models commonly used in phylogenetics. Second, as a consequence of non-modifiability, the number of mutations per time unit decreases over time. Third, CRISPR-based genome editing and single-cell sequencing result in high rates of both heritable and non-heritable (dropout) missing data. To model these features, we introduce the Probabilistic Mixed-type Missing (PMM) model. We describe an algorithm, LAML (Lineage Analysis via Maximum Likelihood), to search for the maximum likelihood (ML) tree under the PMM model. LAML combines an Expectation Maximization (EM) algorithm with a heuristic tree search to jointly estimate tree topology, branch lengths and missing data parameters. We derive a closed-form solution for the M-step in the case of no heritable missing data, and a block coordinate ascent approach in the general case which is more efficient than the standard General Time Reversible (GTR) phylogenetic model. On simulated data, LAML infers more accurate tree topologies and branch lengths than existing methods, with greater advantages on datasets with higher ratios of heritable to non-heritable missing data. We show that LAML provides unbiased estimates of branch lengths. In contrast, we demonstrate that maximum parsimony methods for lineage tracing data not only underestimate branch lengths, but also yield branch lengths that are not proportional to time, due to the nonlinear decay in the number of mutations on branches further from the root. On lineage tracing data from a mouse model of lung adenocarcinoma, we show that LAML infers phylogenetic distances that are more concordant with gene expression data than distances derived from maximum parsimony. The LAML tree topology is more plausible than existing published trees, with fewer total cell migrations between distant metastases and fewer reseeding events where cells migrate back to the primary tumor. Crucially, we identify three distinct time epochs of metastasis progression, including a burst of metastasis events to various anatomical sites during a single month.
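
The decay in mutations per unit time described in the abstract can be illustrated with a small sketch. It is not the PMM model or the LAML implementation. It assumes, purely for illustration, that each site is edited at most once and that its waiting time to the first edit is exponential with a constant rate. The function name `expected_new_edits` and the parameter values are hypothetical.

```python
import math


def expected_new_edits(num_sites: int, rate: float, t0: float, t1: float) -> float:
    """Expected number of sites first edited in the interval [t0, t1].

    Illustrative assumption (not taken from the paper): each site can be
    edited only once, with an exponential waiting time of the given rate,
    so a site is still unedited at time t with probability exp(-rate * t).
    """
    if not 0 <= t0 <= t1:
        raise ValueError("need 0 <= t0 <= t1")
    return num_sites * (math.exp(-rate * t0) - math.exp(-rate * t1))


# Two intervals of equal length: one at the root, one further from it.
early = expected_new_edits(num_sites=30, rate=0.5, t0=0.0, t1=1.0)
late = expected_new_edits(num_sites=30, rate=0.5, t0=3.0, t1=4.0)
```

Under these assumptions the later interval receives fewer new edits than the earlier one even though both span the same amount of time. A branch length obtained by counting mutations would therefore shrink for branches further from the root, which is consistent with the abstract's point that parsimony branch lengths are not proportional to time.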

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/243d/10966877/ae2bd0da7b66/nihpp-2024.03.05.583638v2-f0001.jpg
