Qi Yuanyuan, Pradhan Dikshant, El-Kebir Mohammed
1Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA.
2Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA.
Algorithms Mol Biol. 2019 Sep 3;14:19. doi: 10.1186/s13015-019-0155-6. eCollection 2019.
Tumors exhibit extensive intra-tumor heterogeneity, the presence of groups of cellular populations with distinct sets of somatic mutations. This heterogeneity is the result of an evolutionary process, described by a phylogenetic tree. In addition to enabling clinicians to devise patient-specific treatment plans, phylogenetic trees of tumors enable researchers to decipher the mechanisms of tumorigenesis and metastasis. However, the problem of reconstructing a phylogenetic tree given bulk sequencing data from a tumor is more complicated than the classic phylogeny inference problem. Rather than observing the leaves of directly, we are given mutation frequencies that are the result of mixtures of the leaves of . The majority of current tumor phylogeny inference methods employ the perfect phylogeny evolutionary model. The underlying Perfect Phylogeny Mixture (PPM) combinatorial problem typically has multiple solutions.
We prove that determining the exact number of solutions to the PPM problem is #P-complete and hard to approximate within a constant factor. Moreover, we show that sampling solutions uniformly at random is hard as well. On the positive side, we provide a polynomial-time computable upper bound on the number of solutions and introduce a simple rejection-sampling based scheme that works well for small instances. Using simulated and real data, we identify factors that contribute to and counteract non-uniqueness of solutions. In addition, we study the sampling performance of current methods, identifying significant biases.
Awareness of non-uniqueness of solutions to the PPM problem is key to drawing accurate conclusions in downstream analyses based on tumor phylogenies. This work provides the theoretical foundations for non-uniqueness of solutions in tumor phylogeny inference from bulk DNA samples.
肿瘤呈现出广泛的肿瘤内异质性,即存在具有不同体细胞突变集的细胞群体。这种异质性是一个进化过程的结果,可用系统发育树来描述。除了能让临床医生制定针对患者的治疗方案外,肿瘤的系统发育树还能使研究人员解读肿瘤发生和转移的机制。然而,根据肿瘤的批量测序数据重建系统发育树的问题比经典的系统发育推断问题更为复杂。我们并非直接观察系统发育树的叶子,而是得到了作为叶子混合物结果的突变频率。当前大多数肿瘤系统发育推断方法采用完美系统发育进化模型。潜在的完美系统发育混合(PPM)组合问题通常有多个解。
我们证明确定PPM问题解的精确数量是#P完全问题,并且难以在常数因子内近似求解。此外,我们表明随机均匀采样解也很困难。从积极的方面来看,我们提供了一个多项式时间可计算的解数量上限,并引入了一种基于简单拒绝采样的方案,该方案对小实例效果良好。使用模拟数据和真实数据,我们确定了导致和抵消解的非唯一性的因素。此外,我们研究了当前方法的采样性能,发现了显著的偏差。
认识到PPM问题解的非唯一性是在基于肿瘤系统发育的下游分析中得出准确结论的关键。这项工作为从批量DNA样本进行肿瘤系统发育推断中解的非唯一性提供了理论基础。