Department of Biosystems Science and Engineering, ETH Zürich, Basel 4058, Switzerland.
SIB, Swiss Institute of Bioinformatics, Basel 4058, Switzerland.
Bioinformatics. 2020 Dec 8;36(19):4854-4859. doi: 10.1093/bioinformatics/btaa599.
The high resolution of single-cell DNA sequencing (scDNA-seq) offers great potential to resolve intratumor heterogeneity (ITH) by distinguishing clonal populations based on their mutation profiles. However, the increasing size of scDNA-seq datasets and technical limitations, such as high error rates and a large proportion of missing values, complicate this task and limit the applicability of existing methods.
Here, we introduce BnpC, a novel non-parametric method to cluster individual cells into clones and infer their genotypes based on their noisy mutation profiles. We benchmarked our method comprehensively against state-of-the-art methods on simulated data of various sizes, and applied it to three cancer scDNA-seq datasets. On simulated data, BnpC compared favorably with current methods in terms of accuracy, runtime and scalability. Its inferred genotypes were the most accurate, especially on highly heterogeneous data, and it was the only method able to run and produce results on datasets with 5000 cells. On tumor scDNA-seq data, BnpC was able to identify clonal populations that were missed by the original cluster analysis but supported by supplementary experimental data. With ever-growing scDNA-seq datasets, scalable and accurate methods such as BnpC will become increasingly relevant, not only to resolve ITH but also as a preprocessing step to reduce data size.
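To make the idea of assigning noisy mutation profiles to clones concrete, the following is a minimal sketch and not BnpC's actual algorithm. It assumes that the clone genotypes are already known and that the false-positive and false-negative rates are fixed, whereas BnpC infers the number of clones and their genotypes from the data. The function names, the `None` encoding for missing values, and the error rates are all illustrative assumptions.

```python
import math

MISSING = None  # assumption: missing entries are encoded as None


def log_likelihood(profile, genotype, fp=0.01, fn=0.2):
    """Log-probability of a noisy binary profile given a clone genotype.

    fp and fn are assumed (hypothetical) false-positive and
    false-negative rates; they are not values from the paper.
    """
    total = 0.0
    for observed, true in zip(profile, genotype):
        if observed is MISSING:
            continue  # a missing value carries no information
        if true == 1:
            p = 1.0 - fn if observed == 1 else fn
        else:
            p = fp if observed == 1 else 1.0 - fp
        total += math.log(p)
    return total


def assign_cells(profiles, genotypes, fp=0.01, fn=0.2):
    """Assign each cell to the clone genotype with the highest likelihood."""
    return [
        max(
            range(len(genotypes)),
            key=lambda k: log_likelihood(p, genotypes[k], fp, fn),
        )
        for p in profiles
    ]


# Two hypothetical clones over four mutation sites.
genotypes = [[1, 1, 0, 0], [0, 0, 1, 1]]
profiles = [
    [1, None, 0, 0],  # clone 0 with one missing site
    [0, 0, 1, None],  # clone 1 with one missing site
    [1, 0, 0, 0],     # clone 0 with one false negative
]
labels = assign_cells(profiles, genotypes)
print(labels)
```

In this toy setting, the third cell is still assigned to the first clone because a single false negative is far more likely under the assumed rates than one false positive combined with two false negatives.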
BnpC is freely available under the MIT license at https://github.com/cbg-ethz/BnpC.
Supplementary data are available at Bioinformatics online.