Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133, Milan, Italy.
IRCCS Humanitas Research Hospital, Via Manzoni 56, 20089, Rozzano, Milan, Italy.
BMC Bioinformatics. 2024 May 8;25(1):180. doi: 10.1186/s12859-024-05793-8.
High-throughput sequencing (HTS) has become the gold standard for variant analysis in cancer research. However, somatic variants may occur at low fractions because of contamination from normal cells or tumor heterogeneity, which poses a significant challenge for standard HTS analysis pipelines. The problem is exacerbated when tumor DNA is scarce, as with circulating tumor DNA in plasma. Assessing the sensitivity and detection capability of HTS approaches in such cases is paramount but time-consuming and expensive, because it requires specialized experimental protocols and a sufficient number of samples for processing and analysis. To overcome these limitations, we propose a new computational approach specifically designed to generate artificial datasets suitable for this task: it simulates ultra-deep targeted sequencing data with low-fraction variants, and we demonstrate the effectiveness of these datasets in benchmarking low-fraction variant calling.
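To give a rough sense of why low-fraction variants call for ultra-deep data, the sketch below computes the probability of sampling at least k variant-supporting reads at a site under a simple binomial model. This is an illustrative assumption and not part of the published method: the function name, the threshold k, and the example depths and fraction are hypothetical, and sequencing errors are ignored.

```python
from math import comb


def prob_at_least_k_alt_reads(depth: int, vaf: float, k: int) -> float:
    """Return P(X >= k) for X ~ Binomial(depth, vaf).

    depth: number of reads covering the site (hypothetical input)
    vaf:   true variant allele fraction at the site
    k:     minimum number of variant-supporting reads required
    Simplifying assumption: reads are independent and error-free.
    """
    if k <= 0:
        return 1.0
    # Sum the probabilities of seeing fewer than k variant reads,
    # then take the complement.
    below = sum(
        comb(depth, i) * vaf**i * (1.0 - vaf) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - below


if __name__ == "__main__":
    # Illustrative values only: a 0.1% variant at increasing depths.
    for depth in (100, 1_000, 10_000):
        p = prob_at_least_k_alt_reads(depth, vaf=0.001, k=3)
        print(f"depth={depth:>6}  P(at least 3 variant reads)={p:.4f}")
```

Under this simplified model the probability rises with depth, which is consistent with the emphasis on ultra-deep targeted sequencing for low-fraction variants.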
Our approach generates artificial raw reads that mimic real data without relying on pre-existing data by using NEAT, a fine-grained read simulator that generates artificial datasets using models learned from multiple datasets. It then incorporates low-fraction variants to simulate somatic mutations in samples with minimal tumor DNA content. To prove the suitability of the created artificial datasets for benchmarking low-fraction variant calling, we used them as ground truth to evaluate the performance of widely used variant calling algorithms. This allowed us to define tuned parameter values for major variant callers, considerably improving their detection of very low-fraction variants.
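The ground-truth evaluation described above can be sketched as a set comparison between the variants inserted into the artificial data and those reported by a caller. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the `Variant` tuple, the `benchmark` function, and the choice of precision, recall, and F1 as metrics are hypothetical, and variants are assumed to be already normalized so that exact matching on (chromosome, position, ref, alt) is meaningful.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Metrics(NamedTuple):
    tp: int
    fp: int
    fn: int
    precision: float
    recall: float
    f1: float


def benchmark(truth: Set[Variant], called: Set[Variant]) -> Metrics:
    """Compare called variants against the simulated ground truth.

    Assumes both sets use the same normalized representation,
    so a true positive is an exact match of the variant key.
    """
    tp = len(truth & called)   # inserted variants that were called
    fp = len(called - truth)   # calls with no inserted counterpart
    fn = len(truth - called)   # inserted variants that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return Metrics(tp, fp, fn, precision, recall, f1)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the paper.
    truth = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"), ("chr2", 42, "C", "T")}
    called = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"), ("chr3", 7, "T", "G")}
    print(benchmark(truth, called))
```

Running such a comparison once per parameter setting of a caller is one simple way to see how tuning affects the detection of very low-fraction variants.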
Our findings highlight both the pivotal role of our approach in creating adequate artificial datasets with low tumor fraction, which facilitates rapid prototyping and benchmarking of algorithms for this type of dataset, and the need to advance low-fraction variant calling techniques.