Improving somatic exome sequencing performance by biological replicates.

Affiliation

Department of Computer Engineering, Istanbul Technical University, 34469, Istanbul, Turkey.

Publication information

BMC Bioinformatics. 2024 Mar 22;25(1):124. doi: 10.1186/s12859-024-05742-5.

Abstract

BACKGROUND

Next-generation sequencing (NGS) technologies offer fast and inexpensive identification of DNA sequences. Somatic sequencing is among the primary applications of NGS: acquired (non-inherited) variants are identified by comparing diseased and healthy tissues from the same individual. Somatic mutations in genetic diseases such as cancer are tightly associated with genomic instability. Genomic instability increases heterogeneity, which further complicates sequencing, a task already made difficult by short read lengths and repetitive regions in human DNA. The result is low concordance among studies and limited reproducibility. This limitation is a significant problem because the mutations identified by somatic sequencing are major biomarkers for diagnosis and the primary input of targeted therapies. Benchmarking studies have been conducted to assess error rates and increase reproducibility. Unfortunately, the number of somatic benchmarking sets is very limited because true somatic variants are difficult to validate. Moreover, most NGS benchmarking studies are based on the relatively simpler task of germline (inherited) sequencing. Recently, a comprehensive somatic sequencing benchmarking set was published by Sequencing Quality Control Phase 2 (SEQC2). We chose this dataset for our experiments because it is a well-validated, cancer-focused dataset that includes many tumor/normal biological replicates. Our study has two primary goals. The first goal is to determine how replicate-based consensus approaches can improve the accuracy of somatic variant detection systems. The second goal is to develop highly predictive machine learning (ML) models by employing replicate-based consensus variants as labels during the training phase.

RESULTS

Ensemble approaches that combine alternative algorithms are relatively common; here, as an alternative, we study the performance enhancement potential of biological replicates. We first developed replicate-based consensus approaches that utilize the biological replicates available in this study to improve variant calling performance. Subsequently, we trained ML models using these biological replicates and achieved performance comparable to that of optimal ML models, i.e., those trained on high-confidence variants identified in advance.
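The abstract does not spell out the consensus rule, so the following is only an illustrative sketch of what a replicate-based consensus could look like, assuming a simple support threshold: a variant is kept if it is called in at least `min_support` replicates. The function names `consensus_variants` and `consensus_labels`, the `(chrom, pos, ref, alt)` variant key, and the threshold itself are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def consensus_variants(replicate_calls: List[Set[Variant]], min_support: int) -> Set[Variant]:
    """Return variants called in at least `min_support` replicates.

    Assumption: a plain support-count rule; the paper may use a different rule.
    """
    if not 1 <= min_support <= len(replicate_calls):
        raise ValueError("min_support must be between 1 and the number of replicates")
    support: Counter = Counter()
    for calls in replicate_calls:
        # Each replicate contributes at most one vote per variant.
        support.update(set(calls))
    return {variant for variant, count in support.items() if count >= min_support}


def consensus_labels(replicate_calls: List[Set[Variant]], min_support: int) -> Dict[Variant, int]:
    """Label every candidate variant 1 (in the consensus) or 0 (not in it).

    Such labels could stand in for pre-validated truth labels when training a model.
    """
    consensus = consensus_variants(replicate_calls, min_support)
    candidates = set().union(*replicate_calls)
    return {variant: int(variant in consensus) for variant in candidates}


if __name__ == "__main__":
    # Toy calls from three hypothetical tumor/normal replicate pairs.
    rep1 = {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"), ("chr3", 300, "C", "A")}
    rep2 = {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")}
    rep3 = {("chr1", 100, "A", "T"), ("chr4", 400, "T", "G")}
    for variant, label in sorted(consensus_labels([rep1, rep2, rep3], min_support=2).items()):
        print(variant, label)
```

Raising `min_support` trades sensitivity for precision: variants seen in only one replicate are more likely to be artifacts, but a strict threshold can also discard true low-frequency somatic variants.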

CONCLUSIONS

Our replicate-based consensus approach can be used to improve variant calling performance and develop efficient ML models. Given the relative ease of obtaining biological replicates, this strategy allows for the development of efficient ML models tailored to specific datasets or scenarios.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/40d6/10958848/de3e5954384b/12859_2024_5742_Fig1_HTML.jpg
