School of Computer Science and Technology, Liaocheng University, Liaocheng, China.
Orthopedics Department, Liaocheng People's Hospital, Liaocheng, China.
PeerJ. 2024 Jul 26;12:e17748. doi: 10.7717/peerj.17748. eCollection 2024.
Tandem duplication (TD) is a common and important type of structural variation in the human genome. TDs have been shown to play an essential role in many diseases, including cancer. However, it is difficult to accurately detect TDs due to the uneven distribution of reads and the inherent complexity of next-generation sequencing (NGS) data.
This article proposes a method called DTDHM (detection of tandem duplications based on hybrid methods), which uses NGS data to detect TDs in a single sample. DTDHM builds a pipeline that integrates read depth (RD), split read (SR), and paired-end mapping (PEM) signals. To address the imbalance between normal and abnormal samples, DTDHM uses the K-nearest neighbor (KNN) algorithm for multi-feature classification. Qualified split reads and discordant reads are then extracted and analyzed to locate variation sites accurately. This article compares DTDHM with three other methods on 450 simulated datasets and five real datasets.
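The KNN step can be illustrated with a minimal sketch. It assumes each genomic bin is described by a small feature vector and scores a bin by its mean distance to its k nearest neighbors, so that sparse, abnormal bins stand out from the dense cluster of normal ones. The feature choice (normalized read depth and scaled mapping quality), the function name `knn_outlier_scores`, and the toy values are all hypothetical; the abstract does not specify how DTDHM applies KNN, so this is not the authors' implementation.

```python
import math

def knn_outlier_scores(features, k=3):
    """Return, for each point, the mean Euclidean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(features):
        # Distances from point i to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(features) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical per-bin features: (normalised read depth, mapping quality scaled to 0..1).
bins = [
    (1.00, 1.00),
    (1.02, 0.98),
    (0.98, 1.00),
    (1.01, 0.97),
    (1.95, 0.95),  # elevated depth, as a duplicated region might show
    (0.99, 1.00),
]

scores = knn_outlier_scores(bins, k=3)
candidate = scores.index(max(scores))  # bin with the largest KNN distance
```

In this toy input the bin with roughly doubled read depth receives the largest score. A real pipeline would still need a threshold on the scores and the SR and PEM evidence to refine the breakpoints.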
Across the 450 simulated datasets, DTDHM consistently achieved the highest F1-score. The average F1-scores of DTDHM, SVIM, TARDIS, and TIDDIT were 80.0%, 56.2%, 43.4%, and 67.1%, respectively. The F1-score of DTDHM varied over a small range, making its detection performance the most stable, and its average F1-score was about 1.2 times that of the second-best method (TIDDIT). Most of the boundary biases of DTDHM fluctuated around 20 bp, and its boundary accuracy was better than that of TARDIS and TIDDIT. In the real data experiments, five sequencing samples (NA19238, NA19239, NA19240, HG00266, and NA12891) were used to test DTDHM. The results showed that DTDHM had the highest overlap density score (ODS) and F1-score of the four methods.
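For reference, the F1-score is the harmonic mean of precision and sensitivity (recall), and the 1.2-fold figure follows from the reported averages. The sketch below uses the standard F1 definition and the four average values quoted above; the names `f1_score` and `avg_f1` are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Average F1-scores (%) on the simulated datasets, as reported in the abstract.
avg_f1 = {"DTDHM": 80.0, "SVIM": 56.2, "TARDIS": 43.4, "TIDDIT": 67.1}

# Rank the methods and compare the best with the runner-up.
ranked = sorted(avg_f1, key=avg_f1.get, reverse=True)
best, runner_up = ranked[0], ranked[1]
ratio = avg_f1[best] / avg_f1[runner_up]  # 80.0 / 67.1, roughly 1.19
```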
Compared with the other three methods, DTDHM achieved better sensitivity, precision, F1-score, and boundary bias. These results indicate that DTDHM can serve as a reliable tool for detecting TDs from NGS data, especially for samples with low coverage depth and low tumor purity.