Zhai Yue, Bardel Claire, Vallée Maxime, Iwaz Jean, Roy Pascal
Université Lyon 1, Lyon, France.
Université de Lyon, Lyon, France.
Front Genet. 2023 Mar 16;14:1148147. doi: 10.3389/fgene.2023.1148147. eCollection 2023.
To improve the performance of individual DNA sequencing results, researchers often use replicates from the same individual together with various statistical clustering models to reconstruct a high-performance callset. Here, three technical replicates of genome NA12878 were considered and five model types (consensus, latent class, Gaussian mixture, Kamila-adapted k-means, and random forest) were compared on four performance indicators: sensitivity, precision, accuracy, and F1-score. Compared with using no combination model, i) the consensus model improved precision by 0.1%; ii) the latent class model improved precision by 1% (from 97% to 98%) without compromising sensitivity (98.9%); iii) the Gaussian mixture model and random forest provided callsets with higher precision (both >99%) but lower sensitivity; and iv) Kamila increased precision (>99%) while keeping a high sensitivity (98.8%), showing the best overall performance. According to the precision and F1-score indicators, the compared unsupervised clustering models that combine multiple callsets can improve sequencing performance relative to previously used supervised models. Among the models compared, the Gaussian mixture model and Kamila offered non-negligible improvements in precision and F1-score. These models may thus be recommended for callset reconstruction (from either biological or technical replicates) for diagnostic or precision medicine purposes.
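As a rough illustration of the simplest approach named in the abstract, the sketch below combines three replicate callsets by majority vote and computes the four performance indicators against a reference set. It is a toy example under stated assumptions, not the authors' pipeline: the variant identifiers, the 2-of-3 support threshold, the candidate-site universe used to count true negatives, and the function names `consensus_callset` and `performance` are all hypothetical.

```python
from collections import Counter


def consensus_callset(replicates, min_support=2):
    """Keep variants called in at least `min_support` replicates.

    The 2-of-3 default is an assumption made for illustration only.
    """
    counts = Counter(v for calls in replicates for v in set(calls))
    return {v for v, n in counts.items() if n >= min_support}


def _ratio(num, den):
    # Return 0.0 instead of raising when a denominator is empty.
    return num / den if den else 0.0


def performance(calls, truth, candidates):
    """Sensitivity, precision, accuracy and F1-score of a callset.

    `candidates` is the (assumed) universe of evaluated sites; it is
    needed only to count true negatives for accuracy.
    """
    tp = len(calls & truth)               # called and true
    fp = len(calls - truth)               # called but not true
    fn = len(truth - calls)               # true but missed
    tn = len(candidates - calls - truth)  # neither called nor true
    sensitivity = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Hypothetical toy data: eight candidate sites, four of them true variants.
candidates = {f"v{i}" for i in range(1, 9)}
truth = {"v1", "v2", "v3", "v4"}
rep1 = {"v1", "v2", "v3", "v5"}
rep2 = {"v1", "v2", "v4", "v6"}
rep3 = {"v1", "v3", "v4", "v5"}

combined = consensus_callset([rep1, rep2, rep3])
single_metrics = performance(rep1, truth, candidates)
combined_metrics = performance(combined, truth, candidates)

print("single replicate:", single_metrics)
print("consensus:       ", combined_metrics)
```

In this toy case the consensus callset recovers every true variant while still carrying one false positive shared by two replicates, which shows why a simple vote can raise sensitivity without removing errors that recur across replicates.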