Simulation of African and non-African low and high coverage whole genome sequence data to assess variant calling approaches.

Affiliations

Faculty of Health Sciences, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa.

Department of Statistical Sciences, University of Cape Town, Cape Town, South Africa.

Publication information

Brief Bioinform. 2021 Jul 20;22(4). doi: 10.1093/bib/bbaa366.

DOI:10.1093/bib/bbaa366

PMID:33341897

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8294538/

Abstract

Current variant calling (VC) approaches were designed to leverage populations of long-range haplotypes and were benchmarked on populations of European descent, whereas most genetic diversity is found in non-European populations, such as African populations. When applied to these genetically diverse populations, VC tools may produce false positive and false negative results, which can lead to misleading conclusions about the prioritization of mutations and the clinical relevance and actionability of genes. The central question is which tool or pipeline achieves high sensitivity and precision when analysing African data at either low or high sequence coverage, given the high genetic diversity and heterogeneity of these data. Here, a total of 100 synthetic Whole Genome Sequencing (WGS) samples, mimicking the genetic profiles of African and European subjects at specific coverage levels (high/low), were generated to assess the performance of nine different VC tools on these contrasting datasets. Performance was assessed in terms of false positive and false negative call rates by comparing the simulated golden variants with the variants identified by each VC tool. Combining our results on sensitivity and positive predictive value (PPV), VarDict [PPV = 0.999 and Matthews correlation coefficient (MCC) = 0.832] and BCFtools (PPV = 0.999 and MCC = 0.813) perform best on African population data at both high and low coverage. Overall, current VC tools produce higher false positive and false negative rates when analysing African data than when analysing European data. This highlights the need to develop VC approaches with high sensitivity and precision tailored to populations characterized by high genetic variation and low linkage disequilibrium.
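The evaluation described above compares a simulated truth set with each tool's calls and summarizes the result as sensitivity, PPV and MCC. The following is a minimal sketch of that comparison, not the authors' pipeline. The function names `compare_calls` and `vc_metrics` are hypothetical, variants are represented as simple `(chrom, pos, ref, alt)` tuples, and the number of true negatives is derived from an assumed count of evaluated sites (`n_sites`), since the abstract does not state how true negatives were defined.

```python
import math


def compare_calls(truth, called, n_sites):
    """Count TP, FP, FN, TN by exact matching of (chrom, pos, ref, alt) tuples.

    n_sites is an assumed total number of evaluated sites, used only to
    derive the true-negative count.
    """
    tp = len(truth & called)   # variants present in both sets
    fp = len(called - truth)   # called but not in the simulated truth set
    fn = len(truth - called)   # in the truth set but missed by the caller
    tn = n_sites - tp - fp - fn
    return tp, fp, fn, tn


def vc_metrics(tp, fp, fn, tn):
    """Return sensitivity, PPV and Matthews correlation coefficient."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "ppv": ppv, "mcc": mcc}


# Toy data: four simulated variants, three recovered, one spurious call.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr2", 900, "T", "C"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr3", 7, "A", "T"),
}

counts = compare_calls(truth, called, n_sites=100)
metrics = vc_metrics(*counts)
print(counts, metrics)
```

In practice, variant comparison usually requires normalization of variant representation before matching, which this sketch omits.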
