Li Heng
Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck St, Boston, MA 02215, USA.
Department of Data Science, Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA 02215, USA.
ArXiv. 2025 Aug 8:arXiv:2507.03718v2.
While benchmarks on short-read variant calling suggest a low error rate below 0.5%, these estimates apply only to predefined confident regions. For a human sample without such regions, the error rate could be 10 times higher. Although multiple sets of easy regions have been identified to alleviate the issue, they fail to consider non-reference samples or are biased towards existing short-read data or aligners.
Here, using hundreds of high-quality human assemblies, we derived a set of sample-agnostic easy regions where short-read variant calling reaches high accuracy. These regions cover 88.2% of GRCh38, 92.2% of coding regions and 96.3% of ClinVar pathogenic variants. They achieve a good balance between coverage and easiness and can be generated for other human assemblies or for species with multiple well-assembled genomes.
This resource provides a convenient and powerful way to filter spurious variant calls for clinical or research human samples.
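As an illustration of the filtering use case, the following is a minimal Python sketch, not the authors' tooling. It assumes the easy regions are distributed as a BED file (0-based, half-open intervals) and that variant calls are in VCF text format (1-based positions); the function names `load_bed`, `in_regions`, and `filter_vcf` are hypothetical. Only the POS field of each record is tested, so an indel that starts inside a region but extends past its end would still be kept.

```python
from bisect import bisect_right
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into sorted, merged intervals per contig.

    Returns {contig: (starts, ends)} with 0-based, half-open coordinates.
    """
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:          # overlaps or touches the previous interval
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    return merged


def in_regions(regions, chrom, pos):
    """Return True if the 1-based position `pos` falls inside a region."""
    if chrom not in regions:
        return False
    starts, ends = regions[chrom]
    p = pos - 1                              # convert VCF POS to 0-based
    i = bisect_right(starts, p) - 1          # last interval starting at or before p
    return i >= 0 and p < ends[i]


def filter_vcf(vcf_lines, regions):
    """Yield header lines and the records whose POS lies in `regions`."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if in_regions(regions, chrom, int(pos)):
            yield line
```

In practice, established interval tools such as bedtools or bcftools would normally be used for this step; the sketch only makes the coordinate handling explicit.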