A computational framework for improving genetic variants identification from 5,061 sheep sequencing data.

Author information

Xie Shangqian, Isaacs Karissa, Becker Gabrielle, Murdoch Brenda M

Affiliations

Department of Animal, Veterinary & Food Sciences, University of Idaho, Moscow, ID, USA.

Superior Farms, California, USA.

Publication information

J Anim Sci Biotechnol. 2023 Oct 2;14(1):127. doi: 10.1186/s40104-023-00923-3.

DOI:10.1186/s40104-023-00923-3

PMID:37779189

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10544426/

Abstract

BACKGROUND

Pan-genomics is a recently emerging strategy that can provide a more comprehensive characterization of genetic variation. Joint calling is routinely used to combine variants identified across multiple related samples. However, for population-scale genotyping, the improvement in variant identification gained from the mutual support information of multiple samples remains quite limited.

RESULTS

In this study, we developed a computational framework for joint calling of genetic variants from 5,061 sheep by incorporating sequencing error and optimizing the mutual support information from multiple samples' data. Variants were identified from multiple samples in four steps: (1) Probabilities of variants from two widely used algorithms, GATK and Freebayes, were calculated with a Poisson model incorporating the base sequencing error potential; (2) Variants with high mapping quality, or consistently identified in at least two samples by GATK and Freebayes, were used to construct the raw high-confidence identification (rHID) variants database; (3) The high-confidence variants identified in each single sample were ordered by probability value and controlled by false discovery rate (FDR) using the rHID database; (4) To avoid eliminating potentially true variants in the rHID database, the variants that failed FDR were reexamined to rescue potentially true variants and to ensure highly accurate variant identification. The results indicated that, compared with the raw variants, the percentage of concordant SNPs and Indels between Freebayes and GATK was significantly improved by 12%-32% after applying our new method. The method also found low-frequency variants in individual sheep that are involved in several traits, including nipple number (GPC5), scrapie pathology (PAPSS2), seasonal reproduction and litter size (GRM1), coat color (RAB27A), and lentivirus susceptibility (TMEM154).
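The four steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact Poisson parameterization, the mapping-quality threshold, or the FDR estimator, so the sketch assumes (a) the expected number of erroneous alternate reads is depth times a fixed per-base error rate, (b) a variant enters rHID if its mapping quality is at least `mq_min` or if both callers report it in at least two shared samples, and (c) the FDR at each rank is estimated as the fraction of ranked variants not found in rHID. The `Call` record, the function names, and all thresholds are hypothetical.

```python
import math
from collections import defaultdict
from typing import NamedTuple


class Call(NamedTuple):
    """One variant call from one caller in one sample (hypothetical record)."""
    sample: str
    caller: str      # "gatk" or "freebayes"
    variant: str     # any stable key, e.g. "chr1:1000:A>G"
    alt_reads: int
    depth: int
    mapq: float


def poisson_tail(k, lam):
    """Step 1: P(X >= k) for X ~ Poisson(lam), summed directly from the tail.
    Intended for the small lam typical of sequencing error (lam = depth * error)."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total, i = 0.0, k
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)


def build_rhid(calls, mq_min=50.0, min_samples=2):
    """Step 2: collect variants with high mapping quality, or reported by
    both callers in at least `min_samples` shared samples (assumed rule)."""
    samples = defaultdict(lambda: defaultdict(set))
    rhid = set()
    for c in calls:
        samples[c.variant][c.caller].add(c.sample)
        if c.mapq >= mq_min:
            rhid.add(c.variant)
    for variant, by_caller in samples.items():
        if len(by_caller["gatk"] & by_caller["freebayes"]) >= min_samples:
            rhid.add(variant)
    return rhid


def filter_sample(calls, sample, rhid, error_rate=0.01, alpha=0.05):
    """Steps 3 and 4: rank one sample's variants by p-value, keep the longest
    prefix whose estimated FDR (share of variants absent from rHID) is at
    most alpha, then rescue failed variants that are present in rHID."""
    pvals = {}
    for c in calls:
        if c.sample == sample:
            p = poisson_tail(c.alt_reads, c.depth * error_rate)
            pvals[c.variant] = min(p, pvals.get(c.variant, 1.0))
    ranked = sorted(pvals, key=pvals.get)
    cutoff, unsupported = 0, 0
    for rank, variant in enumerate(ranked, start=1):
        unsupported += variant not in rhid
        if unsupported / rank <= alpha:
            cutoff = rank
    passed = set(ranked[:cutoff])
    rescued = {v for v in ranked[cutoff:] if v in rhid}
    return passed, rescued


# Toy data: v1 has high mapping quality, v2 is shared by both callers in
# two samples, v3 and v4 are single-caller, low-quality calls.
demo_calls = [
    Call("s1", "gatk", "v1", 12, 30, 60.0),
    Call("s1", "freebayes", "v1", 12, 30, 60.0),
    Call("s1", "gatk", "v2", 2, 30, 40.0),
    Call("s1", "freebayes", "v2", 2, 30, 40.0),
    Call("s2", "gatk", "v2", 8, 25, 40.0),
    Call("s2", "freebayes", "v2", 8, 25, 40.0),
    Call("s1", "gatk", "v3", 1, 30, 20.0),
    Call("s1", "gatk", "v4", 4, 30, 20.0),
]
demo_rhid = build_rhid(demo_calls)
demo_passed, demo_rescued = filter_sample(demo_calls, "s1", demo_rhid)
print(sorted(demo_rhid), sorted(demo_passed), sorted(demo_rescued))
```

In this toy example, a variant that falls below the FDR cutoff in one sample can still be retained when the rHID database supports it, which is the role the abstract assigns to step (4). A real pipeline would read the calls from the GATK and Freebayes VCF output and would use per-base quality scores rather than a single fixed error rate.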

CONCLUSION

The new method uses a computational strategy to reduce the number of false positives while improving the identification of genetic variants. The strategy incurs no extra cost, because it requires no additional samples or sequencing data, and it identifies rare variants that can be important for practical applications in animal breeding.
