School of Life Sciences, Arizona State University, 427 E Tyler Mall, Tempe, AZ 85281, USA.
Department of Anthropology, University of Utah, 260 S Central Drive, Carolyn and Kem Gardner Commons, Suite 4625, Salt Lake City, UT 84112, USA.
Gigascience. 2019 Jul 1;8(7). doi: 10.1093/gigascience/giz074.
Mammalian X and Y chromosomes share a common evolutionary origin and retain regions of high sequence similarity. Similar sequence content can confound the mapping of short next-generation sequencing reads to a reference genome. It is therefore possible that the presence of both sex chromosomes in a reference genome can cause technical artifacts in genomic data and affect downstream analyses and applications. Understanding this problem is critical for medical genomics and population genomic inference.
Here, we characterize how sequence homology can affect analyses on the sex chromosomes and present XYalign, a new tool that (1) facilitates the inference of sex chromosome complement from next-generation sequencing data; (2) corrects erroneous read mapping on the sex chromosomes; and (3) tabulates and visualizes important metrics for quality control such as mapping quality, sequencing depth, and allele balance. We find that sequence homology affects read mapping on the sex chromosomes and this has downstream effects on variant calling. However, we show that XYalign can correct mismapping, resulting in more accurate variant calling. We also show how metrics output by XYalign can be used to identify XX and XY individuals across diverse sequencing experiments, including low- and high-coverage whole-genome sequencing, and exome sequencing. Finally, we discuss how the flexibility of the XYalign framework can be leveraged for other uses including the identification of aneuploidy on the autosomes. XYalign is available open source under the GNU General Public License (version 3).
Sex chromosome sequence homology causes the mismapping of short reads, which in turn affects downstream analyses. XYalign provides a reproducible framework to correct mismapping and improve variant calling on the sex chromosomes.
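To illustrate the kind of depth-based reasoning that can support inference of sex chromosome complement, the following is a minimal sketch and not XYalign's own code or interface. It assumes mean sequencing depth per chromosome has already been computed, that a single autosome (here `chr19`, chosen arbitrarily) serves as the diploid baseline, and that the function names and cutoff values are hypothetical and would need tuning for each data set. The underlying expectation is that, relative to a diploid autosome, an XX sample shows X depth near 1.0 and Y depth near 0, while an XY sample shows both near 0.5.

```python
def normalized_depth(chrom_depths, autosome="chr19"):
    """Divide each chromosome's mean depth by the mean depth of one autosome.

    `autosome` is an arbitrary, assumed baseline; any well-mapped autosome
    could be substituted.
    """
    baseline = chrom_depths[autosome]
    if baseline <= 0:
        raise ValueError("autosomal depth must be positive")
    return {chrom: depth / baseline for chrom, depth in chrom_depths.items()}


def guess_complement(chrom_depths, autosome="chr19",
                     x_split=0.75, y_absent=0.1, y_present=0.25):
    """Return 'XX', 'XY', or 'uncertain' from relative X and Y depth.

    All three cutoffs are hypothetical illustration values, not values
    taken from the paper.
    """
    norm = normalized_depth(chrom_depths, autosome)
    x_ratio = norm["chrX"]
    y_ratio = norm["chrY"]

    # Two X copies and essentially no Y signal (a little Y depth can
    # still appear in XX samples because of mismapped homologous reads).
    if x_ratio >= x_split and y_ratio < y_absent:
        return "XX"
    # One X copy and one Y copy.
    if x_ratio < x_split and y_ratio >= y_present:
        return "XY"
    # Anything else (for example, high X together with high Y) is left
    # for manual inspection rather than forced into a category.
    return "uncertain"


if __name__ == "__main__":
    sample = {"chr19": 30.0, "chrX": 15.5, "chrY": 14.0}
    print(guess_complement(sample))
```

In practice, depth alone can be ambiguous for low-coverage or exome data, which is why metrics such as mapping quality and allele balance are examined alongside it.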