Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA.
Department of Pediatrics, School of Medicine, University of California San Diego, La Jolla, CA, 92093, USA.
Nat Commun. 2022 Jun 9;13(1):3221. doi: 10.1038/s41467-022-30930-3.
The human genome contains hundreds of low-copy repeats (LCRs) that are challenging to analyze with short-read sequencing technologies because of extensive copy number variation and ambiguous read mapping. Copy number and sequence variants in more than 150 duplicated genes that overlap LCRs have been implicated in monogenic and complex human diseases. We describe a computational tool, Parascopy, for estimating the aggregate and paralog-specific copy number of duplicated genes from whole-genome sequencing (WGS) data. Parascopy is an efficient method that jointly analyzes reads mapped to different repeat copies without the need for global realignment. It leverages multiple samples to mitigate sequencing bias and to identify reliable paralogous sequence variants (PSVs) that differentiate repeat copies. Analysis of WGS data for 2504 individuals from diverse populations showed that Parascopy is robust to sequencing bias, is more accurate than existing methods, and enables prioritization of pathogenic copy number changes in duplicated genes.
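The distinction between aggregate and paralog-specific copy number can be illustrated with a small sketch. This is not Parascopy's algorithm or its API; it is a simplified depth-ratio calculation under stated assumptions: read depth is pooled across all repeat copies (so ambiguous mapping between copies does not matter), a background depth from unique diploid sequence is available, and each PSV provides read counts supporting the allele of each copy. The function names and inputs are hypothetical.

```python
from statistics import mean


def aggregate_copy_number(copy_depths, background_depth, background_ploidy=2):
    """Estimate total copy number across all repeat copies.

    copy_depths: mean read depth observed at each repeat copy (hypothetical input).
    background_depth: mean depth in unique regions with known ploidy.
    """
    pooled_depth = sum(copy_depths)  # pooling sidesteps ambiguous mapping between copies
    return round(background_ploidy * pooled_depth / background_depth)


def paralog_specific_copy_number(agg_cn, psv_allele_counts):
    """Split an aggregate copy number between copies using PSV read fractions.

    psv_allele_counts: one tuple per PSV, with the number of reads supporting
    each copy's allele. PSVs without coverage are skipped. Rounding is naive,
    so the per-copy values are not guaranteed to sum to agg_cn.
    """
    covered = [counts for counts in psv_allele_counts if sum(counts) > 0]
    n_copies = len(covered[0])
    fractions = [
        mean(counts[i] / sum(counts) for counts in covered)
        for i in range(n_copies)
    ]
    return [round(agg_cn * fraction) for fraction in fractions]


# Toy values, not real data: a two-copy repeat in a sample carrying a deletion.
agg = aggregate_copy_number([31.0, 14.0], background_depth=30.0)
per_copy = paralog_specific_copy_number(agg, [(20, 10), (22, 9), (19, 11)])
print(agg, per_copy)
```

With these toy numbers the pooled depth is 1.5 times the diploid background, which gives an aggregate copy number of 3, and roughly two thirds of PSV-supporting reads carry the first copy's allele. A real tool has to deal with unreliable PSVs and sequencing bias, which the abstract says Parascopy addresses by analyzing multiple samples jointly.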