Wadsworth Mark E, Page Madeline L, Aguzzoli Heberle Bernardo, Miller Justin B, Steely Cody, Ebbert Mark T W
Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY.
Department of Neuroscience, College of Medicine, University of Kentucky, Lexington, KY.
bioRxiv. 2025 May 28:2025.05.23.655776. doi: 10.1101/2025.05.23.655776.
Comprehensive genomic analysis is essential for advancing our understanding of human genetics and disease. However, short-read sequencing technologies are inherently limited in their ability to resolve highly repetitive, structurally complex, and low-mappability genomic regions, previously termed "dark" regions. Long-read sequencing technologies, such as PacBio and Oxford Nanopore Technologies (ONT), offer improved resolution of these regions, yet they are not perfect. With the advent of the new Telomere-to-Telomere (T2T) CHM13 reference genome, it is prudent to examine its effect on dark regions. In this study, we systematically analyze dark regions across four human genome references (HG19, HG38 with and without alternate contigs, and CHM13) using both short- and long-read sequencing data. We find that dark regions increase as the reference becomes more complete, especially dark-by-MAPQ regions, but that long-read sequencing significantly reduces the number of dark regions in the genome, particularly within gene bodies. However, we also identify regions that may remain challenging to align with long-read data, such as centromeric regions. These findings highlight the importance of both reference genome selection and sequencing technology choice in achieving a truly comprehensive genomic analysis.