Genetics and Reproductive Biotechnologies, Veterinary Research Institute, Central European Institute of Technology (CEITEC), 621 00 Brno, Czech Republic.
Genes (Basel). 2020 Dec 31;12(1):48. doi: 10.3390/genes12010048.
Ever since the introduction of high-throughput sequencing following the Human Genome Project, assembling short reads into a reference of sufficient quality has posed a significant problem, as a large portion of the human genome (an estimated 50-69%) is repetitive. As a result, a sizable proportion of sequencing reads are multi-mapping, i.e., without a unique placement in the genome. The two key parameters that determine whether a read is multi-mapping are read length and genome complexity. Long reads are now able to span difficult, heterochromatic regions, including full centromeres, and to characterize chromosomes from "telomere to telomere". Moreover, identical reads or repeat arrays can be differentiated based on their epigenetic marks, such as methylation patterns, which aids the assembly process. This is despite the fact that long reads still contain a modest percentage of sequencing errors, which impair aligners and assemblers in both accuracy and speed. Here, I review the proposed and implemented solutions to repeat resolution and the multi-mapping read problem, as well as the downstream consequences of reference choice, repeat masking, and proper representation of sex chromosomes. I also consider the forthcoming challenges and solutions with regard to long reads, where we expect a shift from the problem of repeat localization within a single individual to the problem of repeat positioning within pangenomes.
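As a rough illustration of how read length and genome complexity together determine whether a read is multi-mapping, the following minimal Python sketch counts, for a toy sequence, the fraction of read start positions whose read also occurs elsewhere. It is not a method from the review: the function name `multimapping_fraction` and the toy sequence are hypothetical, and the sketch assumes error-free reads, exact matching, and the forward strand only.

```python
from collections import Counter


def multimapping_fraction(genome: str, read_length: int) -> float:
    """Fraction of start positions whose read occurs more than once.

    Assumptions (illustrative only): reads are error-free substrings,
    matching is exact, and only the forward strand is considered.
    """
    n_positions = len(genome) - read_length + 1
    if read_length <= 0 or n_positions <= 0:
        raise ValueError("read_length must be between 1 and len(genome)")
    # Enumerate every possible read of the requested length.
    reads = [genome[i:i + read_length] for i in range(n_positions)]
    counts = Counter(reads)
    # A read is multi-mapping if the same sequence starts at 2+ positions.
    multi = sum(1 for read in reads if counts[read] > 1)
    return multi / n_positions


# Hypothetical toy genome: two copies of a 12 bp repeat between other sequence.
REPEAT = "ACGTTGCAAGGC"
TOY_GENOME = "TTAGGCATCA" + REPEAT + "GATTACAGGT" + REPEAT + "CCATGTAACG"

if __name__ == "__main__":
    for length in (4, 8, 12, 16, 24):
        fraction = multimapping_fraction(TOY_GENOME, length)
        print(f"read length {length:>2}: {fraction:.2f} multi-mapping")
```

In this toy setting, reads no longer than the repeat unit can fall entirely inside it and therefore lack a unique placement, whereas a read that extends into distinct flanking sequence can be placed unambiguously. Real aligners must additionally handle sequencing errors, both strands, and inexact repeat copies, none of which are modeled here.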