Wen Huaming, Yang Jinbao, Zhao Xianjia, Wang Xingbin, Lei Jiawei, Li Yanchun, Du Wenjie, Li Dongxi, Xu Yun, Lonardi Stefano, Pan Weihua
School of Computer Science and Technology, University of Science and Technology of China, Hefei, 230027, China.
State Key Laboratory of Genome and Multi-Omics Technologies, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518120, China.
Genome Biol. 2025 Jul 28;26(1):227. doi: 10.1186/s13059-025-03685-5.
The highly repetitive content of eukaryotic genomes, including long tandem repeats, segmental duplications, and centromeres, makes haplotype-resolved genome assembly hard. Repeat sequences introduce gaps or mis-joins in the assemblies. We introduce TRFill, a novel algorithm that can close the gaps in a draft chromosome-level assembly using exclusively PacBio HiFi and Hi-C data. Experimental results on human centromeres and tomato subtelomeres show that TRFill can improve the completeness and correctness of about two-thirds of the tandem repeats. We also show that the improved completeness of subtelomeric tandem repeats in the tomato pangenome enables a population-level analysis of these complex repeats.