Shi Ying, Wu Chenxu, Luo Shifu, Zhang Songming, Wang Wenjian, Li Jinyan
School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, Shanxi Province, China.
Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, 518000, Guangdong, China.
Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf430.
Accurate calling of parent-child SNPs and Indels in family trios is very helpful for understanding genetic traits and diseases. Indel calling is even more important than SNP calling, because Indels can cause substantial changes in protein structure that affect more of an organism's traits. However, on $60\times$ ONT Q20 data, the best Indel calling methods achieve recall below 85%, precision below 92%, and F1 below 88%, far lower than their SNP calling performance (recall 99.87%, precision 99.86%, F1 99.86%). The difficulties in Indel calling include distinguishing sequencing errors from genuine Indels and optimizing the Mendelian genetic model. This work proposes sparse attention learning for high-performance Indel calling from family-trio ONT long-read sequencing data, while maintaining exceptional performance on SNP calling. The key steps are a sparsely connected attention network that converts fully aligned data cubes into essential features, followed by deep learning on these features with ResNet and 3D convolutional blocks to enable accurate detection of family-trio variants. The attention network is a dual attention network that aggregates both channel and spatial information and can select sub-cubes of critical channels and base locations that are resistant to the confounding effects of sequencing errors. Compared with the current best-performing trio variant detection method, our method achieves an F1 that is 5.6%-14.19% higher, recall that is 7.07%-18.67% higher, and precision that is 3.85%-7.87% higher on ONT Q20 datasets. Case studies of Indel-dense regions in chromosome 20, including the centromere and disease-associated genes, demonstrate the significant impact of Indel variations on disease pathogenesis, providing novel perspectives for future personalized and targeted therapies.
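The idea of a dual attention step that picks out sub-cubes of critical channels and base locations can be illustrated with a minimal NumPy sketch. This is not the authors' network. The cube layout `(channels, reads, positions)`, the parameter-free sigmoid gates (a trained model would learn these weights), and the names `dual_attention`, `k_channels`, and `k_positions` are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Logistic gate mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def dual_attention(cube, k_channels, k_positions):
    """Toy channel + spatial attention with top-k (sparse) selection.

    cube: array of shape (channels, reads, positions); this layout is an
    assumption for illustration, not the layout used in the paper.
    Returns the re-weighted sub-cube and the kept channel/position indices.
    """
    # Channel attention: squeeze each channel to its mean, then gate it.
    channel_scores = sigmoid(cube.mean(axis=(1, 2)))    # shape (C,)

    # Spatial attention: pool over channels and reads for each position.
    position_scores = sigmoid(cube.mean(axis=(0, 1)))   # shape (P,)

    # Re-weight the cube by both attention maps via broadcasting.
    weighted = (cube
                * channel_scores[:, None, None]
                * position_scores[None, None, :])

    # Sparse selection: keep only the highest-scoring channels and positions.
    top_channels = np.sort(np.argsort(channel_scores)[-k_channels:])
    top_positions = np.sort(np.argsort(position_scores)[-k_positions:])

    reads = np.arange(cube.shape[1])
    sub_cube = weighted[np.ix_(top_channels, reads, top_positions)]
    return sub_cube, top_channels, top_positions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_cube = rng.normal(size=(6, 10, 33))  # hypothetical cube dimensions
    sub, channels, positions = dual_attention(demo_cube, 3, 11)
    print(sub.shape, channels, positions)
```

In this sketch the selected sub-cube is what a downstream classifier (ResNet and 3D convolutional blocks in the abstract's description) would consume; the selection criterion here is plain mean pooling, chosen only to keep the example short.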