Faculty of Engineering, Kyoto University, Kyoto, Japan.
Advanced Data Science Project, RIKEN Information R&D and Strategy Headquarters, RIKEN, Tokyo, Japan.
J Hum Genet. 2024 Oct;69(10):533-540. doi: 10.1038/s10038-024-01278-x. Epub 2024 Aug 2.
Human leukocyte antigen (HLA) genes are associated with a variety of diseases, yet direct typing of HLA alleles is both time-consuming and costly. Consequently, various imputation methods that use sequential single nucleotide polymorphism (SNP) data have been proposed, employing either statistical or deep learning models, such as the convolutional neural network (CNN)-based model DEEPHLA. However, these methods show limited imputation accuracy for infrequent alleles and require a large reference dataset. In this context, we have developed a Transformer-based model for HLA allele imputation, named "HLA Reliable IMputatioN by Transformer (HLARIMNT)", designed to exploit the sequential nature of SNP data. We evaluated HLARIMNT's performance using two distinct reference panels, the Pan-Asian reference panel (n = 530) and the Type 1 Diabetes Genetics Consortium (T1DGC) reference panel (n = 5225), alongside a combined panel (n = 1060). HLARIMNT demonstrated superior accuracy to DEEPHLA across several indices, particularly for infrequent alleles. Furthermore, we explored the impact of varying training data sizes on imputation accuracy, finding that HLARIMNT consistently outperformed DEEPHLA across all data sizes. These findings suggest that Transformer-based models can efficiently impute not only HLA types but potentially other gene types from sequential SNP data.