Efficient HLA imputation from sequential SNPs data by transformer.

Affiliations

Faculty of Engineering, Kyoto University, Kyoto, Japan.

Advanced Data Science Project, RIKEN Information R&D and Strategy Headquarters, RIKEN, Tokyo, Japan.

Publication information

J Hum Genet. 2024 Oct;69(10):533-540. doi: 10.1038/s10038-024-01278-x. Epub 2024 Aug 2.

Abstract

Human leukocyte antigen (HLA) genes are associated with a variety of diseases, yet direct typing of HLA alleles is both time-consuming and costly. Consequently, various imputation methods that leverage sequential single nucleotide polymorphism (SNP) data have been proposed, employing either statistical or deep learning models, such as the convolutional neural network (CNN)-based model DEEPHLA. However, these methods show limited imputation accuracy for infrequent alleles and require a large reference dataset. In this context, we have developed a Transformer-based model for HLA allele imputation, named "HLA Reliable IMputatioN by Transformer (HLARIMNT)", designed to exploit the sequential nature of SNP data. We evaluated HLARIMNT's performance using two distinct reference panels, the Pan-Asian reference panel (n = 530) and the Type 1 Diabetes Genetics Consortium (T1DGC) reference panel (n = 5225), alongside a combined panel (n = 1060). HLARIMNT demonstrated higher accuracy than DEEPHLA across several indices, particularly for infrequent alleles. Furthermore, we explored the impact of varying training data sizes on imputation accuracy and found that HLARIMNT consistently outperformed DEEPHLA across all data sizes. These findings suggest that Transformer-based models can efficiently impute not only HLA types but potentially other gene types from sequential SNP data.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/99ef/11422163/905debf9b94a/10038_2024_1278_Fig1_HTML.jpg
