Jónsson Benedikt A, Halldórsson Gísli H, Árdal Steinþór, Rögnvaldsson Sölvi, Einarsson Eyþór, Sulem Patrick, Guðbjartsson Daníel F, Melsted Páll, Stefánsson Kári, Úlfarsson Magnús Ö
deCODE Genetics/Amgen Inc., Reykjavik, Iceland.
University of Iceland, Reykjavik, Iceland.
Commun Biol. 2024 Dec 4;7(1):1616. doi: 10.1038/s42003-024-07298-9.
Mutations that affect RNA splicing significantly impact human diversity and disease. Here we present a method using transformers, a type of machine learning model, to detect splicing from raw 45,000-nucleotide sequences. We generate embeddings with residual neural networks and apply hard attention to select splice site candidates, enabling efficient training on long sequences. Our method surpasses the leading tool, SpliceAI, in detecting splice sites in GENCODE and ENSEMBL annotations. Using extensive RNA sequencing data from an Icelandic cohort of 17,848 individuals and the Genotype-Tissue Expression (GTEx) project, our method demonstrates superior performance in detecting splice junctions compared to SpliceAI-10k (PR-AUC = 0.834 vs. PR-AUC = 0.820) and is more effective at identifying disease-related splice variants in ClinVar (PR-AUC = 0.997 vs. PR-AUC = 0.996). These advancements hold promise for improving genetic research and clinical diagnostics, potentially leading to better understanding and treatment of splicing-related diseases.
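The abstract says hard attention selects splice site candidates so that training on long sequences stays efficient. The following is a minimal sketch of that general idea, not the authors' implementation: it treats hard attention as top-k selection over per-position scores, which is one common way to realize it. The function names, the random scores, and the candidate budget `k = 512` are all hypothetical; only the 45,000-nucleotide input length comes from the abstract.

```python
import heapq
import random


def select_candidates(scores, k):
    """Return the indices of the k highest-scoring positions, in sequence order.

    Hypothetical helper: stands in for a hard-attention step that keeps
    only a few candidate positions and discards the rest.
    """
    k = min(k, len(scores))
    # nlargest picks the k best positions without sorting the whole sequence
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    # restore left-to-right order so later layers see positions in sequence order
    return sorted(top)


def attention_pairs(n):
    """Number of pairwise interactions in full self-attention over n positions."""
    return n * n


if __name__ == "__main__":
    random.seed(0)
    seq_len = 45_000  # input length stated in the abstract
    k = 512           # assumed candidate budget, not taken from the paper

    # Placeholder scores; in a real model these would come from the embeddings.
    scores = [random.random() for _ in range(seq_len)]
    candidates = select_candidates(scores, k)

    print(len(candidates), "candidate positions kept out of", seq_len)
    print("pairwise attention cost, full sequence:", attention_pairs(seq_len))
    print("pairwise attention cost, candidates only:", attention_pairs(len(candidates)))
```

Under these assumptions, self-attention runs over the selected candidates only, so its cost grows with the square of `k` rather than the square of the sequence length.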