Reyes Diego Machado, Kim Mansu, Chao Hanqing, Hahn Juergen, Shen Li, Yan Pingkun
Dept. of Biomedical Engineering, Rensselaer Polytechnic Institute, Troy, New York, USA.
Dept. of Artificial Intelligence, Catholic University of Korea, Bucheon, Republic of Korea.
IEEE EMBS Int Conf Biomed Health Inform. 2022 Sep;2022. doi: 10.1109/bhi56158.2022.9926815. Epub 2022 Nov 4.
Parkinson's disease (PD) is the second most common neurodegenerative disease. It has a complex etiology involving genomic and environmental factors, and no recognized cure. Genotype data, such as single nucleotide polymorphisms (SNPs), could be used as a prodromal factor for early detection of PD. However, PD is polygenic, and the complex relationships among SNPs that contribute to disease development are difficult to model. Traditional assessment methods, such as polygenic risk scores and machine learning approaches, struggle to capture the complex interactions present in genotype data, which limits their discriminative power in diagnosis. Deep learning models are better suited to this task but bring difficulties of their own, such as a lack of interpretability. To overcome these limitations, this work introduces a novel transformer encoder-based model that classifies PD patients versus healthy controls from their genotype. The method is designed to model complex global feature interactions effectively and to improve interpretability through the learned attention scores. The proposed framework outperformed traditional machine learning models and multilayer perceptron (MLP) baseline models. Moreover, visualizing the learned SNP-SNP associations not only makes the model interpretable but also offers valuable insights into the biochemical pathways underlying PD development, which are corroborated by pathway enrichment analysis. Our results suggest novel SNP interactions that merit further study in wet-lab and clinical settings.
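To make the link between attention scores and SNP-SNP associations concrete, the following is a minimal sketch of single-head scaled dot-product self-attention over one individual's genotype. It is not the authors' model: the abstract does not specify the architecture. The 0/1/2 allele-count encoding, the per-position embedding table, the dimension `dim`, and the function name `snp_attention` are all assumptions made for illustration. The weights are random, so the resulting matrix carries no biological meaning.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng):
    """Random (untrained) weights; a real model would learn these."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def snp_attention(genotypes, dim=4, seed=0):
    """Return an n_snps x n_snps attention matrix for one individual.

    genotypes: list of allele counts (0, 1 or 2), one per SNP (assumed encoding).
    Entry [i][j] is how strongly SNP i attends to SNP j.
    """
    rng = random.Random(seed)
    n_snps = len(genotypes)
    # Hypothetical embedding table: one vector per (SNP position, allele count).
    embed = [random_matrix(3, dim, rng) for _ in range(n_snps)]
    tokens = [embed[i][g] for i, g in enumerate(genotypes)]
    # Query and key projections of the SNP tokens.
    queries = matmul(tokens, random_matrix(dim, dim, rng))
    keys = matmul(tokens, random_matrix(dim, dim, rng))
    scale = math.sqrt(dim)
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in keys]
        for q_row in queries
    ]
    # Row-wise softmax turns raw scores into attention weights.
    return [softmax(row) for row in scores]


attention = snp_attention([0, 1, 2, 1, 0])
```

Each row of `attention` sums to 1. In a trained model, a matrix of this shape is what would be visualized as a SNP-SNP association map.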