Carrion Jackson, Nandakumar Rohit, Shi Xiaojian, Gu Haiwei, Kim Yookyung, Raskind Wendy H, Peter Beate, Dinu Valentin
College of Health Solutions, Arizona State University, Phoenix, AZ 85004.
Cellular and Molecular Physiology Department, Yale School of Medicine, New Haven, CT 06510.
bioRxiv. 2023 Feb 27:2023.02.27.530280. doi: 10.1101/2023.02.27.530280.
This exploratory study tested and validated the use of data fusion and machine learning techniques to probe high-throughput omics and clinical data, with the goal of exploring the etiology of developmental dyslexia (DD). Developmental dyslexia is the leading learning disability in school-aged children, affecting roughly 5-10% of the US population. The complex biological and neurological phenotype of this life-altering disability complicates its diagnosis. Phenome, exome, and metabolome data were collected, allowing us to explore this system from a behavioral, cellular, and molecular point of view. This study provides a proof of concept showing that data fusion and ensemble learning techniques can outperform traditional machine learning techniques when provided with small and complex multi-omics and clinical datasets. Heterogeneous stacking classifiers consisting of single-omic experts/models achieved an accuracy of 86%, an F1 score of 0.89, and an AUC value of 0.83. Ensemble methods also provided a ranked list of important features, which suggests that exome single nucleotide polymorphisms found in the thalamus and cerebellum heavily influenced the classification of DD within our machine learning models and could be potential biomarkers for developmental dyslexia.
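The heterogeneous stacking design described in the abstract can be illustrated with a minimal sketch. The abstract does not name the software, base learners, or meta-learner, so everything here is an assumption: scikit-learn, the three base models, the logistic-regression meta-learner, the block sizes, and the synthetic data that stands in for the phenome, exome, and metabolome tables. The sketch shows only the general pattern, in which each "expert" sees a single omic block and a meta-learner combines the experts' out-of-fold predictions. It is not the study's pipeline and does not reproduce its results.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 120 subjects, binary case/control label.
rng = np.random.default_rng(0)
n = 120
y = rng.integers(0, 2, size=n)
phenome = rng.normal(size=(n, 5)) + 0.8 * y[:, None]       # behavioral scores
exome = rng.integers(0, 3, size=(n, 20)).astype(float)     # 0/1/2 genotype codes
exome[:, :3] += y[:, None]                                 # a few informative variants
metabolome = rng.normal(size=(n, 10)) + 0.5 * y[:, None]   # metabolite levels

# Early fusion by concatenation; each expert later selects its own block.
X = np.hstack([phenome, exome, metabolome])
blocks = {
    "phenome": slice(0, 5),
    "exome": slice(5, 25),
    "metabolome": slice(25, 35),
}


def expert(columns, model):
    """Pipeline that scales one omic block (dropping all other columns) and fits a model."""
    select_block = ColumnTransformer([("block", StandardScaler(), columns)])
    return make_pipeline(select_block, model)


# Heterogeneous single-omic experts: a different model family per block.
estimators = [
    ("phenome", expert(blocks["phenome"], LogisticRegression(max_iter=1000))),
    ("exome", expert(blocks["exome"],
                     RandomForestClassifier(n_estimators=100, random_state=0))),
    ("metabolome", expert(blocks["metabolome"], KNeighborsClassifier(n_neighbors=5))),
]

# The meta-learner is trained on cross-validated expert predictions.
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
stack.fit(X_train, y_train)

pred = stack.predict(X_test)
proba = stack.predict_proba(X_test)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "auc": roc_auc_score(y_test, proba),
}
print(metrics)
```

With cohorts this small, a single train/test split gives noisy estimates; repeated or nested cross-validation would be a more cautious way to report accuracy, F1, and AUC.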