Centre for Pattern Recognition and Data Analytics (PRaDA), Deakin University, Geelong, Victoria, Australia.
Centre for Molecular and Medical Research, Deakin University, Geelong, Victoria, Australia.
Am J Med Genet B Neuropsychiatr Genet. 2019 Oct;180(7):508-518. doi: 10.1002/ajmg.b.32727. Epub 2019 Apr 25.
Although neuropsychiatric disorders have an established genetic background, their molecular foundations remain elusive. This has prompted many investigators to search for explanatory biomarkers that can predict clinical outcomes. One approach uses machine learning to classify patients based on blood mRNA expression. However, these endeavors typically fail to achieve the high level of performance, stability, and generalizability required for clinical translation. Moreover, these classifiers can lack interpretability because not all genes have relevance to researchers. For this study, we hypothesized that annotation-based classifiers can improve classification performance, stability, generalizability, and interpretability. To this end, we evaluated the models of four classification algorithms on six neuropsychiatric data sets using four annotation databases. Our results suggest that the Gene Ontology Biological Process database can transform gene expression into an annotation-based feature space that is accurate and stable. We also show how annotation features can improve the interpretability of classifiers: as annotations are used to assign biological importance to genes, the biological importance of annotation-based features is the features themselves. In evaluating the annotation features, we find that top-ranked annotations tend to contain top-ranked genes, suggesting that the most predictive annotations are a superset of the most predictive genes. Based on this, and the fact that annotations are used routinely to assign biological importance to genetic data, we recommend transforming gene-level expression into annotation-level expression prior to the classification of neuropsychiatric conditions.
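The recommended transformation from gene-level to annotation-level expression can be illustrated with a minimal sketch. The abstract does not state how genes are aggregated within an annotation, so averaging the member genes is an assumption here, and the function name `annotate_expression`, the gene names, and the annotation terms are all hypothetical.

```python
import numpy as np

def annotate_expression(expr, genes, annotations):
    """Collapse a samples x genes matrix into a samples x annotations matrix.

    expr        : 2-D array-like, rows are samples, columns are genes
    genes       : list of gene names, one per column of expr
    annotations : dict mapping annotation name -> iterable of member gene names

    Assumption of this sketch: an annotation's value for a sample is the
    mean expression of its member genes that were actually measured.
    """
    expr = np.asarray(expr, dtype=float)
    column_of = {gene: i for i, gene in enumerate(genes)}
    names, features = [], []
    for name, members in annotations.items():
        idx = [column_of[g] for g in members if g in column_of]
        if not idx:
            # Skip annotations with no measured member genes.
            continue
        names.append(name)
        features.append(expr[:, idx].mean(axis=1))
    if not features:
        return names, np.empty((expr.shape[0], 0))
    return names, np.column_stack(features)

# Toy data (hypothetical): 2 samples, 3 genes, 3 annotation terms.
genes = ["G1", "G2", "G3"]
expr = [[1.0, 3.0, 5.0],
        [2.0, 4.0, 6.0]]
annotations = {
    "term_A": ["G1", "G2"],
    "term_B": ["G2", "G3", "G9"],  # G9 is not measured and is ignored
    "term_C": ["G9"],              # no measured genes, so the term is dropped
}
names, X = annotate_expression(expr, genes, annotations)
```

The resulting matrix `X` has one column per retained annotation and could be passed to any classifier in place of the gene-level matrix. Because each feature is itself a named annotation, feature rankings from the classifier can be read directly as rankings of biological terms.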