Zhao Jonathan Z L, Mucaki Eliseos J, Rogan Peter K
Department of Biochemistry, Schulich School of Medicine and Dentistry, Western University, London, ON, N6A 2C1, Canada.
Department of Computer Science, Faculty of Science, Western University, London, ON, N6A 2C1, Canada.
F1000Res. 2018 Feb 27;7:233. doi: 10.12688/f1000research.14048.2. eCollection 2018.
Gene signatures derived from transcriptomic data using machine learning methods have shown promise for biodosimetry testing. However, these signatures may not be sufficiently robust for large-scale testing, because their performance has not been adequately validated on external, independent datasets. The present study develops human and murine signatures with biochemically-inspired machine learning and validates them strictly using k-fold and traditional approaches. Gene Expression Omnibus (GEO) datasets of exposed human and murine lymphocytes were preprocessed via nearest neighbor imputation, and the expression of genes implicated in the literature as responsive to radiation exposure (n=998) was then ranked by Minimum Redundancy Maximum Relevance (mRMR). Optimal signatures were derived by backward, complete, and forward sequential feature selection using Support Vector Machines (SVM), and validated using k-fold or traditional validation on independent datasets. The best human signatures we derived exhibit k-fold validation accuracies of up to 98% (a five-gene signature) when validated over 209 samples and traditional validation accuracies of up to 92% (a ten-gene signature) when validated over 85 samples. Some human signatures are specific enough to differentiate between chemotherapy and radiotherapy. Certain multi-class murine signatures have sufficient granularity in dose estimation to inform eligibility for cytokine therapy (assuming these signatures could be translated to humans). We compiled a list of the genes that appear most frequently in the top 20 human and mouse signatures. Genes that appear more frequently across an ensemble of signatures may have a greater impact on the performance of individual signatures. Several genes in the signatures we derived are also present in previously proposed signatures.
Gene signatures for ionizing radiation exposure derived by machine learning have low error rates in externally validated, independent datasets, and exhibit high specificity and granularity for dose estimation.
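The workflow summarized above (nearest neighbor imputation, mRMR ranking, sequential feature selection with an SVM, then k-fold and independent-set validation) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes scikit-learn, runs on synthetic data rather than GEO datasets, and uses a simplified mRMR-style ranking (mutual information for relevance, mean absolute Pearson correlation for redundancy) as a stand-in. The names `make_toy_expression`, `mrmr_rank`, and `derive_signature`, and all parameter values, are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector, mutual_info_classif
from sklearn.impute import KNNImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_toy_expression(n_samples=80, n_genes=30, seed=0):
    """Synthetic expression matrix (assumption): genes 0-2 shift with exposure."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_samples // 2)      # 0 = unexposed, 1 = exposed
    X = rng.normal(size=(n_samples, n_genes))
    X[:, :3] += 2.0 * y[:, None]               # informative genes
    X[rng.random(X.shape) < 0.05] = np.nan     # about 5% missing values
    return X, y


def mrmr_rank(X, y, n_keep, seed=0):
    """Greedy mRMR-style ranking (simplified stand-in, not the published mRMR):
    relevance = mutual information with the label,
    redundancy = mean |Pearson r| with genes already selected."""
    relevance = mutual_info_classif(X, y, random_state=seed)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_keep:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        scores = [relevance[j] - corr[j, selected].mean() for j in candidates]
        selected.append(candidates[int(np.argmax(scores))])
    return selected


def derive_signature(X_train, y_train, n_ranked=10, n_signature=3):
    """Impute, rank genes, then run forward sequential selection with an SVM."""
    imputer = KNNImputer(n_neighbors=5)
    X_imp = imputer.fit_transform(X_train)
    ranked = mrmr_rank(X_imp, y_train, n_ranked)
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    sfs = SequentialFeatureSelector(
        svm, n_features_to_select=n_signature, direction="forward", cv=5
    )
    sfs.fit(X_imp[:, ranked], y_train)
    signature = [ranked[i] for i in np.flatnonzero(sfs.get_support())]
    return imputer, signature


# Training set and a separately generated set standing in for an external dataset.
X_train, y_train = make_toy_expression(seed=0)
X_ext, y_ext = make_toy_expression(seed=1)

imputer, signature = derive_signature(X_train, y_train)
X_sig = imputer.transform(X_train)[:, signature]

# k-fold validation on the training data, restricted to the signature genes.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
kfold_acc = cross_val_score(model, X_sig, y_train, cv=folds).mean()

# "Traditional" validation: fit on the training set, score on the held-out set.
model.fit(X_sig, y_train)
ext_acc = model.score(imputer.transform(X_ext)[:, signature], y_ext)

print("signature gene indices:", signature)
print("k-fold accuracy:", round(kfold_acc, 3))
print("held-out accuracy:", round(ext_acc, 3))
```

In this sketch the ranking and feature selection see all of the training data before the k-fold estimate is computed, so that estimate can be optimistic. Scoring on a separately generated set is the check that corresponds to validation on an independent dataset. Setting `direction="backward"` switches the selector to backward sequential selection.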