RILITE Research Institute and AMPEL BioSolutions, 250 W Main St, Ste 300, Charlottesville, VA, 22902, USA.
Department of Physics, George Washington University, Washington, DC, 20052, USA.
Sci Rep. 2019 Jul 3;9(1):9617. doi: 10.1038/s41598-019-45989-0.
Integrating gene expression data to predict systemic lupus erythematosus (SLE) disease activity is a significant challenge because of the high degree of heterogeneity among patients and study cohorts, especially cohorts profiled on different microarray platforms. Here we deployed machine learning approaches to integrate gene expression data from three SLE data sets and used the integrated data to classify patients as having active or inactive disease, as characterized by standard clinical composite outcome measures. Both raw whole blood gene expression data and informative gene modules generated by Weighted Gene Co-expression Network Analysis from purified leukocyte populations were employed with various classification algorithms. Classifiers were evaluated either by 10-fold cross-validation across the three combined data sets or by training and testing in independent data sets; the latter approach amplified the effects of technical variation. A random forest classifier achieved a peak classification accuracy of 83 percent under 10-fold cross-validation, but its performance could be severely affected by technical variation among data sets. The use of gene modules rather than raw gene expression was more robust, achieving classification accuracies of approximately 70 percent regardless of how the training and testing sets were formed. Fine-tuning the algorithms and parameter sets may generate sufficient accuracy to be informative as a standalone estimate of disease activity.
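The two evaluation schemes contrasted above (10-fold cross-validation on pooled cohorts versus training and testing on separate cohorts) can be sketched as follows. This is a minimal illustration, not the study's pipeline. The data are synthetic, the per-cohort `batch_shift` offset is a hypothetical stand-in for platform-specific technical variation, a nearest-centroid rule is swapped in for the random forest so that the example needs only the Python standard library, and the leave-one-cohort-out loop is one assumed way of forming independent training and testing sets. All function names are illustrative.

```python
import random
from statistics import mean

def make_cohort(n, batch_shift, rng):
    """Synthetic cohort of n patients with 2 features each.
    Label 1 = active disease, 0 = inactive. batch_shift is a hypothetical
    platform offset added to every feature (an assumption for illustration)."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(label * 1.5 + batch_shift, 1.0) for _ in range(2)]
        rows.append((x, label))
    return rows

def fit_centroids(train):
    """Nearest-centroid classifier: a simple stand-in for a random forest."""
    cents = {}
    for label in (0, 1):
        pts = [x for x, y in train if y == label]
        cents[label] = [mean(col) for col in zip(*pts)]
    return cents

def predict(cents, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(cents, key=lambda lab: dist2(cents[lab]))

def accuracy(cents, test):
    return mean(1.0 if predict(cents, x) == y else 0.0 for x, y in test)

def kfold_accuracy(rows, k, rng):
    """Scheme 1: shuffle the pooled cohorts, then k-fold cross-validate.
    Every fold mixes patients from all cohorts."""
    rows = rows[:]
    rng.shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        scores.append(accuracy(fit_centroids(train), folds[i]))
    return mean(scores)

def cross_cohort_accuracy(cohorts):
    """Scheme 2: hold out one whole cohort at a time and train on the rest,
    so the test cohort's technical offset is never seen during training."""
    scores = []
    for i, test in enumerate(cohorts):
        train = [r for j, c in enumerate(cohorts) if j != i for r in c]
        scores.append(accuracy(fit_centroids(train), test))
    return mean(scores)

rng = random.Random(0)
cohorts = [make_cohort(60, shift, rng) for shift in (0.0, 1.0, -1.0)]
pooled = [r for c in cohorts for r in c]

cv_acc = kfold_accuracy(pooled, 10, rng)
cc_acc = cross_cohort_accuracy(cohorts)
print(f"pooled 10-fold CV accuracy: {cv_acc:.2f}")
print(f"leave-one-cohort-out accuracy: {cc_acc:.2f}")
```

Comparing the two printed accuracies for different `batch_shift` values gives a rough feel for how cohort-level offsets can affect a classifier that is tested on a cohort it never saw during training.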