Parkinson Edward, Liberatore Federico, Watkins W John, Andrews Robert, Edkins Sarah, Hibbert Julie, Strunk Tobias, Currie Andrew, Ghazal Peter
Department of Computer Science and Informatics, Cardiff University, Cardiff, United Kingdom.
Project Sepsis, Systems Immunity Research Institute, Cardiff University, Cardiff, United Kingdom.
Front Genet. 2023 Apr 11;14:1158352. doi: 10.3389/fgene.2023.1158352. eCollection 2023.
Machine learning (ML) algorithms are powerful tools that are increasingly being used for sepsis biomarker discovery in RNA-Seq data. RNA-Seq datasets contain multiple sources and types of noise (operator, technical and non-systematic) that may bias ML classification. Normalisation and independent gene filtering approaches described in RNA-Seq workflows account for some of this variability, but are typically targeted only at differential expression analysis rather than ML applications. Pre-processing normalisation steps significantly reduce the number of variables in the data and thereby increase the power of statistical testing, but can potentially discard valuable and insightful classification features. The effect of transcript-level filtering on the robustness and stability of ML-based RNA-Seq classification has yet to be systematically assessed. In this report we examine the impact of filtering out low-count transcripts and those with influential outlier read counts on downstream ML analysis for sepsis biomarker discovery using elastic net regularised logistic regression, L1-regularised support vector machines and random forests. We demonstrate that applying a systematic, objective strategy for removing uninformative and potentially biasing biomarkers, which represent up to 60% of transcripts in datasets of different sample sizes, including two illustrative neonatal sepsis cohorts, leads to substantial improvements in classification performance, higher stability of the resulting gene signatures, and better agreement with previously reported sepsis biomarkers. We also demonstrate that the performance uplift from gene filtering depends on the ML classifier chosen, with L1-regularised support vector machines showing the greatest performance improvements with our experimental data.
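As a rough illustration of the kind of transcript-level filtering described above, the following Python sketch drops low-count transcripts and transcripts whose reads are dominated by a single sample before any classifier is trained. It is not the authors' procedure: the function name `filter_transcripts`, the thresholds `min_count`, `min_samples` and `max_share`, the single-sample share heuristic for outliers, and the toy transcript IDs are all assumptions made for this example.

```python
from typing import Dict, List


def filter_transcripts(
    counts: Dict[str, List[int]],
    min_count: int = 10,     # assumed threshold, not from the paper
    min_samples: int = 3,    # assumed threshold, not from the paper
    max_share: float = 0.9,  # assumed outlier heuristic, not from the paper
) -> Dict[str, List[int]]:
    """Return only the transcripts that pass both illustrative filters."""
    kept: Dict[str, List[int]] = {}
    for transcript, reads in counts.items():
        # Low-count filter: at least min_samples samples must reach min_count reads.
        n_expressed = sum(1 for r in reads if r >= min_count)
        if n_expressed < min_samples:
            continue
        # Outlier filter: drop the transcript if a single sample holds more
        # than max_share of all reads (a simple stand-in for a formal
        # influential-outlier test).
        total = sum(reads)
        if total == 0 or max(reads) / total > max_share:
            continue
        kept[transcript] = reads
    return kept


# Toy count matrix with hypothetical transcript IDs and six samples each.
toy_counts = {
    "tx_stable": [120, 98, 143, 110, 87, 131],
    "tx_low": [0, 2, 1, 0, 3, 1],
    "tx_outlier": [12, 10, 11, 5000, 14, 10],
}

filtered = filter_transcripts(toy_counts)
print(sorted(filtered))  # ['tx_stable']
```

In a full workflow, the filtered matrix would then be passed to the classifiers named in the abstract; in practice the thresholds would need to be chosen objectively from the data rather than fixed by hand as they are here.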