Gene filtering strategies for machine learning guided biomarker discovery using neonatal sepsis RNA-seq data.

Authors

Parkinson Edward, Liberatore Federico, Watkins W John, Andrews Robert, Edkins Sarah, Hibbert Julie, Strunk Tobias, Currie Andrew, Ghazal Peter

Affiliations

Department of Computer Science and Informatics, Cardiff University, Cardiff, United Kingdom.

Project Sepsis, Systems Immunity Research Institute, Cardiff University, Cardiff, United Kingdom.

Publication information

Front Genet. 2023 Apr 11;14:1158352. doi: 10.3389/fgene.2023.1158352. eCollection 2023.

DOI: 10.3389/fgene.2023.1158352
PMID: 37113992
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10126415/
Abstract

Machine learning (ML) algorithms are powerful tools that are increasingly being used for sepsis biomarker discovery in RNA-Seq data. RNA-Seq datasets contain multiple sources and types of noise (operator, technical and non-systematic) that may bias ML classification. Normalisation and independent gene filtering approaches described in RNA-Seq workflows account for some of this variability and are typically only targeted at differential expression analysis rather than ML applications. Pre-processing normalisation steps significantly reduce the number of variables in the data and thereby increase the power of statistical testing, but can potentially discard valuable and insightful classification features. A systematic assessment of applying transcript-level filtering on the robustness and stability of ML-based RNA-Seq classification remains to be fully explored. In this report we examine the impact of filtering out low-count transcripts and those with influential outlier read counts on downstream ML analysis for sepsis biomarker discovery using elastic net regularised logistic regression, L1-regularised support vector machines and random forests. We demonstrate that applying a systematic objective strategy for removal of uninformative and potentially biasing biomarkers representing up to 60% of transcripts in different sample size datasets, including two illustrative neonatal sepsis cohorts, leads to substantial improvements in classification performance, higher stability of the resulting gene signatures, and better agreement with previously reported sepsis biomarkers. We also demonstrate that the performance uplift from gene filtering depends on the ML classifier chosen, with L1-regularised support vector machines showing the greatest performance improvements with our experimental data.
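
The two filtering steps named in the abstract (removing low-count transcripts and removing transcripts with influential outlier read counts) can be illustrated with a minimal sketch. This is not the authors' procedure: the function names, the thresholds (`min_count`, `min_samples`, `max_ratio`), the simple max-versus-median outlier rule, and the toy gene names and counts are all assumptions chosen for illustration only.

```python
from statistics import median


def filter_low_count_genes(counts, min_count=10, min_samples=2):
    """Keep genes with at least `min_count` reads in at least `min_samples` samples.

    `counts` maps a gene name to a list of raw read counts (one per sample).
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    return {
        gene: values
        for gene, values in counts.items()
        if sum(v >= min_count for v in values) >= min_samples
    }


def filter_outlier_genes(counts, max_ratio=20.0):
    """Drop genes whose largest count exceeds max_ratio * (median + 1).

    This is a crude stand-in for an influential-outlier test, used here
    only to show where such a step sits in the pipeline.
    """
    return {
        gene: values
        for gene, values in counts.items()
        if max(values) <= max_ratio * (median(values) + 1)
    }


# Hypothetical toy data: four genes measured in four samples.
toy_counts = {
    "GENE_A": [120, 98, 143, 110],  # consistently expressed
    "GENE_B": [0, 1, 0, 2],         # low count in every sample
    "GENE_C": [15, 12, 900, 14],    # one sample dominates
    "GENE_D": [30, 0, 25, 41],      # expressed in most samples
}

filtered = filter_outlier_genes(filter_low_count_genes(toy_counts))
print(sorted(filtered))
```

Only the genes that survive both filters would then be passed to a classifier such as the elastic net, L1-regularised SVM, or random forest models mentioned in the abstract.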

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7afd/10126415/a34756843c36/fgene-14-1158352-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7afd/10126415/58b92d0e3bd5/fgene-14-1158352-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7afd/10126415/f5cd08282615/fgene-14-1158352-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7afd/10126415/83ea8f40f3b3/fgene-14-1158352-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7afd/10126415/1e17e683d4a9/fgene-14-1158352-g005.jpg

Similar articles

1
Gene filtering strategies for machine learning guided biomarker discovery using neonatal sepsis RNA-seq data.
Front Genet. 2023 Apr 11;14:1158352. doi: 10.3389/fgene.2023.1158352. eCollection 2023.
2
A radiomics approach to assess tumour-infiltrating CD8 cells and response to anti-PD-1 or anti-PD-L1 immunotherapy: an imaging biomarker, retrospective multicohort study.
Lancet Oncol. 2018 Sep;19(9):1180-1191. doi: 10.1016/S1470-2045(18)30413-3. Epub 2018 Aug 14.
3
Immune cell type signature discovery and random forest classification for analysis of single cell gene expression datasets.
Front Immunol. 2023 Aug 4;14:1194745. doi: 10.3389/fimmu.2023.1194745. eCollection 2023.
4
Machine learning for cell type classification from single nucleus RNA sequencing data.
PLoS One. 2022 Sep 23;17(9):e0275070. doi: 10.1371/journal.pone.0275070. eCollection 2022.
5
voomDDA: discovery of diagnostic biomarkers and classification of RNA-seq data.
PeerJ. 2017 Oct 6;5:e3890. doi: 10.7717/peerj.3890. eCollection 2017.
6
Optimal Gene Filtering for Single-Cell data (OGFSC)-a gene filtering algorithm for single-cell RNA-seq data.
Bioinformatics. 2019 Aug 1;35(15):2602-2609. doi: 10.1093/bioinformatics/bty1016.
7
Assessing the complementary information from an increased number of biologically relevant features in liquid biopsy-derived RNA-Seq data.
Heliyon. 2024 Mar 12;10(6):e27360. doi: 10.1016/j.heliyon.2024.e27360. eCollection 2024 Mar 30.
8
Differentiating between liver diseases by applying multiclass machine learning approaches to transcriptomics of liver tissue or blood-based samples.
JHEP Rep. 2022 Aug 18;4(10):100560. doi: 10.1016/j.jhepr.2022.100560. eCollection 2022 Oct.
9
The Vacc-SeqQC project: Benchmarking RNA-Seq for clinical vaccine studies.
Front Immunol. 2023 Jan 19;13:1093242. doi: 10.3389/fimmu.2022.1093242. eCollection 2022.
10
GeneSelectML: a comprehensive way of gene selection for RNA-Seq data via machine learning algorithms.
Med Biol Eng Comput. 2023 Jan;61(1):229-241. doi: 10.1007/s11517-022-02695-w. Epub 2022 Nov 10.

Cited by

1
Constructing a predictive model for early-onset sepsis in neonatal intensive care unit newborns based on SHapley Additive exPlanations explainable machine learning.
Transl Pediatr. 2024 Nov 30;13(11):1933-1946. doi: 10.21037/tp-24-278. Epub 2024 Nov 26.
2
Decoding Sepsis-Induced Disseminated Intravascular Coagulation: A Comprehensive Review of Existing and Emerging Therapies.
J Clin Med. 2023 Sep 22;12(19):6128. doi: 10.3390/jcm12196128.
