Evaluation of critical data processing steps for reliable prediction of gene co-expression from large collections of RNA-seq data.

Affiliations

Institute for Frontier Life and Medical Sciences, Kyoto University, Kyoto, Japan.

Institute for Liberal Arts and Sciences, Kyoto University, Kyoto, Japan.

Publication information

PLoS One. 2022 Jan 28;17(1):e0263344. doi: 10.1371/journal.pone.0263344. eCollection 2022.

DOI:10.1371/journal.pone.0263344

PMID:35089979

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8797241/

Abstract

MOTIVATION

Gene co-expression analysis is an attractive tool for leveraging the enormous number of public RNA-seq datasets to predict gene functions and regulatory mechanisms. However, the optimal data processing steps for accurately predicting gene co-expression from such large datasets remain unclear. In particular, the importance of batch effect correction is understudied.

RESULTS

We processed RNA-seq data of 68 human and 76 mouse cell types and tissues using 50 different workflows into 7,200 genome-wide gene co-expression networks. We then systematically analyzed the factors that result in high-quality co-expression predictions, focusing on normalization, batch effect correction, and the measure of correlation. We confirmed the key importance of high sample counts for high-quality predictions. However, choosing a suitable normalization approach and applying batch effect correction can further improve the quality of co-expression estimates, equivalent to increases of >80% and >40% in sample count, respectively. In larger datasets, batch effect removal was equivalent to more than doubling the sample size. Finally, Pearson correlation appears more suitable than Spearman correlation, except for smaller datasets.
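As an illustration only, the three processing choices discussed above (normalization, batch effect correction, and measure of correlation) can be sketched in a few lines of NumPy. This is not the authors' pipeline: the function names, the counts-per-million normalization, the toy data, and the per-batch mean-centering (a deliberately simple stand-in for dedicated batch correction methods) are all assumptions made for this example.

```python
import numpy as np


def cpm_log(counts, pseudocount=1.0):
    """Scale a genes x samples count matrix to counts per million, then log2.

    Assumed normalization for this sketch; other approaches exist.
    """
    lib_size = counts.sum(axis=0, keepdims=True)
    return np.log2(counts / lib_size * 1e6 + pseudocount)


def center_by_batch(expr, batches):
    """Shift each batch so its per-gene mean equals the overall per-gene mean.

    A simple stand-in for batch correction, not a published method.
    """
    expr = np.asarray(expr, dtype=float)
    out = expr.copy()
    grand_mean = expr.mean(axis=1, keepdims=True)
    for b in np.unique(batches):
        cols = batches == b
        batch_mean = expr[:, cols].mean(axis=1, keepdims=True)
        out[:, cols] -= batch_mean - grand_mean
    return out


def average_ranks(x):
    """Rank a 1-D array (1-based), giving tied values their average rank."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)
    starts = ends - counts + 1
    return ((starts + ends) / 2.0)[inverse]


def pearson(expr):
    """Gene-by-gene Pearson correlation matrix (rows are genes)."""
    return np.corrcoef(expr)


def spearman(expr):
    """Gene-by-gene Spearman correlation: Pearson correlation of the ranks."""
    ranks = np.apply_along_axis(average_ranks, 1, expr)
    return np.corrcoef(ranks)


# Toy data (hypothetical): 5 genes, 12 samples, two batches of 6 samples.
rng = np.random.default_rng(0)
batches = np.array([0] * 6 + [1] * 6)
counts = rng.poisson(lam=50, size=(5, 12)) + 1
counts[:, batches == 1] *= 2  # crude artificial batch effect

expr = cpm_log(counts)
corrected = center_by_batch(expr, batches)
pearson_net = pearson(corrected)
spearman_net = spearman(corrected)
```

Each resulting matrix holds one co-expression estimate per gene pair. Evaluating which combination of steps gives better networks requires an external benchmark, which this sketch does not attempt.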

CONCLUSION

Collecting many samples is key to the accurate prediction of gene co-expression. However, careful attention to data normalization, batch effects, and the choice of correlation measure can significantly improve the quality of co-expression estimates.
