Understanding sequencing data as compositions: an outlook and review.

Affiliations

Bioinformatics Core Research Group, Deakin University, Geelong, Australia.

Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain.

Publication information

Bioinformatics. 2018 Aug 15;34(16):2870-2878. doi: 10.1093/bioinformatics/bty175.

DOI:10.1093/bioinformatics/bty175

PMID:29608657

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6084572/

Abstract

MOTIVATION

Although seldom acknowledged explicitly, count data generated by sequencing platforms exist as compositions for which the abundance of each component (e.g. gene or transcript) is only coherently interpretable relative to other components within that sample. This property arises from the assay technology itself, whereby the number of counts recorded for each sample is constrained by an arbitrary total sum (i.e. library size). Consequently, sequencing data, as compositional data, exist in a non-Euclidean space that, without normalization or transformation, renders invalid many conventional analyses, including distance measures, correlation coefficients and multivariate statistical models.
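The constraint described above can be illustrated with a minimal sketch. It assumes hypothetical counts for four genes and uses the centered log-ratio (CLR) transform, a common CoDA transformation that the abstract itself does not name. The same sample "sequenced" at two library sizes has an identical composition, yet the raw counts are far apart in Euclidean terms, while the CLR-transformed values coincide.

```python
import numpy as np

def closure(counts):
    """Rescale each sample (row) to proportions that sum to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def clr(counts):
    """Centered log-ratio: log of each component minus the row's mean log.

    Assumes strictly positive values; zero counts need separate handling.
    """
    logx = np.log(np.asarray(counts, dtype=float))
    return logx - logx.mean(axis=1, keepdims=True)

# Hypothetical counts for 4 genes in one sample, at two sequencing depths.
shallow = np.array([[10, 20, 30, 40]])
deep = shallow * 50  # same relative abundances, 50x larger library size

# The relative information is identical at both depths.
same_composition = np.allclose(closure(shallow), closure(deep))
same_clr = np.allclose(clr(shallow), clr(deep))

# Euclidean distance on raw counts reflects library size, not biology.
raw_distance = float(np.linalg.norm(shallow - deep))
clr_distance = float(np.linalg.norm(clr(shallow) - clr(deep)))

print(same_composition, same_clr)
print(raw_distance, clr_distance)
```

In this toy case the raw-count distance is driven entirely by the arbitrary total, which is the sense in which untransformed distance measures can mislead. Real data contain zeros, so a zero-handling strategy would be needed before taking logarithms.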

RESULTS

The purpose of this review is to summarize the principles of compositional data analysis (CoDA), provide evidence for why sequencing data are compositional, discuss compositionally valid methods available for analyzing sequencing data, and highlight future directions with regard to this field of study.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
