Learning sparse log-ratios for high-throughput sequencing data.

Affiliations

Department of Statistics, Columbia University, New York, NY 10025, USA.

Applied Artificial Intelligence Institute, Deakin University, Geelong, VIC 3126, Australia.

Publication information

Bioinformatics. 2021 Dec 22;38(1):157-163. doi: 10.1093/bioinformatics/btab645.

DOI:10.1093/bioinformatics/btab645

PMID:34498030

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8696089/

Abstract

MOTIVATION

The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers are the log-ratios between the input variables. However, identifying predictive log-ratio biomarkers from HTS data is a combinatorial optimization problem, which is computationally challenging. Existing methods are slow to run and scale poorly with the dimension of the input, which has limited their application to low- and moderate-dimensional metagenomic datasets.
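For concreteness, the log-ratio biomarkers mentioned above can be written as follows. The notation ($x$, $p$, $I^+$, $I^-$) is introduced here for illustration and does not appear in the abstract; the balance form shown (a difference of mean logs, without a normalizing constant) is one common convention in the CoDa literature and may differ in detail from the one used in the paper.

```latex
% Pairwise log-ratio between inputs i and j of a composition x = (x_1, ..., x_p):
\[
  \ell_{ij}(x) = \log \frac{x_i}{x_j} = \log x_i - \log x_j
\]
% Balance-style log-ratio between two disjoint, non-empty index sets I^+ and I^-:
\[
  B(x; I^+, I^-) = \frac{1}{|I^+|} \sum_{i \in I^+} \log x_i
                 - \frac{1}{|I^-|} \sum_{j \in I^-} \log x_j
\]
```

With $p$ inputs there are $p(p-1)/2$ unordered pairs, and each input can be placed in $I^+$, in $I^-$ or in neither, which gives on the order of $3^p$ candidate balances. This is why exhaustive search becomes impractical as $p$ grows.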

RESULTS

Building on recent advances from the field of deep learning, we present CoDaCoRe, a novel learning algorithm that identifies sparse, interpretable and predictive log-ratio biomarkers. Our algorithm exploits a continuous relaxation to approximate the underlying combinatorial optimization problem. This relaxation can then be optimized efficiently using the modern ML toolbox, in particular, gradient descent. As a result, CoDaCoRe runs several orders of magnitude faster than competing methods, all while achieving state-of-the-art performance in terms of predictive accuracy and sparsity. We verify the outperformance of CoDaCoRe across a wide range of microbiome, metabolite and microRNA benchmark datasets, as well as a particularly high-dimensional dataset that is outright computationally intractable for existing sparse log-ratio selection methods.
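The idea of relaxing the discrete selection problem and optimizing it by gradient descent can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the `tanh` parameterization of the soft assignment, the fixed discretization threshold and the plain logistic loss are all choices made here for the example, and the CoDaCoRe package may differ on each of them.

```python
import numpy as np

def relaxed_balance(logx, w, eps=1e-8):
    """Soft log-ratio: each input gets a weight in (-1, 1) instead of a hard
    numerator / denominator / unused label (assumed parameterization)."""
    t = np.tanh(w)
    pos, neg = np.maximum(t, 0.0), np.maximum(-t, 0.0)
    z_pos = logx @ pos / (pos.sum() + eps)   # weighted mean log over soft numerator
    z_neg = logx @ neg / (neg.sum() + eps)   # weighted mean log over soft denominator
    return z_pos - z_neg, t, pos, neg, z_pos, z_neg

def fit(logx, y, steps=300, lr=0.1, seed=0, eps=1e-8):
    """Gradient descent on the logistic loss of a + b * relaxed_balance."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=logx.shape[1])
    a, b = 0.0, 1.0
    for _ in range(steps):
        z, t, pos, neg, z_pos, z_neg = relaxed_balance(logx, w, eps)
        logit = np.clip(a + b * z, -30.0, 30.0)   # clip only for numerical safety
        resid = (1.0 / (1.0 + np.exp(-logit)) - y) / len(y)   # dLoss/dlogit
        # Derivative of each sample's soft log-ratio w.r.t. each soft weight t_j.
        dz_dt = np.where(t > 0,
                         (logx - z_pos[:, None]) / (pos.sum() + eps),
                         (logx - z_neg[:, None]) / (neg.sum() + eps))
        grad_w = b * (resid @ dz_dt) * (1.0 - t ** 2)   # chain rule through tanh
        grad_a, grad_b = resid.sum(), resid @ z
        w, a, b = w - lr * grad_w, a - lr * grad_a, b - lr * grad_b
    return w, a, b

def discretize(w, threshold=0.5):
    """Turn soft weights into a sparse log-ratio (fixed threshold is an assumption)."""
    t = np.tanh(w)
    return np.flatnonzero(t > threshold), np.flatnonzero(t < -threshold)

if __name__ == "__main__":
    # Synthetic data: the outcome depends only on log(x_0 / x_1).
    rng = np.random.default_rng(1)
    logx = rng.normal(size=(200, 6))
    y = (logx[:, 0] - logx[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)
    w, a, b = fit(logx, y)
    print("numerator, denominator:", discretize(w))
```

The relaxed objective is differentiable almost everywhere, so each update costs a handful of matrix-vector products rather than a search over subsets. The final thresholding step is what recovers a sparse, interpretable ratio from the continuous weights.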

AVAILABILITY AND IMPLEMENTATION

The CoDaCoRe package is available at https://github.com/egr95/R-codacore. Code and instructions for reproducing our results are available at https://github.com/cunningham-lab/codacore.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
