Learning sparse log-ratios for high-throughput sequencing data.

Affiliations

Department of Statistics, Columbia University, New York, NY 10025, USA.

Applied Artificial Intelligence Institute, Deakin University, Geelong, VIC 3126, Australia.

Publication information

Bioinformatics. 2021 Dec 22;38(1):157-163. doi: 10.1093/bioinformatics/btab645.

Abstract

MOTIVATION

The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers is the log-ratios between the input variables. However, identifying predictive log-ratio biomarkers from HTS data is a combinatorial optimization problem, which is computationally challenging. Existing methods are slow to run and scale poorly with the dimension of the input, which has limited their application to low- and moderate-dimensional metagenomic datasets.
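As a concrete illustration of what a log-ratio biomarker is, the sketch below computes the log-ratio between two sets of input variables for a single sample. The function name `log_ratio`, the example counts and the use of summed groups (rather than a single pair of variables) are assumptions made for illustration and are not taken from the paper. The sketch also shows why log-ratios suit HTS data: multiplying every count by the same factor, as a change in sequencing depth would, leaves the log-ratio unchanged.

```python
import math

def log_ratio(counts, numerator, denominator):
    """Log of (sum of numerator parts) / (sum of denominator parts).

    `numerator` and `denominator` are lists of variable indices; with one
    index on each side this reduces to a simple pairwise log-ratio.
    """
    num = sum(counts[j] for j in numerator)
    den = sum(counts[j] for j in denominator)
    return math.log(num) - math.log(den)

# Hypothetical read counts for four taxa in one sample.
sample = [120, 30, 50, 800]
# The same sample sequenced ten times deeper.
deeper = [10 * c for c in sample]

lr_sample = log_ratio(sample, [0], [1])   # log(120 / 30) = log(4)
lr_deeper = log_ratio(deeper, [0], [1])   # log(1200 / 300) = log(4)
print(lr_sample, lr_deeper)
```

Choosing which indices go in `numerator` and `denominator` is the combinatorial part of the problem: with p input variables there are on the order of 3^p ways to assign each variable to the numerator, the denominator or neither.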

RESULTS

Building on recent advances from the field of deep learning, we present CoDaCoRe, a novel learning algorithm that identifies sparse, interpretable and predictive log-ratio biomarkers. Our algorithm exploits a continuous relaxation to approximate the underlying combinatorial optimization problem. This relaxation can then be optimized efficiently using the modern ML toolbox, in particular, gradient descent. As a result, CoDaCoRe runs several orders of magnitude faster than competing methods, while achieving state-of-the-art performance in terms of predictive accuracy and sparsity. We show that CoDaCoRe outperforms competing methods across a wide range of microbiome, metabolite and microRNA benchmark datasets, as well as on a particularly high-dimensional dataset that is outright computationally intractable for existing sparse log-ratio selection methods.
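The general idea of relaxing a discrete log-ratio selection problem and optimizing it by gradient descent can be sketched as follows. This is a minimal toy sketch and not the CoDaCoRe implementation: the parameterization (two weight vectors `u` and `v` passed through a sigmoid), the fixed slope of 1 with no intercept, the synthetic data and the final thresholding rule are all assumptions chosen for brevity. The authors' actual method is in the repositories listed under Availability and Implementation.

```python
import math
import random

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def soft_log_ratio(x, u, v):
    # Relaxed log-ratio: part j enters the numerator with soft weight
    # sigmoid(u[j]) and the denominator with soft weight sigmoid(v[j]).
    num = sum(xj * sigmoid(uj) for xj, uj in zip(x, u))
    den = sum(xj * sigmoid(vj) for xj, vj in zip(x, v))
    return math.log(num) - math.log(den), num, den

def loss_and_grads(X, y, u, v):
    # Mean logistic loss of the relaxed log-ratio, with analytic gradients.
    p = len(u)
    loss, gu, gv = 0.0, [0.0] * p, [0.0] * p
    for x, yi in zip(X, y):
        z, num, den = soft_log_ratio(x, u, v)
        loss += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - yi * z
        err = sigmoid(z) - yi  # derivative of the per-sample loss w.r.t. z
        for j in range(p):
            su, sv = sigmoid(u[j]), sigmoid(v[j])
            gu[j] += err * x[j] * su * (1.0 - su) / num
            gv[j] -= err * x[j] * sv * (1.0 - sv) / den
    n = len(X)
    return loss / n, [g / n for g in gu], [g / n for g in gv]

def fit(X, y, steps=300, lr=0.5):
    # Plain gradient descent on the relaxed objective.
    p = len(X[0])
    u, v = [0.0] * p, [0.0] * p
    history = []
    for _ in range(steps):
        loss, gu, gv = loss_and_grads(X, y, u, v)
        history.append(loss)
        u = [a - lr * g for a, g in zip(u, gu)]
        v = [a - lr * g for a, g in zip(v, gv)]
    return u, v, history

def discretize(u, v):
    # Assumed rounding rule: a part joins a side only if its soft weight
    # there exceeds 0.5 and also exceeds its weight on the other side.
    num_idx = [j for j in range(len(u)) if u[j] > max(0.0, v[j])]
    den_idx = [j for j in range(len(v)) if v[j] > max(0.0, u[j])]
    return num_idx, den_idx

# Synthetic compositions with four parts; the outcome depends only on
# whether part 0 exceeds part 1, i.e. on the sign of log(x0 / x1).
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    raw = [rng.random() + 0.05 for _ in range(4)]
    total = sum(raw)
    x = [r / total for r in raw]
    X.append(x)
    y.append(1 if x[0] > x[1] else 0)

u, v, history = fit(X, y)
num_idx, den_idx = discretize(u, v)
print("loss:", history[0], "->", history[-1])
print("numerator parts:", num_idx, "denominator parts:", den_idx)
```

The point of the sketch is the cost profile: each gradient step is linear in the number of samples and input variables, whereas an exhaustive search over numerator and denominator assignments grows exponentially with the number of variables.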

AVAILABILITY AND IMPLEMENTATION

The CoDaCoRe package is available at https://github.com/egr95/R-codacore. Code and instructions for reproducing our results are available at https://github.com/cunningham-lab/codacore.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.


