Supervised learning and model analysis with compositional data.

Affiliations

Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark.

Helmholtz Munich, Munich, Germany.

Publication information

PLoS Comput Biol. 2023 Jun 30;19(6):e1011240. doi: 10.1371/journal.pcbi.1011240. eCollection 2023 Jun.

Abstract

Supervised learning, such as regression and classification, is an essential tool for analyzing modern high-throughput sequencing data, for example in microbiome research. However, because such data are compositional and sparse, existing techniques are often inadequate. Either they rely on extensions of the linear log-contrast model (which adjust for compositionality but cannot account for complex signals or sparsity) or they are based on black-box machine learning methods (which may capture useful signals, but lack interpretability due to the compositionality). We propose KernelBiome, a kernel-based nonparametric regression and classification framework for compositional data. It is tailored to sparse compositional data and is able to incorporate prior knowledge, such as phylogenetic structure. KernelBiome captures complex signals, including in the zero-structure, while automatically adapting model complexity. We demonstrate on-par or improved predictive performance compared with state-of-the-art machine learning methods on 33 publicly available microbiome datasets. Additionally, our framework provides two key advantages: (i) We propose two novel quantities to interpret contributions of individual components and prove that they consistently estimate average perturbation effects of the conditional mean, extending the interpretability of linear log-contrast coefficients to nonparametric models. (ii) We show that the connection between kernels and distances aids interpretability and provides a data-driven embedding that can augment further analysis. KernelBiome is available as an open-source Python package on PyPI and at https://github.com/shimenghuang/KernelBiome.
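
To make the idea of kernel-based regression on sparse compositional data concrete, the following is a minimal sketch and not the KernelBiome API. It assumes a Gaussian kernel on square-root-transformed compositions (a Hellinger-type kernel, chosen here only because it is well defined when components are exactly zero), synthetic data, and scikit-learn's `KernelRidge` with a precomputed kernel. The function name `hellinger_rbf_kernel`, the bandwidth `sigma`, and the penalty `alpha` are illustrative assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge


def hellinger_rbf_kernel(X, Y, sigma=1.0):
    """Gaussian kernel on sqrt-transformed compositions (hypothetical helper).

    Rows of X and Y are compositions (non-negative, summing to 1).
    Zeros need no pseudocount because sqrt(0) is well defined.
    """
    sx, sy = np.sqrt(X), np.sqrt(Y)
    # Squared Euclidean distances between sqrt-transformed rows.
    sq_dists = (
        np.sum(sx**2, axis=1)[:, None]
        + np.sum(sy**2, axis=1)[None, :]
        - 2.0 * sx @ sy.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against rounding below zero
    return np.exp(-sq_dists / (2.0 * sigma**2))


# Synthetic sparse counts, normalized to compositions (illustrative data only).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(60, 8)).astype(float)
counts[:, 0] += 1.0  # make sure no row sums to zero
X = counts / counts.sum(axis=1, keepdims=True)
y = np.sin(4.0 * X[:, 0]) + 0.1 * rng.normal(size=60)

X_train, X_test = X[:45], X[45:]
y_train = y[:45]

# Kernel ridge regression with a precomputed kernel matrix.
K_train = hellinger_rbf_kernel(X_train, X_train)
K_test = hellinger_rbf_kernel(X_test, X_train)
model = KernelRidge(alpha=0.1, kernel="precomputed")
model.fit(K_train, y_train)
pred = model.predict(K_test)
```

In practice the kernel and its hyperparameters would be selected by cross-validation rather than fixed as above; the abstract describes this as automatically adapting model complexity, but the details of that procedure are not given here.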
