A flexible, interpretable, and accurate approach for imputing the expression of unmeasured genes.

Affiliations

Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.

Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA.

Publication information

Nucleic Acids Res. 2020 Dec 2;48(21):e125. doi: 10.1093/nar/gkaa881.

DOI:10.1093/nar/gkaa881

PMID:33074331

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7708069/

Abstract

While there are >2 million publicly-available human microarray gene-expression profiles, these profiles were measured using a variety of platforms that each cover a pre-defined, limited set of genes. Therefore, key to reanalyzing and integrating this massive data collection are methods that can computationally reconstitute the complete transcriptome in partially-measured microarray samples by imputing the expression of unmeasured genes. Current state-of-the-art imputation methods are tailored to samples from a specific platform and rely on gene-gene relationships regardless of the biological context of the target sample. We show that sparse regression models that capture sample-sample relationships (termed SampleLASSO), built on-the-fly for each new target sample to be imputed, outperform models based on fixed gene relationships. Extensive evaluation involving three machine learning algorithms (LASSO, k-nearest-neighbors, and deep-neural-networks), two gene subsets (GPL96-570 and LINCS), and multiple imputation tasks (within and across microarray/RNA-seq datasets) establishes that SampleLASSO is the most accurate model. Additionally, we demonstrate the biological interpretability of this method by showing that, for imputing a target sample from a certain tissue, SampleLASSO automatically leverages training samples from the same tissue. Thus, SampleLASSO is a simple, yet powerful and flexible approach for harmonizing large-scale gene-expression data.
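The sample-sample idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that, for each target sample, a LASSO model is fit on the genes the target does measure, using fully measured training samples as features, and that the learned sample weights are then applied to the training samples' values for the unmeasured genes. The function names, the objective scaling, the value of `alpha`, the omission of an intercept, and the toy data are all assumptions made for illustration. Only the standard library is used, so the LASSO solver is a plain cyclic coordinate descent.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimize (1/2n)*||y - X w||^2 + alpha*||w||_1 by cyclic coordinate descent.

    X is a list of n rows (genes), each with p entries (training samples).
    No intercept is fit; that is a simplification for this sketch.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # residual = y - X w, and w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual, adding back w[j]'s own part.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_wj = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


def sample_lasso_impute(train, target_measured, measured_idx, unmeasured_idx, alpha=0.01):
    """Impute a target sample's unmeasured genes from weighted training samples.

    train: genes x samples matrix in which every gene is measured.
    target_measured: the target's expression values for the genes in measured_idx.
    """
    # Fit on measured genes: rows are genes, columns (features) are training samples.
    X_measured = [train[g] for g in measured_idx]
    weights = lasso_coordinate_descent(X_measured, target_measured, alpha)
    # Apply the same sample weights to the genes the target did not measure.
    imputed = [sum(w * x for w, x in zip(weights, train[g])) for g in unmeasured_idx]
    return imputed, weights


# Toy data (synthetic): the target is a mix of training samples 2 and 5.
rng = random.Random(0)
n_genes, n_samples = 60, 8
train = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(n_genes)]
target_full = [0.7 * row[2] + 0.3 * row[5] for row in train]
measured_idx = list(range(40))        # genes the target's platform covers
unmeasured_idx = list(range(40, 60))  # genes to impute
target_measured = [target_full[g] for g in measured_idx]

imputed, weights = sample_lasso_impute(train, target_measured, measured_idx, unmeasured_idx)
rmse = (sum((imputed[k] - target_full[g]) ** 2
            for k, g in enumerate(unmeasured_idx)) / len(unmeasured_idx)) ** 0.5
top_two = sorted(sorted(range(n_samples), key=lambda j: -abs(weights[j]))[:2])

print("sample weights:", [round(w, 3) for w in weights])
print("samples with the largest weights:", top_two)
print("imputation RMSE on held-out genes:", round(rmse, 4))
```

Because the model is refit for every target sample, the nonzero weights show which training samples the imputation drew on. In this toy setup those are the two samples the target was built from, which mirrors, in miniature, the abstract's observation that training samples from the same tissue are selected.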
