

Sufficient principal component regression for pattern discovery in transcriptomic data.

Authors

Ding Lei, Zentner Gabriel E, McDonald Daniel J

Affiliations

Department of Statistics, Indiana University, Bloomington, IN 47405, USA.

Department of Biology, Indiana University, Bloomington, IN 47405, USA.

Publication information

Bioinform Adv. 2022 May 14;2(1):vbac033. doi: 10.1093/bioadv/vbac033. eCollection 2022.

Abstract

MOTIVATION

Methods for the global measurement of transcript abundance such as microarrays and RNA-Seq generate datasets in which the number of measured features far exceeds the number of observations. Extracting biologically meaningful and experimentally tractable insights from such data therefore requires high-dimensional prediction. Existing sparse linear approaches to this challenge have been stunningly successful, but some important issues remain. These methods can fail to select the correct features, predict poorly relative to non-sparse alternatives or ignore any unknown grouping structures for the features.

RESULTS

We propose a method called SuffPCR that yields improved predictions in high-dimensional tasks including regression and classification, especially in the typical context of omics with correlated features. SuffPCR first estimates sparse principal components and then estimates a linear model on the recovered subspace. Because the estimated subspace is sparse in the features, the resulting predictions will depend on only a small subset of genes. SuffPCR works well on a variety of simulated and experimental transcriptomic data, performing nearly optimally when the model assumptions are satisfied. We also demonstrate near-optimal theoretical guarantees.
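The two-step idea described above (a sparse principal component step followed by a linear model on the recovered subspace) can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the SuffPCR algorithm or the suffpcr package interface: it swaps in a truncated power iteration for the sparse principal component step, uses a single component, and the function names `sparse_pc` and `fit_sparse_pcr`, the sparsity level `k` and the simulated data are all hypothetical choices made for the example.

```python
import math
import random


def sparse_pc(Xc, k, n_iter=50):
    """Leading sparse direction of centered data Xc (a list of rows).

    Truncated power iteration: multiply by Xc^T Xc, keep the k
    largest-magnitude loadings, renormalize. This is a stand-in for the
    sparse PCA step, not the estimator used by SuffPCR.
    """
    n, p = len(Xc), len(Xc[0])
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        # w = Xc^T (Xc v), computed without forming the p x p covariance
        scores = [sum(a * b for a, b in zip(row, v)) for row in Xc]
        w = [sum(Xc[i][j] * scores[i] for i in range(n)) for j in range(p)]
        keep = set(sorted(range(p), key=lambda j: abs(w[j]), reverse=True)[:k])
        w = [w[j] if j in keep else 0.0 for j in range(p)]
        norm = math.sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
    return v


def fit_sparse_pcr(X, y, k):
    """Regress y on the score of one sparse component; return (beta, intercept)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    y_bar = sum(y) / n
    v = sparse_pc(Xc, k)
    z = [sum(a * b for a, b in zip(row, v)) for row in Xc]
    # Least-squares slope of centered y on the component score z
    gamma = sum(zi * (yi - y_bar) for zi, yi in zip(z, y)) / sum(zi * zi for zi in z)
    # Map back to feature space: beta is zero wherever the loading is zero
    beta = [gamma * vj for vj in v]
    intercept = y_bar - sum(b * m for b, m in zip(beta, means))
    return beta, intercept


# Simulated data (assumption): 30 samples, 50 features, more features than samples.
# The first 5 features share one latent factor that also drives the response.
rng = random.Random(0)
n, p, k = 30, 50, 5
u = [rng.gauss(0, 1) for _ in range(n)]
X = [[(u[i] + 0.1 * rng.gauss(0, 1)) if j < k else 0.3 * rng.gauss(0, 1)
      for j in range(p)] for i in range(n)]
y = [2.0 * u[i] + 0.1 * rng.gauss(0, 1) for i in range(n)]

beta, intercept = fit_sparse_pcr(X, y, k)
support = [j for j, b in enumerate(beta) if b != 0.0]
print("features used by the fitted model:", support)
```

Because `beta` is nonzero only on the retained loadings, a prediction for a new sample `x_new` is `intercept + sum(b * x for b, x in zip(beta, x_new))` and depends on at most `k` features, which mirrors the abstract's point that predictions rely on a small subset of genes.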

AVAILABILITY AND IMPLEMENTATION

Code and raw data are freely available at https://github.com/dajmcdon/suffpcr. Package documentation may be viewed at https://dajmcdon.github.io/suffpcr.

CONTACT

daniel@stat.ubc.ca.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics Advances online.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c4df/9710670/d0d2c9518982/vbac033f1.jpg
