Classifying next-generation sequencing data using a zero-inflated Poisson model.

Author information

College of Mathematics and Statistics, Institute of Statistical Sciences, Shenzhen University, Shenzhen 518060, China.

Department of Computer Science, and Institute of Computational and Theoretical Studies, Hong Kong Baptist University, Kowloon Tong, Hong Kong.

Publication information

Bioinformatics. 2018 Apr 15;34(8):1329-1335. doi: 10.1093/bioinformatics/btx768.

Abstract

MOTIVATION

With the development of high-throughput techniques, RNA-sequencing (RNA-seq) is becoming increasingly popular as an alternative for gene expression analysis, such as RNA profiling and classification. Identifying which disease type a new patient belongs to from RNA-seq data has been recognized as a vital problem in medical research. Because RNA-seq data are discrete, statistical methods developed for classifying microarray data cannot be readily applied to RNA-seq data classification. In 2011, Witten proposed Poisson linear discriminant analysis (PLDA) to classify RNA-seq data. However, count datasets from real RNA-seq or microRNA sequencing experiments frequently contain excess zeros (for instance, when the sequencing depth is insufficient, or for small RNAs of 18-30 nucleotides in length). A new model is therefore needed to analyze RNA-seq data with an excess of zeros.

RESULTS

In this paper, we propose a Zero-Inflated Poisson Logistic Discriminant Analysis (ZIPLDA) for RNA-seq data with an excess of zeros. The new method assumes that the data come from a mixture of two distributions: a point mass at zero and a Poisson distribution. The model then uses a logistic relation to link the probability of observing a zero to the gene mean and the sequencing depth. Simulation studies show that the proposed method performs better than, or at least as well as, the existing methods in a wide range of settings. Two real datasets, a breast cancer RNA-seq dataset and a microRNA-seq dataset, are also analyzed; the results agree with the simulations, with the proposed method outperforming the existing competitors.
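The mixture described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`zip_pmf`, `zero_prob`, `classify`), the exact form of the logistic link, and its coefficients `a`, `b`, `c` are all assumptions made for illustration, and genes are treated as independent given the class.

```python
import math


def zip_pmf(x, lam, p_zero):
    """P(X = x) for a zero-inflated Poisson: a point mass at zero with
    weight p_zero mixed with Poisson(lam) with weight 1 - p_zero."""
    poisson = math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
    if x == 0:
        return p_zero + (1.0 - p_zero) * poisson
    return (1.0 - p_zero) * poisson


def zero_prob(mean, depth, a=0.0, b=-1.0, c=-1.0):
    """Hypothetical logistic link: the excess-zero probability decreases as
    the gene mean and the sequencing depth grow. The coefficients a, b, c
    are placeholders, not values from the paper."""
    z = a + b * math.log(mean) + c * math.log(depth)
    return 1.0 / (1.0 + math.exp(-z))


def classify(counts, class_means, depth, priors):
    """Return the index of the class with the largest log prior plus
    summed per-gene ZIP log-likelihood (independence across genes assumed)."""
    scores = []
    for means, prior in zip(class_means, priors):
        score = math.log(prior)
        for x, mu in zip(counts, means):
            lam = depth * mu  # Poisson mean scaled by sequencing depth
            score += math.log(zip_pmf(x, lam, zero_prob(mu, depth)))
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Two hypothetical classes, three genes each.
    class_means = [[1.0, 1.0, 10.0], [10.0, 10.0, 1.0]]
    sample = [0, 2, 12]
    print(classify(sample, class_means, depth=1.0, priors=[0.5, 0.5]))
```

In this toy example the sample has low counts for the first two genes and a high count for the third, which matches the first class profile. In practice the class means, depth factors, and link coefficients would be estimated from training data rather than fixed by hand.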

AVAILABILITY AND IMPLEMENTATION

The software is available at http://www.math.hkbu.edu.hk/~tongt.

CONTACT

xwan@comp.hkbu.edu.hk or tongt@hkbu.edu.hk.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

