A Bayesian nonparametric approach for comparing clustering structures in EST libraries.

Author information

Lijoi Antonio, Mena Ramsés H, Prünster Igor

Affiliation

Department of Economics and Quantitative Methods, University of Pavia, Pavia, Italy.

Publication information

J Comput Biol. 2008 Dec;15(10):1315-27. doi: 10.1089/cmb.2008.0043.

DOI:10.1089/cmb.2008.0043

PMID:19040366

Abstract

Inference for Expressed Sequence Tag (EST) data is considered. We focus on evaluating the redundancy of a cDNA library and, more importantly, on comparing different libraries on the basis of their clustering structure. The numerical results we obtain allow us to assess the effect of an error correction procedure for EST data and to study the compatibility of single EST libraries with merged ones. The proposed method is based on a Bayesian nonparametric approach that makes it possible to understand the clustering mechanism that generates the observed data. As the specific nonparametric model, we use the two-parameter Poisson-Dirichlet (PD) process. The PD process is a tractable nonparametric prior and a natural candidate for modeling data arising from discrete distributions. It allows prediction and testing in order to analyze the clustering structure featured by the data. We show how a full Bayesian analysis can be performed and describe the corresponding computational algorithm.
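
To make the clustering mechanism concrete, the following is a minimal sketch of the standard predictive rule of the two-parameter PD process, not the authors' computational algorithm. Under this rule, after n reads grouped into k clusters with sizes n_1, ..., n_k, the next read starts a new cluster with probability (theta + k*sigma)/(theta + n) and joins cluster j with probability (n_j - sigma)/(theta + n), where 0 <= sigma < 1 and theta > -sigma. The function names and the parameter values used here are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import random


def prob_new_cluster(n, k, sigma, theta):
    """Probability that read n+1 opens a new cluster, given k clusters among n reads."""
    return (theta + k * sigma) / (theta + n)


def sample_pd_partition(n, sigma, theta, seed=None):
    """Simulate cluster sizes for n reads using the two-parameter PD predictive rule.

    Illustrative sketch only: sigma and theta are fixed by the caller here,
    whereas a full Bayesian analysis would treat them as unknown.
    """
    if not (0 <= sigma < 1) or theta <= -sigma:
        raise ValueError("need 0 <= sigma < 1 and theta > -sigma")
    rng = random.Random(seed)
    sizes = []  # sizes[j] is the number of reads assigned to cluster j
    for i in range(n):
        k = len(sizes)
        # The first read always opens a cluster; later reads follow the predictive rule.
        if i == 0 or rng.random() < prob_new_cluster(i, k, sigma, theta):
            sizes.append(1)
        else:
            # Join an existing cluster j with weight proportional to n_j - sigma.
            weights = [s - sigma for s in sizes]
            j = rng.choices(range(k), weights=weights)[0]
            sizes[j] += 1
    return sizes


if __name__ == "__main__":
    sizes = sample_pd_partition(1000, sigma=0.5, theta=1.0, seed=0)
    print("clusters:", len(sizes), "largest:", max(sizes))
    print("P(new cluster on next read):",
          prob_new_cluster(1000, len(sizes), 0.5, 1.0))
```

In the EST setting, a cluster can be read as a distinct gene and the new-cluster probability as a rough indicator of how redundant a library has become: the smaller it is, the more likely the next read repeats a gene already observed. Setting sigma to 0 recovers the Dirichlet process as a special case.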
