College of Information Science & Technology, Drexel University, Philadelphia, PA 19104, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2012 Jul-Aug;9(4):980-91. doi: 10.1109/TCBB.2011.113.
In this paper, we present a method that enables both the homology-based approach and the composition-based approach to further study the functional core (i.e., the microbial core and the gene core, respectively). In the proposed method, major functional groups are identified by generative topic modeling, which can extract useful information from unlabeled data. We first show that a generative topic model can be used to model the taxon abundance information obtained by the homology-based approach and to study the microbial core. The model treats each sample as a “document” that contains a mixture of functional groups, while each functional group (also known as a “latent topic”) is a weighted mixture of species. Therefore, estimating the generative topic model for taxon abundance data uncovers the distribution over latent functions (latent topics) in each sample. Second, we show that a generative topic model can also be used to study the genome-level composition of “N-mer” features (DNA subreads obtained by composition-based approaches). The model treats each genome as a mixture of latent genetic patterns (latent topics), while each such pattern is a weighted mixture of “N-mer” features; thus the existence of core genomes can be indicated by a set of common N-mer features. After studying the mutual information between latent topics and gene regions, we provide an explanation of the functional roles of the uncovered latent genetic patterns. The experimental results demonstrate the effectiveness of the proposed method.
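The “genome as document, N-mer as word” view described above can be illustrated with a minimal sketch. The abstract does not name the specific topic model or inference procedure, so the LDA-style model fitted by collapsed Gibbs sampling below is an assumption, as are the function names (`nmer_counts`, `fit_topic_model`), the hyperparameters (`alpha`, `beta`, `iters`), and the toy sequences, which are made up purely for illustration. The same routine would apply to taxon abundance data by treating each sample as a list of taxon labels repeated by their counts.

```python
import random
from collections import Counter

def nmer_counts(sequence, n):
    """Count overlapping N-mers (length-n substrings) in a DNA sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def fit_topic_model(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit an LDA-style topic model by collapsed Gibbs sampling (assumed method).

    docs: list of token lists (tokens are N-mers or taxon labels).
    Returns (theta, phi, vocab), where theta[d][k] is the share of latent
    topic k in document d and phi[k][v] is the weight of vocab[v] in topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics
    doc_topic = [[0] * K for _ in docs]        # tokens of doc d assigned to topic k
    topic_word = [[0] * V for _ in range(K)]   # times word v is assigned to topic k
    topic_total = [0] * K                      # total tokens assigned to topic k
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(K)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][index[w]] += 1
            topic_total[k] += 1
        assign.append(zs)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = index[w], assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][v] + beta) / (topic_total[j] + V * beta)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed point estimates of the two families of mixtures.
    theta = [[(doc_topic[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    phi = [[(topic_word[k][v] + beta) / (topic_total[k] + V * beta) for v in range(V)]
           for k in range(K)]
    return theta, phi, vocab

# Toy genomes (hypothetical); each becomes a "document" of 3-mer tokens.
genomes = ["ACGTACGTACGTACGT", "ACGTACGTTTTTTTTT", "TTTTTTTTTTTTGGGG"]
docs = [list(nmer_counts(g, 3).elements()) for g in genomes]
theta, phi, vocab = fit_topic_model(docs, num_topics=2)
```

Here each row of `theta` is a genome's mixture over latent topics and each row of `phi` is a topic's weighted mixture over N-mers, mirroring the two levels of mixture the abstract describes. N-mers that carry high weight in a topic shared across genomes would be candidates for the "common N-mer features" mentioned above, although how the paper actually selects them is not stated in the abstract.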