A seriation approach for visualization-driven discovery of co-expression patterns in Serial Analysis of Gene Expression (SAGE) data.

Author information

Morozova Olena, Morozov Vyacheslav, Hoffman Brad G, Helgason Cheryl D, Marra Marco A

Affiliations

Genome Sciences Centre, BC Cancer Agency, Vancouver, British Columbia, Canada.

Publication information

PLoS One. 2008 Sep 12;3(9):e3205. doi: 10.1371/journal.pone.0003205.

DOI:10.1371/journal.pone.0003205

PMID:18787709

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2527533/

Abstract

BACKGROUND

Serial Analysis of Gene Expression (SAGE) is a DNA sequencing-based method for large-scale gene expression profiling that provides an alternative to microarray analysis. Most analyses of SAGE data aimed at identifying co-expressed genes have been accomplished using various versions of clustering approaches that often result in a number of false positives.

PRINCIPAL FINDINGS

Here we explore the use of seriation, a statistical approach for ordering sets of objects based on their similarity, for large-scale expression pattern discovery in SAGE data. For this specific task we implement a seriation heuristic we term 'progressive construction of contigs' that constructs local chains of related elements by sequentially rearranging margins of the correlation matrix. We apply the heuristic to the analysis of simulated and experimental SAGE data and compare our results to those obtained with a clustering algorithm developed specifically for SAGE data. We show using simulations that the performance of seriation compares favorably to that of the clustering algorithm on noisy SAGE data.
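To make the idea of seriation on a correlation matrix concrete, the following is a minimal sketch of one simple greedy scheme: every tag starts as its own chain, and at each step the two chains whose end elements are most strongly correlated are joined end to end. This is an illustration only and is not the published 'progressive construction of contigs' heuristic, whose details are not given in the abstract. The function name `greedy_chain_seriation`, the toy expression profiles, and the noise level are all assumptions made for this example.

```python
import numpy as np

def greedy_chain_seriation(corr):
    """Return an ordering of items in which strongly correlated items are adjacent.

    Illustrative greedy end-joining on a correlation matrix (an assumption
    for this sketch, not the heuristic from the paper).
    """
    n = corr.shape[0]
    chains = [[i] for i in range(n)]  # each item starts as its own chain
    while len(chains) > 1:
        best = None  # (score, index_a, index_b, end_a, end_b)
        for a in range(len(chains)):
            for b in range(a + 1, len(chains)):
                ca, cb = chains[a], chains[b]
                # Try joining either end of chain a to either end of chain b.
                for end_a in (0, -1):
                    for end_b in (0, -1):
                        score = corr[ca[end_a], cb[end_b]]
                        if best is None or score > best[0]:
                            best = (score, a, b, end_a, end_b)
        _, a, b, end_a, end_b = best
        # Orient both chains so the two joined ends meet in the middle.
        left = chains[a] if end_a == -1 else chains[a][::-1]
        right = chains[b] if end_b == 0 else chains[b][::-1]
        chains = [c for k, c in enumerate(chains) if k not in (a, b)]
        chains.append(left + right)
    return chains[0]

# Toy data (hypothetical): six tags measured across eight libraries.
# Even-numbered tags follow profile p1, odd-numbered tags follow profile p2.
rng = np.random.default_rng(0)
p1 = np.arange(1.0, 9.0)
p2 = np.array([8, 1, 7, 2, 6, 3, 5, 4], dtype=float)
expr = np.array([p1 if i % 2 == 0 else p2 for i in range(6)])
expr = expr + rng.normal(0, 0.05, expr.shape)  # small measurement noise

corr = np.corrcoef(expr)            # tag-by-tag Pearson correlation matrix
order = greedy_chain_seriation(corr)
print(order)
```

In this toy case the tags sharing a profile end up next to each other in `order`, so reordering the rows and columns of `corr` by `order` would show two blocks along the diagonal. The pairwise search over chain ends is quadratic in the number of chains at every step, which is acceptable for a small illustration but would need a more efficient design for a full SAGE tag set.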

CONCLUSIONS

We explore the use of a seriation approach for visualization-based pattern discovery in SAGE data. Using both simulations and experimental data, we demonstrate that seriation is able to identify groups of co-expressed genes more accurately than a clustering algorithm developed specifically for SAGE data. Our results suggest that seriation is a useful method for the analysis of gene expression data whose applicability should be further pursued.
