Vêncio Ricardo Z N, Varuzza Leonardo, de B Pereira Carlos A, Brentani Helena, Shmulevich Ilya
Institute for Systems Biology, 1441 North 34th Street, Seattle, WA 98103-8904, USA.
BMC Bioinformatics. 2007 Jul 11;8:246. doi: 10.1186/1471-2105-8-246.
Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern" are important high-throughput techniques for digital gene expression measurement. As with other counting or voting processes, these measurements constitute compositional data, which exhibit properties particular to the simplex space, where the sum of the components is constrained. These properties are not present in regular Euclidean spaces, in which hybridization-based microarray data are often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for data generated by transcript enumeration techniques, since they ignore certain fundamental properties of this space.
Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. Simcluster is provided as a stand-alone command-line C package and as a user-friendly online tool. Both versions are available at: http://xerad.systemsbiology.net/simcluster.
Simcluster is designed in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in a number of contexts, including enumeration-based gene expression data.
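To illustrate why the simplex constraint matters for clustering, the following minimal sketch compares a Euclidean distance on raw tag counts with a distance computed after closure (normalizing to unit sum) and a centered log-ratio transform, a standard construction in compositional data analysis. It is not taken from Simcluster itself: the function names and the example libraries are hypothetical, and the sketch assumes strictly positive counts (zero counts would need separate handling before taking logarithms).

```python
import math


def closure(counts):
    """Rescale counts so the components sum to 1 (a point on the simplex)."""
    total = sum(counts)
    return [c / total for c in counts]


def clr(composition):
    """Centered log-ratio transform; assumes every component is > 0."""
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr-transformed compositions."""
    return math.dist(clr(closure(x)), clr(closure(y)))


def euclidean_distance(x, y):
    """Plain Euclidean distance on the raw counts, ignoring the simplex."""
    return math.dist(x, y)


# Hypothetical tag counts: the same relative expression profile,
# sequenced at two different depths (1x and 10x).
library_a = [10, 20, 70]
library_b = [100, 200, 700]

print("Euclidean on raw counts:", euclidean_distance(library_a, library_b))
print("Aitchison on compositions:", aitchison_distance(library_a, library_b))
```

In this sketch the two libraries differ only in sequencing depth, so the compositional distance treats them as the same profile, while the raw Euclidean distance reports them as far apart.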