Bushel Pierre R, Wolfinger Russell D, Gibson Greg
National Center for Toxicogenomics, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina, USA.
BMC Syst Biol. 2007 Feb 23;1:15. doi: 10.1186/1752-0509-1-15.
Commonly employed clustering methods for analysis of gene expression data do not directly incorporate phenotypic data about the samples. Furthermore, clustering of samples with known phenotypes is typically performed in an informal fashion. The inability of clustering algorithms to incorporate biological data in the grouping process can limit proper interpretation of the data and its underlying biology.
We present a more formal approach, the modk-prototypes algorithm, for clustering biological samples based on simultaneously considering microarray gene expression data and classes of known phenotypic variables such as clinical chemistry evaluations and histopathologic observations. The strategy involves constructing an objective function that measures the dissimilarity of the samples, using the sum of the squared Euclidean distances for numeric microarray and clinical chemistry data and simple matching for categorical histopathology values. Separate weighting terms are used for microarray, clinical chemistry and histopathology measurements to control the influence of each data domain on the clustering of the samples. The dynamic validity index for numeric data was modified with a category utility measure for determining the number of clusters in the data sets. A cluster's prototype, formed from the mean of the values for numeric features and the mode of the categorical values of all the samples in the group, is representative of the phenotype of the cluster members. The approach is shown to work well with a simulated mixed data set and two real data examples containing numeric and categorical data types: one from a heart disease study and another from a study of acetaminophen (an analgesic) exposure in rat liver, which causes centrilobular necrosis.
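The weighted mixed-type dissimilarity and the mean/mode prototype described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary keys (`array`, `chem`, `histo`), the function names, and the weight parameters `w_array`, `w_chem` and `w_histo` are hypothetical, and the exact form of the published objective function (for example, any normalization of the terms) is not given here and is therefore assumed.

```python
from collections import Counter

import numpy as np


def mixed_dissimilarity(sample, prototype, w_array=1.0, w_chem=1.0, w_histo=1.0):
    """Weighted dissimilarity between a sample and a cluster prototype.

    Numeric domains (microarray, clinical chemistry) contribute the sum of
    squared Euclidean differences; the categorical histopathology domain
    contributes a simple-matching count (number of mismatched categories).
    The three weights are assumed, illustrative parameters.
    """
    d_array = np.sum(
        (np.asarray(sample["array"], float) - np.asarray(prototype["array"], float)) ** 2
    )
    d_chem = np.sum(
        (np.asarray(sample["chem"], float) - np.asarray(prototype["chem"], float)) ** 2
    )
    d_histo = sum(a != b for a, b in zip(sample["histo"], prototype["histo"]))
    return float(w_array * d_array + w_chem * d_chem + w_histo * d_histo)


def cluster_prototype(samples):
    """Prototype of a cluster: mean of numeric features, mode of categorical ones."""
    histo_columns = zip(*(s["histo"] for s in samples))
    return {
        "array": np.mean([s["array"] for s in samples], axis=0),
        "chem": np.mean([s["chem"] for s in samples], axis=0),
        # Ties in the mode are broken by first occurrence (an arbitrary choice).
        "histo": [Counter(col).most_common(1)[0][0] for col in histo_columns],
    }


# Toy samples with made-up values, purely for illustration.
a = {"array": [1.0, 2.0], "chem": [10.0], "histo": ["none", "mild"]}
b = {"array": [3.0, 2.0], "chem": [12.0], "histo": ["none", "moderate"]}

proto = cluster_prototype([a, b, a])
print(proto["histo"], mixed_dissimilarity(b, proto))
```

Raising one weight relative to the others lets that data domain dominate the assignment of samples to clusters, which is the role the abstract gives to the separate weighting terms.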
The modk-prototypes algorithm partitioned the simulated data into clusters whose samples matched their respective class groups, and partitioned the heart disease samples into two groups (sick and buff, denoting samples having pain type representative of angina and non-angina, respectively) with an accuracy of 79%. This is on par with, or better than, the assignment accuracy of the heart disease samples by several well-known and successful clustering algorithms. Following modk-prototypes clustering of the acetaminophen-exposed samples, informative genes from the cluster prototypes were identified that are descriptive of, and phenotypically anchored to, levels of necrosis of the centrilobular region of the rat liver. The biological processes cell growth and/or maintenance, amine metabolism, and stress response were shown to discriminate between no necrosis and moderate levels of acetaminophen-induced centrilobular necrosis. The use of well-known and traditional measurements directly in the clustering provides some guarantee that the resulting clusters will be meaningfully interpretable.