Huang Xiaohong, Pan Wei, Grindle Suzanne, Han Xinqiang, Chen Yingjie, Park Soon J, Miller Leslie W, Hall Jennifer
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA.
BMC Bioinformatics. 2005 Aug 24;6:205. doi: 10.1186/1471-2105-6-205.
Human heart failure is a complex disease that arises from multiple genetic and environmental factors. Although ischemic and non-ischemic heart disease present clinically with similar decreases in ventricular function, emerging work suggests that they are distinct diseases with different responses to therapy. The ability to distinguish between ischemic and non-ischemic heart failure may be essential for guiding appropriate therapy and determining prognosis. In this paper we consider discriminating the etiologies of heart failure using gene expression libraries from two separate institutions.
We apply five recently developed statistical methods (partial least squares, penalized partial least squares, LASSO, nearest shrunken centroids and random forest) to two real datasets and compare their performance for multiclass classification. The five methods perform similarly on each of the two datasets: the etiologies of heart failure are difficult to distinguish correctly in one dataset but easy to distinguish in the other. A simulation study confirms that the five methods tend to perform comparably, though random forest seems to have a slight edge.
For some gene expression data, several recently developed discriminant methods may perform similarly. More importantly, one must remain cautious when using a small dataset to assess how well gene expression profiles discriminate between classes; our analysis suggests the importance of utilizing multiple or larger datasets.
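
To make the kind of evaluation described above concrete, the following is a minimal sketch of one of the five methods, nearest shrunken centroids, assessed by leave-one-out cross-validation. It is a simplified illustration and not the authors' implementation: the data are synthetic Gaussian values, and the function names (`simulate`, `fit_shrunken_centroids`, `predict`, `loo_error`), the threshold `delta`, and all sample sizes are hypothetical choices made for this example.

```python
import math
import random
import statistics


def simulate(n_per_class, n_genes, n_informative, shift, rng):
    """Synthetic two-class expression data (an assumption, not the paper's data).

    Class 1 has its first n_informative genes shifted by `shift`.
    """
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(shift if (label == 1 and j < n_informative) else 0.0, 1.0)
                   for j in range(n_genes)]
            X.append(row)
            y.append(label)
    return X, y


def fit_shrunken_centroids(X, y, delta):
    """Fit a simplified nearest-shrunken-centroids model.

    delta is the soft-threshold amount; here it is fixed by hand,
    whereas in practice it would be tuned, e.g. by cross-validation.
    """
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    overall = [sum(row[j] for row in X) / n for j in range(p)]
    members = {k: [x for x, lab in zip(X, y) if lab == k] for k in classes}
    means = {k: [sum(r[j] for r in rows) / len(rows) for j in range(p)]
             for k, rows in members.items()}
    # Pooled within-class standard deviation for each gene.
    s = [math.sqrt(sum((x[j] - means[lab][j]) ** 2 for x, lab in zip(X, y))
                   / (n - len(classes)))
         for j in range(p)]
    s0 = statistics.median(s)  # offset that guards against tiny variances
    scale = [sj + s0 for sj in s]
    centroids = {}
    for k in classes:
        m_k = math.sqrt(1.0 / len(members[k]) - 1.0 / n)
        centroid = []
        for j in range(p):
            # Standardized difference between class mean and overall mean.
            d = (means[k][j] - overall[j]) / (m_k * scale[j])
            # Soft-threshold: shrink toward zero by delta.
            d_shrunk = math.copysign(max(abs(d) - delta, 0.0), d)
            centroid.append(overall[j] + m_k * scale[j] * d_shrunk)
        centroids[k] = centroid
    priors = {k: len(members[k]) / n for k in classes}
    return centroids, scale, priors


def predict(model, x):
    """Assign x to the class with the smallest standardized distance score."""
    centroids, scale, priors = model

    def score(k):
        dist = sum(((xj - cj) / sj) ** 2
                   for xj, cj, sj in zip(x, centroids[k], scale))
        return dist - 2.0 * math.log(priors[k])

    return min(centroids, key=score)


def loo_error(X, y, delta):
    """Leave-one-out misclassification rate."""
    errors = 0
    for i in range(len(X)):
        model = fit_shrunken_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:], delta)
        errors += predict(model, X[i]) != y[i]
    return errors / len(X)


if __name__ == "__main__":
    rng = random.Random(0)
    # Repeat a small-sample experiment to see how much the estimate varies.
    for rep in range(5):
        X, y = simulate(n_per_class=8, n_genes=50, n_informative=5, shift=1.0, rng=rng)
        print(f"replicate {rep}: LOO error = {loo_error(X, y, delta=0.5):.2f}")
```

Running the script repeats the same small-sample experiment several times; comparing the printed error rates across replicates gives a rough sense of how unstable an accuracy estimate from a single small dataset can be, which is the caution raised in the conclusions.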