Gurbaxani Brian M, Jones James F, Goertzel Benjamin N, Maloney Elizabeth M
Centers for Disease Control and Prevention, 600 Clifton Road, MS A-15, Atlanta, GA 30333, USA.
Pharmacogenomics. 2006 Apr;7(3):455-65. doi: 10.2217/14622416.7.3.455.
To provide a mathematical introduction to the Wichita (KS, USA) clinical dataset, which comprises all of the nongenetic data (no microarray or single nucleotide polymorphism data) from the 2-day clinical evaluation, and to show the preliminary findings and limitations of popular, matrix algebra-based data mining techniques.
An initial matrix of 440 variables by 227 human subjects was reduced to 183 variables by 164 subjects. Variables were excluded if they correlated strongly with chronic fatigue syndrome (CFS) case classification by design (for example, the multidimensional fatigue inventory [MFI] data), if they were otherwise self-reported and also tended to correlate strongly with CFS classification, or if they were sparse or did not vary between cases and controls. Subjects were excluded if they did not clearly fall into well-defined CFS classifications, had comorbid depression with melancholic features, or met other medical or psychiatric exclusion criteria. Two popular data mining techniques, principal components analysis (PCA) and linear discriminant analysis (LDA), were used to determine how well the data separated into groups. Two different feature selection methods helped identify the most discriminating parameters.
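To illustrate how PCA and LDA differ when asked to separate two groups, the following is a minimal sketch on synthetic data, not the Wichita dataset and not the authors' implementation. The function names (`pca_scores`, `fisher_lda_direction`, `separation`), the group sizes, and the simulated effect sizes are all assumptions made for illustration. PCA finds directions of greatest variance without using group labels, whereas two-class LDA uses the labels to find the direction that best separates the group means relative to within-group scatter.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project the rows of X onto the leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                      # center each variable
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def fisher_lda_direction(X, y, ridge=1e-6):
    """Two-class Fisher discriminant direction, w proportional to Sw^{-1}(mu1 - mu0)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix, with a small ridge for numerical stability.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    Sw += ridge * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)

def separation(z, y):
    """Absolute difference of group means in pooled standard deviation units."""
    z0, z1 = z[y == 0], z[y == 1]
    pooled_sd = np.sqrt((z0.var(ddof=1) + z1.var(ddof=1)) / 2.0)
    return abs(z1.mean() - z0.mean()) / pooled_sd

# Synthetic data (hypothetical): 80 "controls" (0) and 80 "cases" (1), 10 variables.
rng = np.random.default_rng(0)
n_per_group, n_features = 80, 10
y = np.repeat([0, 1], n_per_group)
X = rng.normal(size=(2 * n_per_group, n_features))
X[:, 0] *= 5.0           # high-variance variable unrelated to group
X[y == 1, 1] += 1.5      # lower-variance variable that differs between groups

pc1 = pca_scores(X, n_components=1)[:, 0]
lda = X @ fisher_lda_direction(X, y)

print("separation along PC1:", separation(pc1, y))
print("separation along LDA:", separation(lda, y))
```

In this toy setting the first principal component is dominated by the high-variance, non-discriminating variable, so it separates the groups poorly, while the LDA direction weights the variable that actually differs between groups. Because the LDA direction is fitted and evaluated on the same subjects here, its separation is optimistic; held-out data would be needed for an honest estimate.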
Although purely biological features (variables) were found to separate CFS cases from controls, including many allostatic load and sleep-related variables, most parameters were not statistically significant individually. However, biological correlates of CFS, such as heart rate and heart rate variability, require further investigation.
Feature selection of a limited number of variables from the purely biological dataset produced better separation between groups than a PCA of the entire dataset. Feature selection highlighted the importance of many of the allostatic load variables studied in more detail by Maloney and colleagues in this issue [1], as well as some sleep-related variables. Nonetheless, matrix linear algebra-based data mining approaches appeared to be of limited utility when compared with more sophisticated nonlinear analyses on richer data types, such as those found in Maloney and colleagues [1] and Goertzel and colleagues [2] in this issue.
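The abstract does not name the two feature selection methods used. As one simple stand-in, the sketch below ranks variables by the absolute value of a Welch two-sample t statistic and keeps the top few. The function name `rank_features_by_t`, the synthetic data, and the choice of this ranking criterion are assumptions for illustration only and should not be read as the authors' procedure.

```python
import numpy as np

def rank_features_by_t(X, y):
    """Return (indices sorted by decreasing |t|, t statistics) for two groups."""
    X0, X1 = X[y == 0], X[y == 1]
    # Welch standard error: group variances are not assumed equal.
    se = np.sqrt(X0.var(axis=0, ddof=1) / len(X0) +
                 X1.var(axis=0, ddof=1) / len(X1))
    t = (X1.mean(axis=0) - X0.mean(axis=0)) / se
    order = np.argsort(-np.abs(t))
    return order, t

# Synthetic data (hypothetical): 50 variables, only two of which differ by group.
rng = np.random.default_rng(1)
n_per_group, n_features = 80, 50
y = np.repeat([0, 1], n_per_group)
X = rng.normal(size=(2 * n_per_group, n_features))
informative = [3, 17]
for j in informative:
    X[y == 1, j] += 1.2   # shift the "case" group on the informative variables

order, t = rank_features_by_t(X, y)
top_k = order[:5]
print("top-ranked variable indices:", top_k)
print("their t statistics:", np.round(t[top_k], 2))
```

A reduced matrix such as `X[:, top_k]` could then be passed to a discriminant analysis. Selecting variables and assessing group separation on the same subjects inflates the apparent separation, so in practice the selection step would sit inside a cross-validation loop.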