Huang Chi-Cheng, Tu Shih-Hsin, Huang Ching-Shui, Lien Heng-Hui, Lai Liang-Chuan, Chuang Eric Y
Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, No. 1, Section 4, Roosevelt Road, Taipei 10617, Taiwan; Cathay General Hospital SiJhih, New Taipei, Taiwan; School of Medicine, Fu-Jen Catholic University, New Taipei, Taiwan; School of Medicine, Taipei Medical University, Taipei, Taiwan.
School of Medicine, Taipei Medical University, Taipei, Taiwan; Department of Surgery, Cathay General Hospital, Taipei, Taiwan.
Biomed Res Int. 2013;2013:248648. doi: 10.1155/2013/248648. Epub 2013 Dec 30.
Multiclass prediction remains an obstacle in the analysis of high-throughput data such as microarray gene expression profiles. Despite recent advances in machine learning and bioinformatics, most classification tools are limited to binary responses. Our aim was to apply partial least squares (PLS) regression to the intrinsic taxonomy of breast cancer, in which five distinct molecular subtypes have been identified. The PAM50 signature genes were used as predictive variables in the PLS analysis, and the latent gene component scores were used in a binary logistic regression for each molecular subtype. The 139 prototypical arrays used for PAM50 development served as the training dataset, and three independent microarray studies of Han Chinese patients were used for independent validation (n = 535). The agreement between PAM50 centroid-based single sample prediction (SSP) and PLS regression was excellent within the training samples (weighted Kappa: 0.988) but deteriorated substantially in the independent samples, which could be attributed to the many more samples left unclassified by PLS regression. When these unclassified samples were removed, the agreement between PAM50 SSP and PLS regression improved markedly (weighted Kappa: 0.829, as opposed to 0.541 when unclassified samples were included). Our study demonstrated the feasibility of PLS regression for multiclass prediction, and distinct clinical presentations and prognostic discrepancies were observed across breast cancer molecular subtypes.
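The agreement statistic reported above can be illustrated with a minimal sketch of Cohen's weighted Kappa. The abstract does not state which weighting scheme or category ordering was used, so the linear weights, the ordering of the subtype labels, and the function name `weighted_kappa` are assumptions made only for illustration; the toy labels are not data from the study.

```python
import math


def weighted_kappa(calls_a, calls_b, categories):
    """Cohen's weighted kappa with linear disagreement weights.

    calls_a, calls_b: equal-length sequences of labels from two classifiers.
    categories: labels in the order used for weighting (an assumption here;
    the abstract does not say how the subtypes were ordered).
    """
    if len(calls_a) != len(calls_b) or len(calls_a) == 0:
        raise ValueError("both label sequences must be non-empty and equal in length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")

    index = {label: i for i, label in enumerate(categories)}
    n = len(calls_a)

    # Observed joint proportions of (classifier A call, classifier B call).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(calls_a, calls_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each classifier.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted observed and chance-expected disagreement.
    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # linear weights (assumed)
            obs_disagreement += weight * observed[i][j]
            exp_disagreement += weight * row[i] * col[j]

    if exp_disagreement == 0.0:
        # Both classifiers used one identical category: kappa is undefined.
        return math.nan
    return 1.0 - obs_disagreement / exp_disagreement


# Hypothetical label set: the five intrinsic subtypes plus an unclassified bin.
SUBTYPES = ["Luminal A", "Luminal B", "HER2-enriched", "Basal-like",
            "Normal-like", "Unclassified"]

# Toy calls for illustration only.
ssp_calls = ["Luminal A", "Luminal B", "Basal-like", "HER2-enriched"]
pls_calls = ["Luminal A", "Luminal B", "Basal-like", "Unclassified"]
kappa_all = weighted_kappa(ssp_calls, pls_calls, SUBTYPES)
```

Dropping the samples that one method leaves unclassified before calling the function mirrors the second comparison described in the abstract, although the exact procedure used by the authors is not given here.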