IEEE/ACM Trans Comput Biol Bioinform. 2018 Jul-Aug;15(4):1315-1324. doi: 10.1109/TCBB.2017.2712607. Epub 2017 Jun 6.
Accurate identification of cancer types is essential to cancer diagnosis and treatment. Because cancer tissue and normal tissue differ in gene expression, gene expression data can serve as an efficient feature source for cancer classification. However, accurate classification directly from the original gene expression profiles remains challenging because the data are intrinsically high-dimensional and the number of samples is small. We propose a new self-training subspace clustering algorithm under low-rank representation, called SSC-LRR, for cancer classification on gene expression data. Low-rank representation (LRR) is first applied to extract discriminative features from the high-dimensional gene expression data; the self-training subspace clustering (SSC) method is then used to generate the cancer classification predictions. SSC-LRR was tested on two separate benchmark datasets in comparison with four state-of-the-art classification methods. It generated cancer classification predictions with an overall accuracy of 89.7 percent and a general correlation of 0.920, which are 18.9 and 24.4 percent higher, respectively, than those of the best control method. In addition, several genes (RNF114, HLA-DRB5, USP9Y, and PTPN20) were identified by SSC-LRR as new cancer identifiers that deserve further clinical investigation. Overall, the study demonstrates a new, sensitive approach to recognizing cancer types from large-scale gene expression data.
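The two-stage idea described in the abstract (a low-rank representation of the samples, followed by self-training on that representation) can be illustrated with a small sketch. This is not the authors' SSC-LRR implementation: the abstract does not give the optimization details, so the code below makes several assumptions. It uses the closed-form solution of the noise-free LRR problem, Z = V Vᵀ from the skinny SVD of the data matrix, truncated to a chosen rank. It then labels unlabeled samples by their mean affinity to already-labeled samples, adopting only the most confident predictions in each round. The function names (`low_rank_representation`, `self_training_labels`, `demo`), the `rank` and `batch` parameters, and the synthetic data are all hypothetical and exist only for illustration.

```python
import numpy as np


def low_rank_representation(X, rank):
    """Noise-free LRR stand-in (assumption): Z = V_r V_r^T.

    X has shape (n_genes, n_samples). V_r holds the top `rank` right
    singular vectors, so Z is an (n_samples, n_samples) matrix that
    expresses each sample through the other samples.
    """
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    v = vt[:rank].T
    return v @ v.T


def self_training_labels(Z, labels, batch=5):
    """Simplified self-training on the affinity |Z| + |Z^T| (assumption).

    `labels` uses -1 for unlabeled samples and needs at least two classes.
    Each round scores every unlabeled sample by its mean affinity to the
    labeled samples of each class, then adopts the `batch` predictions
    with the largest margin between the best and second-best class.
    """
    labels = np.asarray(labels).copy()
    classes = np.unique(labels[labels >= 0])
    affinity = np.abs(Z) + np.abs(Z.T)
    np.fill_diagonal(affinity, 0.0)
    while np.any(labels < 0):
        unlabeled = np.flatnonzero(labels < 0)
        scores = np.stack(
            [affinity[unlabeled][:, labels == c].mean(axis=1) for c in classes],
            axis=1,
        )
        ordered = np.sort(scores, axis=1)
        margin = ordered[:, -1] - ordered[:, -2]
        pick = np.argsort(-margin)[:batch]
        labels[unlabeled[pick]] = classes[scores[pick].argmax(axis=1)]
    return labels


def demo(seed=0):
    """Synthetic stand-in for expression data: two low-dimensional subspaces."""
    rng = np.random.default_rng(seed)
    n_genes, per_class, dim = 200, 30, 3
    blocks = [
        rng.normal(size=(n_genes, dim)) @ rng.normal(size=(dim, per_class))
        for _ in range(2)
    ]
    X = np.hstack(blocks) + 0.01 * rng.normal(size=(n_genes, 2 * per_class))
    truth = np.repeat([0, 1], per_class)

    # Only five samples per class start out labeled.
    labels = np.full(truth.shape, -1)
    labels[:5] = 0
    labels[per_class:per_class + 5] = 1

    Z = low_rank_representation(X, rank=2 * dim)
    predicted = self_training_labels(Z, labels)
    return float((predicted == truth).mean())


if __name__ == "__main__":
    print("accuracy on synthetic data:", demo())
```

On real expression data the noise-free closed form would likely be too brittle. A full LRR solver with an explicit error term, and a stronger base classifier inside the self-training loop, would be the natural replacements for the two simplified steps above.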