Acharya Sudipta, Saha Sriparna, Nikhil N
IIT Patna, Department of Computer Science and Engineering, Patna, India.
IIT Ropar, Department of Computer Science and Engineering, Punjab, India.
BMC Bioinformatics. 2017 Nov 22;18(1):513. doi: 10.1186/s12859-017-1933-0.
Classifying biological samples from gene expression data is a basic building block for several problems in bioinformatics, such as diagnosing cancer and other diseases and planning appropriate treatment. A major challenge in sample classification is handling high-dimensional, redundant gene expression data. Gene (feature) selection plays a major role in reducing the complexity of handling such data.
This paper explores the use of biological knowledge from the Gene Ontology database to select a suitable subset of genes, which is then used for clustering samples. The proposed feature selection technique is unsupervised: it uses no class label information during gene selection. Finally, a multi-objective clustering approach is applied to cluster the available samples in the reduced gene space.
The reported results show that incorporating biological knowledge into gene selection not only reduces the dimensionality of the feature space to a great extent but also improves the accuracy of sample classification. The reduced gene space is validated using biological significance tests. To demonstrate the advantage of the proposed gene-selection-based sample clustering technique, a thorough comparative analysis against state-of-the-art techniques is also performed.
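The abstract does not give the selection procedure in detail, so the following is only a minimal sketch of the general idea: use Gene Ontology annotations, without class labels, to discard redundant genes, then project the samples onto the reduced gene space. Everything specific here is an assumption for illustration and not the authors' method: the Jaccard similarity over GO term sets, the greedy selection rule, the `threshold` parameter, the function names, and the toy gene and GO identifiers. The multi-objective clustering step is not shown.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of GO term IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_genes(go_annotations, threshold=0.5):
    """Greedy, label-free gene selection (hypothetical rule).

    A gene is kept only if its GO-term similarity to every gene
    already kept is below `threshold`, so functionally redundant
    genes are dropped. No class labels are consulted.
    """
    selected = []
    for gene in sorted(go_annotations):
        terms = go_annotations[gene]
        if all(jaccard(terms, go_annotations[kept]) < threshold
               for kept in selected):
            selected.append(gene)
    return selected


def reduce_expression(expression, genes, selected):
    """Project each sample's expression vector onto the selected genes."""
    idx = [genes.index(g) for g in selected]
    return {sample: [values[i] for i in idx]
            for sample, values in expression.items()}


# Toy data: placeholder gene names and GO IDs, not real annotations.
go_annotations = {
    "geneA": {"GO:1", "GO:2", "GO:3"},
    "geneB": {"GO:1", "GO:2", "GO:3", "GO:4"},  # overlaps heavily with geneA
    "geneC": {"GO:5", "GO:6"},
    "geneD": {"GO:5", "GO:6"},                  # identical terms to geneC
}
genes = ["geneA", "geneB", "geneC", "geneD"]
expression = {
    "sample1": [1.0, 1.1, 5.0, 5.2],
    "sample2": [0.9, 1.0, 0.2, 0.1],
}

selected = select_genes(go_annotations, threshold=0.5)
reduced = reduce_expression(expression, genes, selected)
print(selected)
print(reduced)
```

Any clustering algorithm could then be run on `reduced`; the paper itself uses a multi-objective clustering approach at that stage.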