Institute of Biological Sciences, Federal University of Minas Gerais, Brazil. Av. Antônio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil.
Gene. 2020 Feb 5;726:144168. doi: 10.1016/j.gene.2019.144168. Epub 2019 Nov 21.
Methods based on statistics and linear algebra are increasingly used to address emerging questions in the microarray literature. Microarray technology is a long-used tool in the global analysis of gene expression, allowing the simultaneous investigation of hundreds or thousands of genes in a sample. Microarray data are characterized by a small sample size and a large number of features, which produces a non-square data matrix of incomplete rank that can admit countless solutions in a classifier. To avoid the 'curse of dimensionality', many authors have performed feature selection or reduced the size of the data matrix. In this work, we introduce a new logistic regression-based model to classify breast cancer tumor samples from microarray expression data, using all gene expression features and without reducing the microarray data matrix. If the user still deems feature reduction necessary, it can be performed after the methodology has been applied, while still maintaining good classification. This methodology correctly classified the breast cancer samples in the Gene Expression Omnibus (GEO) data series GSE65194, GSE20711, and GSE25055, which contain the microarray data for those samples. Classification achieved a sensitivity and specificity of at least 80% and explored all possible data combinations, including breast cancer subtypes. The methodology highlighted genes not yet studied in breast cancer, some of which have been observed in Gene Regulatory Networks (GRNs). In this work we examine the patterns and features of a GRN composed of transcription factors (TFs) in MCF-7 breast cancer cell lines, providing valuable information regarding breast cancer. In particular, some genes have associated αi* parameters with extreme positive or negative values and, as such, can be identified as breast cancer prediction genes.
We indicate that the PKN2, MKL1, MED23, CUL5 and GLI genes demonstrate a tumor suppressor profile, and that the MTR, ITGA2B, TELO2, MRPL9, MTTL1, WIPI1, KLHL20, PI4KB, FOLR1 and SHC1 genes demonstrate an oncogenic profile. We propose that these may serve as potential breast cancer prediction genes, and should be prioritized for further clinical studies on breast cancer. This new model allows for the assignment of values to the αi* parameters associated with gene expression. It was noted that some αi* parameters are associated with genes previously described as breast cancer biomarkers, as well as other genes not yet studied in relation to this disease.
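The general idea described above (fit a logistic model on all features of a matrix with far fewer samples than genes, report sensitivity and specificity, then inspect the genes with the most extreme coefficients) can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not specify how the αi* parameters are estimated, so the sketch swaps in an ordinary L2-penalised logistic regression fitted by gradient descent, where the penalty is one common way to obtain a unique solution when the matrix has incomplete rank. The data are synthetic, and the names `GENE_UP`, `GENE_DOWN`, `noise_k`, `fit_logistic_l2` and `sensitivity_specificity` are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, b, w):
    """Predicted probability of the positive (tumor) class for one sample."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def fit_logistic_l2(X, y, lam=0.1, lr=0.1, epochs=300):
    """L2-penalised logistic regression by batch gradient descent.

    Assumption: the ridge penalty `lam` stands in for whatever device the
    paper uses to handle the rank-deficient (samples << genes) matrix.
    Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    b, w = 0.0, [0.0] * p
    for _ in range(epochs):
        grad_b = 0.0
        grad_w = [lam * wj for wj in w]  # gradient of (lam / 2) * ||w||^2
        for xi, yi in zip(X, y):
            err = (predict_proba(xi, b, w) - yi) / n
            grad_b += err
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
        b -= lr * grad_b
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return b, w


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, q in pairs if t == 1 and q == 1)
    fn = sum(1 for t, q in pairs if t == 1 and q == 0)
    tn = sum(1 for t, q in pairs if t == 0 and q == 0)
    fp = sum(1 for t, q in pairs if t == 0 and q == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Synthetic, standardised toy data: 20 samples x 50 genes (samples << genes).
# Only the first two hypothetical genes carry signal; the rest are noise.
random.seed(0)
genes = ["GENE_UP", "GENE_DOWN"] + [f"noise_{k}" for k in range(48)]
y = [1] * 10 + [0] * 10  # 1 = tumor, 0 = non-tumor
X = []
for label in y:
    shift = 2.0 if label == 1 else -2.0
    row = [random.gauss(0.0, 0.5) for _ in genes]
    row[0] += shift  # higher in tumor samples
    row[1] -= shift  # lower in tumor samples
    X.append(row)

b, coef = fit_logistic_l2(X, y)
y_pred = [1 if predict_proba(x, b, coef) >= 0.5 else 0 for x in X]
sens, spec = sensitivity_specificity(y, y_pred)  # training-set metrics only

# Rank genes from most positive to most negative coefficient.
ranked = sorted(zip(genes, coef), key=lambda gc: gc[1], reverse=True)
print("sensitivity:", sens, "specificity:", spec)
print("most positive:", ranked[0][0], "most negative:", ranked[-1][0])
```

The metrics here are computed on the training samples, so they say nothing about generalisation; a real analysis of GEO series would need held-out data or cross-validation. The sketch also makes no claim about which coefficient sign corresponds to a tumor suppressor or an oncogenic profile, since the abstract does not state that mapping.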