Amniouel Soukaina, Yalamanchili Keertana, Sankararaman Sreenidhi, Jafri Mohsin Saleet
School of Systems Biology, George Mason University, Fairfax, VA 22030, USA.
School of Engineering, Brown University, Providence, RI 02912, USA.
BioMedInformatics. 2024 Jun;4(2):1396-1424. doi: 10.3390/biomedinformatics4020077. Epub 2024 May 22.
Ovarian cancer (OC) is the most lethal gynecological cancer in the United States. Among the different types of OC, serous ovarian cancer (SOC) stands out as the most prevalent. Transcriptomics techniques generate extensive gene expression data, yet only a few of these genes are relevant to clinical diagnosis.
Feature selection (FS) methods address the high dimensionality of such datasets. This study proposes a computational framework that applies FS techniques to identify genes highly associated with platinum-based chemotherapy response in SOC patients. Using SOC datasets from the Gene Expression Omnibus (GEO) database, the LASSO and varSelRF FS methods were employed. Machine learning classification algorithms such as random forest (RF) and support vector machine (SVM) were then used to evaluate the performance of the models.
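To illustrate how an L1 penalty performs feature selection, the following is a minimal sketch of LASSO fitted by coordinate descent. It is not the authors' pipeline: the data are synthetic, the loss is squared error rather than a classification loss, and the function names, the penalty value `lam`, and the sample and gene counts are all assumptions chosen for illustration.

```python
import random


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    X is a list of rows (samples x features); no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta, kept up to date incrementally
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Synthetic "expression matrix": 100 samples, 20 genes, only genes 0 and 3 matter.
random.seed(0)
n_samples, n_genes = 100, 20
X = [[random.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
y = [2.0 * row[0] - 1.5 * row[3] + random.gauss(0.0, 0.1) for row in X]

beta = lasso_coordinate_descent(X, y, lam=0.3)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected gene indices:", selected)
```

Genes whose coefficient is driven exactly to zero are dropped, so the non-zero set acts as the selected signature. In a framework like the one described, that reduced gene set would then be passed to a classifier such as RF or SVM for evaluation.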
The proposed framework identified biomarker panels of 9 and 10 genes that are highly correlated with platinum-paclitaxel and platinum-only response in SOC patients, respectively. Predictive models trained on the identified gene signatures achieved an accuracy above 90%.
In this study, we propose that applying multiple feature selection methods not only effectively reduces the number of identified biomarkers, enhancing their biological relevance, but also corroborates the efficacy of drug response prediction models in cancer treatment.