C Lavanya, S Pooja, Kashyap Abhay H, Rahaman Abdur, Niranjan Swarna, Niranjan Vidya
Department of Biotechnology, RV College of Engineering, Bengaluru, Karnataka, India.
Department of Computer Science and Engineering, RV College of Engineering, Bengaluru, Karnataka, India.
Cancer Inform. 2023 Apr 21;22:11769351231167992. doi: 10.1177/11769351231167992. eCollection 2023.
Lung cancer is considered the most common and deadliest cancer type. It is mainly of 2 types: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). NSCLC accounts for about 85% of cases, while SCLC accounts for only about 14%. Over the last decade, functional genomics has emerged as a revolutionary tool for studying genetics and uncovering changes in gene expression. RNA-Seq has been applied to investigate rare and novel transcripts that aid in discovering the genetic changes that occur in tumours in different lung cancers. Although RNA-Seq helps to understand and characterise the gene expression involved in lung cancer diagnostics, discovering biomarkers remains a challenge. Classification models help uncover and classify biomarkers based on gene expression levels across the different lung cancers. The current research concentrates on computing transcript statistics from gene transcript files with a normalised fold change of genes, and on identifying quantifiable differences in gene expression levels between the reference genome and lung cancer samples. The collected data were analysed, and machine learning models were developed to classify genes as causing NSCLC, causing SCLC, causing both or causing neither. An exploratory data analysis was performed to identify the probability distribution and principal features. Because only a limited number of features were available, all of them were used in predicting the class. To address the class imbalance in the dataset, the Near Miss under-sampling algorithm was applied. For classification, the research primarily focused on 4 supervised machine learning algorithms (Logistic Regression, KNN classifier, SVM classifier and Random Forest classifier); additionally, 2 ensemble algorithms were considered (XGBoost and AdaBoost).
Of these, based on the weighted metrics considered, the Random Forest classifier, with 87% accuracy, was the best-performing algorithm and was therefore used to predict the biomarkers causing NSCLC and SCLC. The imbalance and limited features in the dataset restrict further improvement in the model's accuracy or precision. In the present study, using the gene expression values (LogFC, P value) as the feature set in the Random Forest classifier, BRAF, KRAS, NRAS and EGFR are predicted to be possible biomarkers causing NSCLC, and ATF6, ATF3, PGDFA, PGDFD, PGDFC and PIP5K1C are predicted to be possible biomarkers causing SCLC, from the transcriptome analysis. After fine-tuning, the model achieved a precision of 91.3% and a recall of 91%. Some of the common biomarkers predicted for both NSCLC and SCLC were CDK4, CDK6, BAK1, CDKN1A and DDB2.
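The under-sampling step mentioned in the abstract can be illustrated with a minimal sketch of the Near Miss idea (version 1): for each larger class, keep only the samples whose mean distance to their k nearest minority-class samples is smallest. The abstract does not state which Near Miss version, distance metric or k was used, so those choices, the function name `near_miss_1` and the toy (LogFC, P value) rows below are all assumptions made for illustration; they are not data or code from the study.

```python
import math
from collections import Counter


def near_miss_1(X, y, k=2):
    """NearMiss-1 style under-sampling (illustrative sketch, not the study's code).

    Every class is reduced to the size of the smallest class. For each larger
    class, the samples kept are those with the smallest mean Euclidean distance
    to their k nearest samples of the smallest class.
    """
    counts = Counter(y)
    # Smallest class; ties are resolved by first appearance in y.
    minority_label, n_min = min(counts.items(), key=lambda kv: kv[1])
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    keep = [i for i, lab in enumerate(y) if lab == minority_label]

    for label in counts:
        if label == minority_label:
            continue
        scored = []
        for i, (x, lab) in enumerate(zip(X, y)):
            if lab != label:
                continue
            dists = sorted(math.dist(x, m) for m in minority)
            nearest = dists[:k]
            scored.append((sum(nearest) / len(nearest), i))
        scored.sort()
        keep.extend(i for _, i in scored[:n_min])

    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy rows: (LogFC, P value). Feature scaling is omitted for brevity.
X = [
    (2.1, 0.001), (1.9, 0.004),            # labelled NSCLC
    (-1.6, 0.002), (-1.4, 0.010),          # labelled SCLC
    (0.1, 0.60), (0.2, 0.45), (-0.1, 0.80),
    (0.05, 0.90), (-0.9, 0.20), (1.0, 0.15),  # labelled neither
]
y = ["NSCLC"] * 2 + ["SCLC"] * 2 + ["neither"] * 6

X_res, y_res = near_miss_1(X, y, k=2)
print(Counter(y_res))
```

In this toy run the six "neither" rows are cut down to the two lying closest to the smallest class, so every class ends up with the same number of samples before a classifier is trained.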
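The abstract reports weighted metrics (precision, recall) for a four-class problem. One common reading of "weighted" is a support-weighted average of the per-class scores; the abstract does not spell out the averaging scheme, so that reading is an assumption here. The sketch below shows the calculation on hypothetical labels; the function name and the example predictions are invented for illustration.

```python
from collections import Counter


def weighted_precision_recall(y_true, y_pred):
    """Support-weighted average of per-class precision and recall."""
    support = Counter(y_true)
    total = len(y_true)
    precision = 0.0
    recall = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        # A class that is never predicted contributes a precision of 0.
        class_precision = tp / predicted if predicted else 0.0
        class_recall = tp / n
        precision += class_precision * n / total
        recall += class_recall * n / total
    return precision, recall


# Hypothetical true and predicted gene classes.
y_true = ["NSCLC", "NSCLC", "SCLC", "both", "neither", "neither"]
y_pred = ["NSCLC", "SCLC", "SCLC", "both", "neither", "neither"]

precision, recall = weighted_precision_recall(y_true, y_pred)
print(f"weighted precision={precision:.3f}, weighted recall={recall:.3f}")
```

Under this averaging scheme the weighted recall equals plain accuracy, which is one reason accuracy and recall figures for the same model tend to sit close together.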