Department of Computer Science and Engineering, Chandigarh University, Ajitgarh, Punjab, India.
Digital Zhejiang Technology Operations Co., Ltd., Hangzhou, China.
J Healthc Eng. 2021 Jul 27;2021:8689873. doi: 10.1155/2021/8689873. eCollection 2021.
A cancer tumour contains thousands of genetic mutations. Despite advances in technology, distinguishing driver mutations, which contribute to tumour growth, from passenger (neutral) mutations is still done manually. This is a time-consuming process in which pathologists interpret every genetic mutation by hand from the clinical evidence. The pieces of clinical evidence fall into nine classes, but the classification criterion is still unknown. The main aim of this research is to propose a multiclass classifier that classifies genetic mutations based on clinical evidence (i.e., the text descriptions of these mutations) using Natural Language Processing (NLP) techniques. The dataset is taken from Kaggle and is provided by the Memorial Sloan Kettering Cancer Center (MSKCC), with contributions from world-class researchers and oncologists. Three text transformation models, namely CountVectorizer, TfidfVectorizer, and Word2Vec, are used to convert the text into numerical features (token counts, TF-IDF weights, and word embeddings, respectively). Three machine learning classification models, namely Logistic Regression (LR), Random Forest (RF), and XGBoost (XGB), along with a deep learning Recurrent Neural Network (RNN) model, are applied to the sparse matrix (keyword count representation) of the text descriptions. The accuracy of each proposed classifier is evaluated using the confusion matrix. The empirical results show that the RNN model outperformed the other proposed classifiers, achieving the highest accuracy of 70%.
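To make the two mechanical steps in the abstract concrete (turning text into a matrix of token counts, and reading accuracy off a confusion matrix), the following is a minimal plain-Python sketch. It is not the authors' pipeline: the function names `count_matrix`, `confusion_matrix`, and `accuracy_from_confusion`, the whitespace tokenizer, and the toy sentences and labels are all assumptions made for illustration, and the counting step is only a simplified stand-in for the kind of matrix a tool such as CountVectorizer produces.

```python
from collections import Counter


def count_matrix(docs):
    """Bag-of-words counts: one row per document, one column per vocabulary term.

    Assumption: lowercase + whitespace tokenization (a deliberately naive choice).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            row[index[tok]] = n
        rows.append(row)
    return vocab, rows


def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i][j] = number of samples whose true class is i and predicted class is j."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def accuracy_from_confusion(cm):
    """Accuracy = correctly classified samples (the diagonal) / all samples."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


# Hypothetical toy inputs, not taken from the MSKCC dataset.
docs = [
    "mutation activates kinase signalling",
    "mutation has no known effect",
    "kinase mutation activates growth",
]
vocab, X = count_matrix(docs)

# Hypothetical labels and predictions over 3 classes (the study itself uses 9).
y_true = [0, 1, 0, 2]
y_pred = [0, 1, 2, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(vocab)
print(X)
print(cm, accuracy_from_confusion(cm))
```

With nine classes, as in the study, the same accuracy calculation applies to a 9 x 9 matrix; the off-diagonal cells additionally show which classes are confused with one another, which a single accuracy figure hides.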