Elsamahy Emad A, Ahmed Asmaa E, Shoala Tahseen, Maghraby Fahima A
College of Computing and Information Technology, Arab Academy for Science, Technology, and Maritime Transport, Cairo, Egypt.
Environmental Biotechnology Department, College of Biotechnology, Misr University for Science and Technology, Giza, 12563, Egypt.
Heliyon. 2024 May 31;10(11):e32279. doi: 10.1016/j.heliyon.2024.e32279. eCollection 2024 Jun 15.
Early detection and treatment of cancer depend on identifying the specific genes that cause it. Genetic mutations were initially classified manually, a process that relies on pathologists and can be time-consuming. To improve the precision of clinical interpretation, researchers have therefore developed computational algorithms that leverage next-generation sequencing technologies for automated mutation analysis. This paper applies four deep learning classification models trained on collections of biomedical text. The first is Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT), a specialized language model built for biological contexts; simply adding an extra layer to BioBERT can yield impressive results on multiple tasks, including text classification, language inference, and question answering. The other three, Bidirectional Encoder Representations from Transformers (BERT), long short-term memory (LSTM), and bidirectional LSTM (BiLSTM), have also been leveraged to categorize genetic mutations from textual evidence with very good results. The dataset used in this work was created by Memorial Sloan Kettering Cancer Center (MSKCC) and contains several mutations; it also poses a major classification challenge in the Kaggle research prediction competitions. Three challenges were identified in carrying out the work: enormous text length, biased representation of the data, and repeated data instances. On commonly used evaluation metrics, the BioBERT model outperforms the other models with an F1 score of 0.87 and a Matthews correlation coefficient (MCC) of 0.850, an improvement over comparable results in the literature, where the BERT model achieved an F1 score of 0.70.
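The two metrics quoted above, the F1 score and MCC, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `macro_f1` and `multiclass_mcc` are hypothetical, the toy labels are made up, and macro averaging of F1 is an assumption, since the abstract does not say which averaging scheme was used. MCC is computed with the standard multiclass generalization based on the confusion counts.

```python
from collections import Counter
from math import sqrt


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient from label lists."""
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly classified
    t_counts = Counter(y_true)                        # true occurrences per class
    p_counts = Counter(y_pred)                        # predictions per class
    cov = c * s - sum(t_counts[k] * p_counts[k] for k in t_counts)
    denom = sqrt(
        (s * s - sum(v * v for v in p_counts.values()))
        * (s * s - sum(v * v for v in t_counts.values()))
    )
    # Conventionally 0 when the denominator vanishes (e.g. a single class).
    return cov / denom if denom else 0.0


if __name__ == "__main__":
    # Toy class IDs for illustration only; not taken from the MSKCC dataset.
    y_true = [1, 1, 2, 2, 3, 3]
    y_pred = [1, 2, 2, 2, 3, 1]
    print("macro F1:", macro_f1(y_true, y_pred))
    print("MCC:", multiclass_mcc(y_true, y_pred))
```

Unlike accuracy, MCC uses every cell of the confusion matrix, which is why it is often reported alongside F1 when classes are unevenly represented, as they are in the dataset described here.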