A review and comparative study of cancer detection using machine learning: SBERT and SimCSE application.

Affiliations

Department of Computer Science, University of Pretoria, Pretoria, South Africa.

CapeBio TM Technologies, Centurion, South Africa.

Publication information

BMC Bioinformatics. 2023 Mar 23;24(1):112. doi: 10.1186/s12859-023-05235-x.

Abstract

BACKGROUND

Pretrained convolutional neural networks and conventional machine learning methods have been widely used to identify various malignancies, with visual, biological, or electronic health record data as the sole input source. Typically, a series of preprocessing and image segmentation steps is first performed to separate region-of-interest features from noisy ones. The extracted features are then passed to several machine learning and deep learning methods for cancer detection.

METHODS

This work reviews the methods that have been applied to develop machine learning algorithms for cancer detection. Although there are more than 100 types of cancer, the review covers only research on the four most common cancers worldwide: lung, breast, prostate, and colorectal cancer. The study then proposes a new methodology for detecting cancer using two state-of-the-art sentence transformers, SBERT (2019) and the unsupervised SimCSE (2021). The method requires only the raw DNA sequences of matched tumor/normal pairs as input. The DNA representations learnt by SBERT and SimCSE are then passed to machine learning algorithms (XGBoost, Random Forest, LightGBM, and CNNs) for classification. As far as we are aware, SBERT and SimCSE transformers have not previously been applied to represent DNA sequences in cancer detection settings.
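As an illustration of how a raw DNA sequence might be prepared for a sentence transformer, the sketch below splits a sequence into overlapping k-mers and joins them into a space-separated "sentence". The abstract does not state how sequences were tokenized, so the k-mer scheme, the function name `kmer_sentence`, the value of `k`, and the toy sequences are all assumptions made for this example, not the authors' method.

```python
def kmer_sentence(sequence: str, k: int = 6) -> str:
    """Turn a raw DNA string into a space-separated 'sentence' of overlapping k-mers.

    Assumption: k-mer tokenization is used here only for illustration;
    the abstract does not describe the study's actual preprocessing.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields no k-mers, so the sentence is empty.
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical matched tumor/normal pair; these toy sequences are not from the study.
tumor = "ACGTTGCA"
normal = "ACGTAGCA"
sentences = [kmer_sentence(tumor, k=3), kmer_sentence(normal, k=3)]
```

Sentences of this form could then be encoded into fixed-length vectors, for example with the `encode` method of a `SentenceTransformer` model from the sentence-transformers library, and the resulting vectors passed to a classifier such as XGBoost.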

RESULTS

The XGBoost model, which had the highest overall accuracy of 73 ± 0.13 % using SBERT embeddings and 75 ± 0.12 % using SimCSE embeddings, was the best performing classifier. In light of these findings, it can be concluded that incorporating sentence representations from SimCSE's sentence transformer only marginally improved the performance of machine learning models.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cfb1/10037872/8da19a7059fb/12859_2023_5235_Fig1_HTML.jpg
