Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, s/n, Porto, 4200-465, Portugal.
INESC TEC - Institute for Systems and Computer Engineering, Technology and Science, Porto, Portugal.
BMC Med Inform Decis Mak. 2020 Aug 20;20(Suppl 5):141. doi: 10.1186/s12911-020-01150-w.
Cancer remains one of the most prevalent and deadliest diseases, accounting for more than 9 million deaths in 2018. This has motivated researchers to study machine learning-based solutions for cancer detection, with the aim of accelerating diagnosis and supporting prevention. One such approach is to classify tumor samples automatically from their gene expression profiles.
In this work, we aim to distinguish five types of cancer using RNA-Seq datasets: thyroid, skin, stomach, breast, and lung. To do so, we adopt a previously described methodology and use it to compare the performance of 3 different autoencoders (AEs) used as a weight initialization technique for a deep neural network. Our experiments assess two approaches to training the classification model (fixing the weights after pre-training the AEs, or allowing fine-tuning of the entire network) and two strategies for embedding the AEs into the classification network (importing only the encoding layers, or inserting the complete AE). We then study how the network's overall classification performance is affected by varying the number of layers in the first strategy, the dimension of the AEs' latent vector, and the imputation technique used in the data preprocessing step. Finally, to assess how well this pipeline generalizes, we apply the same methodology to two additional datasets containing features extracted from images of malaria thin blood smears and of breast mass cell nuclei. We also rule out overfitting by evaluating on held-out test sets for the image datasets.
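The two training approaches and two embedding strategies described above give four configurations. The following is a minimal sketch of how they could be enumerated, not the authors' implementation. The `Layer` record, the `build_classifier` helper, the layer sizes (512, 128), the input dimension (2000), the two-class output, and the symmetric decoder are all illustrative assumptions that do not come from the source.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Layer:
    """Hypothetical description of one dense layer in the classifier."""
    name: str
    units: int
    trainable: bool   # False means the weights stay fixed during training
    pretrained: bool  # True means the weights come from the pre-trained AE


def build_classifier(encoder_units, input_dim, n_classes, embed, fine_tune):
    """Return the layer stack for one experimental configuration.

    embed:     "encoder_only" imports just the AE encoding layers;
               "full_ae" inserts encoder and decoder before the classifier.
    fine_tune: False fixes the imported AE weights; True lets them train.
    """
    if embed not in ("encoder_only", "full_ae"):
        raise ValueError(f"unknown embedding strategy: {embed!r}")

    layers = [
        Layer(f"enc_{i}", units, fine_tune, True)
        for i, units in enumerate(encoder_units)
    ]
    if embed == "full_ae":
        # Assumption: the decoder mirrors the encoder and ends by
        # reconstructing the input dimension.
        decoder_units = list(reversed(encoder_units[:-1])) + [input_dim]
        layers += [
            Layer(f"dec_{i}", units, fine_tune, True)
            for i, units in enumerate(decoder_units)
        ]
    # The classification layer is new, so it is always trained from scratch.
    layers.append(Layer("classifier", n_classes, True, False))
    return layers


# Enumerate the 2 x 2 grid of embedding strategy and training approach.
configs = {
    (embed, fine_tune): build_classifier([512, 128], 2000, 2, embed, fine_tune)
    for embed, fine_tune in product(("encoder_only", "full_ae"), (False, True))
}

for (embed, fine_tune), layers in configs.items():
    summary = ", ".join(
        f"{l.name}[{l.units}]{'' if l.trainable else ' (fixed)'}" for l in layers
    )
    print(f"{embed:12s} fine_tune={fine_tune!s:5s} -> {summary}")
```

In a deep learning framework, the `trainable` flag would typically map to freezing or unfreezing the corresponding layer's weights before training the classifier.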
The methodology attained good overall results for both the RNA-Seq and the image-extracted data. We outperformed the established baseline on all the considered datasets, achieving average F scores of 99.03, 89.95, and 98.84 and MCCs of 0.99, 0.84, and 0.98 for the RNA-Seq data (when detecting thyroid cancer), the Malaria data, and the Wisconsin Breast Cancer data, respectively.
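For readers unfamiliar with the two reported metrics, the sketch below computes the F score (taken here to be F1) and the Matthews correlation coefficient (MCC) from binary confusion-matrix counts using their standard definitions. The abstract does not state how the scores were computed or averaged, so the function names and the example counts are illustrative assumptions and are not results from the study.

```python
import math


def f1_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall, in [0, 1]."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient, in [-1, 1].

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up counts for illustration only.
tp, fp, fn, tn = 90, 5, 10, 95
print(f"F1  = {100 * f1_score(tp, fp, fn):.2f}")  # scaled to 0-100
print(f"MCC = {mcc(tp, fp, fn, tn):.2f}")
```

Unlike the F score, the MCC uses all four cells of the confusion matrix, which makes it a useful complement when the classes are imbalanced.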
We observed that fine-tuning the weights of the top layers imported from the AE achieved the best results in all the presented experiments and on all the considered datasets. We also outperformed all previously reported results when compared against the established baselines.