A novel deep machine learning algorithm with dimensionality and size reduction approaches for feature elimination: thyroid cancer diagnoses with randomly missing data.

Affiliations

Adana Alparslan Turkes Science and Technology University, Adana, Turkey.

University of Health Sciences, Adana City Training and Research Hospital, Adana, Turkey.

Publication information

Brief Bioinform. 2024 May 23;25(4). doi: 10.1093/bib/bbae344.

DOI:10.1093/bib/bbae344

PMID:39007597

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11247408/

Abstract

Thyroid cancer incidence continues to increase even though a large number of inspection tools have been developed recently. Because there is no standard, definitive procedure for diagnosing thyroid cancer, clinicians have to conduct a variety of tests. This process yields multi-dimensional big data, and the lack of a common approach leads to randomly distributed missing (sparse) data; both are formidable challenges for machine learning algorithms. This paper aims to develop an accurate and computationally efficient deep learning algorithm to diagnose thyroid cancer. To this end, the singularity that randomly distributed missing data causes in the learning problem is treated, and dimensionality reduction with inner and target similarity approaches is developed to select the most informative input datasets. In addition, size reduction with a hierarchical clustering algorithm is performed to eliminate highly similar data samples. Four machine learning algorithms are trained and then tested with unseen data to validate their generalization and robustness. The results show 100% preciseness in training and 83% in testing on the unseen data. The computational time efficiency of the algorithms is also examined under equal conditions.
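The abstract names three preprocessing steps without giving their formulas: handling missing entries, selecting inputs by inner and target similarity, and removing near-duplicate samples with hierarchical clustering. The Python sketch below illustrates one plausible reading of that pipeline and is not the authors' implementation. Everything specific in it is an assumption made for illustration: column-mean imputation as the missing-data treatment, absolute Pearson correlation as both similarity measures, the threshold values, average linkage, the class-aware choice of representatives, and all function and parameter names.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def impute_column_means(X):
    """Replace NaN entries with the mean of the observed values in each column.

    Assumption: mean imputation stands in for the paper's missing-data
    treatment. A column with no observed values would stay NaN.
    """
    X = np.array(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X


def select_features(X, y, target_min=0.1, inner_max=0.95):
    """Return indices of features that track the target and are not redundant.

    Assumption: absolute Pearson correlation is used for both the
    feature-target ("target") and feature-feature ("inner") similarity.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    target_sim = np.nan_to_num(
        np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)])
    )
    inner_sim = np.nan_to_num(np.abs(np.corrcoef(X, rowvar=False)))
    kept = []
    for j in np.argsort(-target_sim):  # most target-similar feature first
        if target_sim[j] < target_min:
            break  # every remaining feature is even less informative
        if all(inner_sim[j, k] < inner_max for k in kept):
            kept.append(int(j))
    return sorted(kept)


def reduce_samples(X, y, distance_threshold=0.5):
    """Return row indices of one representative per (cluster, class) pair.

    Assumption: average-linkage clustering on Euclidean distance; keeping
    the first row seen for each cluster and class label is a simple,
    class-aware stand-in for the paper's size reduction.
    """
    X = np.asarray(X, dtype=float)
    tree = linkage(X, method="average")
    labels = fcluster(tree, t=distance_threshold, criterion="distance")
    seen, keep = set(), []
    for i, key in enumerate(zip(labels.tolist(), list(y))):
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return np.array(keep)


# Toy data (hypothetical): column 1 nearly duplicates column 0 and has one
# missing entry; column 2 is unrelated to the label.
X_raw = np.array([
    [0.0, 0.1, 1.0],
    [1.0, 0.9, -1.0],
    [2.0, np.nan, 0.0],
    [3.0, 2.9, 0.0],
    [4.0, 4.1, -1.0],
    [5.0, 4.9, 1.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

X = impute_column_means(X_raw)
features = select_features(X, y)
rows = reduce_samples(X[:, features], y, distance_threshold=1.5)
print("kept feature columns:", features)
print("kept sample rows:", rows.tolist())
```

Running the feature step before the sample step means the clustering distances are computed only on the retained inputs. The abstract does not state the order the authors use, so this ordering is also an assumption.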
