School of Computer Science, South China Normal University, Guangzhou 510631, China; Key Lab on Cloud Security and Assessment technology of Guangzhou, Guangzhou 510631, China; SCNU & VeChina Joint Lab on BlockChain Technology and Application, Guangzhou 510631, China.
Medical Genetic Centre and Maternal and Children Metabolic-Genetic Key Laboratory, Guangdong Women and Children Hospital, Guangzhou 511400, China.
Comput Biol Chem. 2022 Oct;100:107731. doi: 10.1016/j.compbiolchem.2022.107731. Epub 2022 Jul 16.
Chromosome karyotyping analysis is a vital cytogenetic technique for diagnosing genetic and congenital malformations, analyzing gestational and implantation failures, etc. Chromosome classification, an essential stage of chromosome karyotype analysis, is a highly time-consuming, tedious, and error-prone task that requires a large amount of manual work by experienced cytogenetics experts. Many deep learning-based methods have been proposed to address chromosome classification. However, two challenges remain in current methods. First, most existing methods were developed on different private datasets, which makes them difficult to compare with each other on a common basis. Second, because most existing methods omit the details needed to reproduce them, they are difficult to apply widely in clinical chromosome classification. To address these challenges, this work builds and publishes a massive clinical dataset. The dataset enables benchmarking and the building of chromosome classification baselines suitable for different scenarios. It consists of 126,453 privacy-preserving G-band chromosome instances from 2,763 karyotypes of 408 individuals. To the best of our knowledge, this is the first work to collect, annotate, and release a publicly available clinical chromosome classification dataset, and its scale exceeds 120,000 instances. The experimental results show that the proposed dataset boosts the performance of existing chromosome classification models to varying degrees, with the largest accuracy improvement being 5.39 percentage points. Moreover, the best baseline reaches 99.33% accuracy, which is state-of-the-art classification performance. The clinical dataset and state-of-the-art baselines can be found at https://github.com/CloudDataLab/BenchmarkForChromosomeClassification.
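As a quick sanity check on the dataset figures quoted in the abstract, the averages they imply can be computed directly. The sketch below uses only the three counts stated above; the variable and function names are illustrative and do not come from the published repository.

```python
# Counts quoted in the abstract (the names below are hypothetical).
NUM_INSTANCES = 126_453    # G-band chromosome instances
NUM_KARYOTYPES = 2_763     # karyotypes
NUM_INDIVIDUALS = 408      # individuals


def average(total: int, groups: int) -> float:
    """Return the mean number of items per group."""
    return total / groups


instances_per_karyotype = average(NUM_INSTANCES, NUM_KARYOTYPES)
karyotypes_per_individual = average(NUM_KARYOTYPES, NUM_INDIVIDUALS)

print(f"Chromosome instances per karyotype: {instances_per_karyotype:.2f}")
print(f"Karyotypes per individual: {karyotypes_per_individual:.2f}")
```

The first average comes out at roughly 45.8, slightly below the 46 chromosomes of a typical human karyotype, and the second at roughly 6.8 karyotypes per individual. The abstract does not explain the shortfall from 46, so no cause should be read into it here.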