Columbia University Medical Center, PB 1-301, New York, NY, 10032, USA.
J Digit Imaging. 2018 Aug;31(4):513-519. doi: 10.1007/s10278-018-0053-3.
Bone age assessment (BAA) is a commonly performed diagnostic study in pediatric radiology to assess skeletal maturity. The most commonly utilized method for BAA is the Greulich and Pyle atlas method (Pediatr Radiol 46.9:1269-1274, 2016; Arch Dis Child 81.2:172-173, 1999). BAA can be a tedious and time-consuming process for the radiologist. As such, several computer-assisted detection/diagnosis (CAD) methods have been proposed for automation of BAA. Classical CAD tools have traditionally relied on hard-coded algorithmic features for BAA, which suffer from a variety of drawbacks. Recently, convolutional neural networks (CNNs) have shown promise in a variety of medical imaging applications. There have been at least two published applications of deep learning to the evaluation of bone age (Med Image Anal 36:41-51, 2017; JDI 1-5, 2017). However, current implementations are limited by both architecture design and relatively small datasets. The purpose of this study is to demonstrate the benefits of a customized neural network algorithm carefully calibrated to the evaluation of bone age, utilizing a relatively large institutional dataset. In doing so, this study aims to show that advanced architectures can be successfully trained from scratch in the medical imaging domain and can generate results that outperform any existing proposed algorithm. The training data consisted of 10,289 images of different skeletal age examinations, 8909 from the hospital Picture Archiving and Communication System at our institution and 1383 from the public Digital Hand Atlas Database. The data was separated into four cohorts, one each for male and female children above the age of 8, and one each for male and female children below the age of 10. The testing set consisted of 20 radiographs for each 1-year age cohort from 0-1 years to 14-15+ years, half male and half female.
The testing set included left-hand radiographs done for bone age assessment, trauma evaluation without significant findings, and skeletal surveys. A customized neural network with 14 hidden layers was designed for this study. The network included several state-of-the-art techniques, including residual-style connections, inception layers, and spatial transformer layers. Data augmentation was applied to the network inputs to prevent overfitting. A linear regression output was utilized. Mean square error was used as the network loss function, and mean absolute error (MAE) was utilized as the primary performance metric. For young females, validation and test MAEs were 0.654 and 0.561, respectively. For older females, validation and test MAEs were 0.662 and 0.497, respectively. For young males, validation and test MAEs were 0.649 and 0.585, respectively. Finally, for older males, validation and test MAEs were 0.581 and 0.501, respectively. The female cohorts were trained for 900 epochs each and the male cohorts were trained for 600 epochs each. Eightfold cross-validation was employed for hyperparameter tuning. Test error was obtained after training on the full data set with the selected hyperparameters. Using our proposed customized neural network architecture on our large available data, we achieved aggregate validation and test set mean absolute errors of 0.637 and 0.536, respectively. To date, this is the best published performance in using deep learning for bone age assessment. Our results support our initial hypothesis that customized, purpose-built neural networks provide improved performance over networks derived from pre-trained imaging data sets. We build on that initial work by showing that the addition of state-of-the-art techniques such as residual connections and inception architecture further improves prediction accuracy.
This is important because the current assumption for use of residual and/or inception architectures is that a large pre-trained network is required for successful implementation, given the relatively small datasets in medical imaging. Instead, we show that a small, customized architecture incorporating advanced CNN strategies can indeed be trained from scratch, yielding significant improvements in algorithm accuracy. It should be noted that for all four cohorts, test error was lower than validation error. One reason for this is that the ground truth for our test set was obtained by averaging the reads of two pediatric radiologists, whereas only a single read was used for our training data. This suggests that despite relatively noisy training data, the algorithm could successfully model the variation between observers and generate estimates that are close to the expected ground truth.
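The evaluation quantities named in the abstract can be illustrated with a minimal sketch. It assumes NumPy and shows the mean square error loss, the mean absolute error metric, a test ground truth formed by averaging two reads, and an eightfold index split. The function names, the random seed, and the example values (assumed here to be bone ages in years) are hypothetical and are not taken from the study; the network itself is not reproduced.

```python
import numpy as np


def mse(pred, target):
    """Mean square error, the quantity the abstract names as the loss."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))


def mae(pred, target):
    """Mean absolute error, the quantity the abstract names as the metric."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(pred - target)))


def consensus_read(read_a, read_b):
    """Average two readers' estimates into one ground-truth value per case."""
    return (np.asarray(read_a, dtype=float) + np.asarray(read_b, dtype=float)) / 2.0


def kfold_indices(n_samples, k=8, seed=0):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]


# Hypothetical predictions and reader estimates for three test cases.
predictions = [5.0, 9.5, 12.0]
reader_a = [5.5, 9.0, 12.5]
reader_b = [4.5, 10.0, 12.5]

truth = consensus_read(reader_a, reader_b)  # [5.0, 9.5, 12.5]
test_mae = mae(predictions, truth)
test_mse = mse(predictions, truth)
print(f"MAE={test_mae:.3f}  MSE={test_mse:.3f}")
```

In this toy example the two readers disagree by a full unit on the first two cases, yet their average matches the prediction exactly, which mirrors the abstract's point that an averaged ground truth can be less noisy than any single read.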