Bao Shunxing, Weitendorf Frederick D, Plassard Andrew J, Huo Yuankai, Gokhale Aniruddha, Landman Bennett A
Computer Science, Vanderbilt University, Nashville, TN, USA 37235.
Electrical Engineering, Vanderbilt University, Nashville, TN, USA 37235.
Proc SPIE Int Soc Opt Eng. 2017 Feb 11;10138. doi: 10.1117/12.2254712. Epub 2017 Mar 13.
The field of big data is generally concerned with the scale of processing at which traditional computational paradigms break down. In medical imaging, traditional large-scale processing uses a cluster computer that combines a group of workstation nodes into a functional unit controlled by a job scheduler. Typically, a shared-storage network file system (NFS) is used to host imaging data. However, data transfer from storage to processing nodes can saturate network bandwidth when data are frequently uploaded to or retrieved from the NFS, e.g., with "short" processing times and/or "large" datasets. Recently, an alternative approach using Hadoop and HBase was presented for medical imaging to enable co-location of data storage and computation while minimizing data transfer. The benefits of such a framework must be formally evaluated against a traditional approach to characterize the point at which simply "large scale" processing transitions into "big data" and necessitates alternative computational frameworks. The proposed Hadoop system was implemented on a production lab cluster alongside a standard Sun Grid Engine (SGE). Theoretical models of wall-clock time and resource time for both approaches are introduced and validated. To provide real example data, three T1 image archives were retrieved from a secure, shared university web database and used to assess computational performance empirically under three configurations of cluster hardware (using 72, 109, or 209 CPU cores) with differing job lengths. The empirical results match the theoretical models. Based on these data, a comparative analysis is presented of when the Hadoop framework will and will not be relevant for medical imaging.
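The abstract's theoretical models are not reproduced here, but the trade-off they describe can be illustrated with a back-of-the-envelope cost sketch. The Python sketch below is an assumption for illustration only, not the paper's actual model: the function names, the shared-bandwidth formula, and all parameter values (job count, per-job compute time, image size, link speed, Hadoop scheduling overhead) are hypothetical.

# Minimal sketch (not the paper's model): estimates wall-clock and resource
# time for an NFS/SGE-style workflow versus a Hadoop-style co-located
# workflow. All formulas and numbers below are illustrative assumptions.

def nfs_wall_clock(n_jobs, cores, compute_s, data_mb, link_mbps):
    """NFS/SGE-style estimate: every job pulls its data over a shared link,
    so per-job transfer time grows as concurrent jobs saturate the network."""
    waves = -(-n_jobs // cores)                             # ceil(n_jobs / cores)
    concurrent = min(n_jobs, cores)
    transfer_s = (data_mb * 8) / (link_mbps / concurrent)   # shared bandwidth
    return waves * (transfer_s + compute_s)

def hadoop_wall_clock(n_jobs, cores, compute_s, locality_overhead_s=5.0):
    """Hadoop/HBase-style estimate: data is co-located with computation, so
    per-job transfer is replaced by a small fixed scheduling overhead."""
    waves = -(-n_jobs // cores)
    return waves * (compute_s + locality_overhead_s)

def resource_time(wall_clock_s, cores):
    """Resource time = wall-clock time multiplied by the cores held."""
    return wall_clock_s * cores

if __name__ == "__main__":
    # Hypothetical example: 1000 short jobs (60 s each), 300 MB per T1 image,
    # shared 1 Gbps link, run on the three core counts from the abstract.
    for cores in (72, 109, 209):
        t_nfs = nfs_wall_clock(1000, cores, 60.0, 300.0, 1000.0)
        t_hdp = hadoop_wall_clock(1000, cores, 60.0)
        print(f"{cores:>3} cores: NFS {t_nfs/3600:.2f} h, Hadoop {t_hdp/3600:.2f} h")

Under assumed numbers like these, the shared-transfer term dominates when jobs are short and datasets are large, which is the regime the abstract identifies as the one where co-locating storage and computation pays off; for long-running jobs the compute term dominates and the traditional NFS/SGE approach remains adequate.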