Sahoo Satya S, Wei Annan, Valdez Joshua, Wang Li, Zonjy Bilal, Tatsuoka Curtis, Loparo Kenneth A, Lhatoo Samden D
Division of Medical Informatics, School of Medicine, Case Western Reserve University, Cleveland, OH, USA; Electrical Engineering and Computer Science Department, School of Engineering, Case Western Reserve University, Cleveland, OH, USA.
Electrical Engineering and Computer Science Department, School of Engineering, Case Western Reserve University, Cleveland, OH, USA.
Front Neuroinform. 2016 Jun 6;10:18. doi: 10.3389/fninf.2016.00018. eCollection 2016.
Recent advances in neurological imaging and sensing technologies have led to a rapid increase in the volume, rate of generation, and variety of neuroscience data. This "neuroscience Big Data" represents a significant opportunity for the biomedical research community to design experiments using data that span longer timescales, include a large number of attributes, and are large enough to support statistically significant results. The results from these new data-driven research techniques can advance our understanding of complex neurological disorders, help model long-term effects of brain injuries, and provide new insights into the dynamics of brain networks. However, many existing neuroinformatics data processing and analysis tools were not built to manage large volumes of data, which makes it difficult for researchers to use the available data effectively to advance their research. We introduce a new toolkit called NeuroPigPen, developed using Apache Hadoop and the Pig data flow language, to address the challenges posed by large-scale electrophysiological signal data. NeuroPigPen is a modular toolkit that can process large volumes of electrophysiological signal data, such as electroencephalogram (EEG), electrocardiogram (ECG), and blood oxygen level (SpO2) recordings, using a new distributed storage model called Cloudwave Signal Format (CSF) that supports easy partitioning and storage of signal data on commodity hardware. NeuroPigPen was developed with three design principles: (a) scalability: the ability to efficiently process increasing volumes of data; (b) adaptability: the toolkit can be deployed across different computing configurations; and (c) ease of programming: the toolkit can be easily used to compose multi-step data processing pipelines using high-level programming constructs. The NeuroPigPen toolkit was evaluated using 750 GB of electrophysiological signal data over a variety of Hadoop cluster configurations ranging from 3 to 30 Data nodes. The evaluation results demonstrate that the toolkit is highly scalable and adaptable, which makes it suitable as a data processing toolkit for neuroscience applications. As part of the ongoing extension of NeuroPigPen, we are developing new modules that support statistical functions for analyzing signal data in brain connectivity research. In addition, the toolkit is being extended to allow integration with scientific workflow systems. NeuroPigPen is released under the BSD license at: https://sites.google.com/a/case.edu/neuropigpen/.
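The abstract notes that CSF supports easy partitioning of signal data, but it does not describe the format itself. As a rough illustration of the general idea only, the sketch below splits multi-channel recordings into fixed-duration, self-describing segments that could each be stored or processed independently. The `Segment` fields, the `partition_signals` function, and the channel names are all hypothetical and are not taken from CSF or the NeuroPigPen code base.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    """One self-contained unit of signal data.

    Hypothetical layout for illustration; this is NOT the actual CSF schema.
    """
    channel: str
    start_sample: int
    sampling_rate_hz: int
    samples: List[float]


def partition_signals(signals: Dict[str, List[float]],
                      sampling_rate_hz: int,
                      epoch_seconds: int) -> List[Segment]:
    """Split each channel into fixed-duration epochs.

    The final epoch of a channel may be shorter than the others.
    """
    if sampling_rate_hz <= 0 or epoch_seconds <= 0:
        raise ValueError("sampling_rate_hz and epoch_seconds must be positive")
    epoch_len = sampling_rate_hz * epoch_seconds
    segments: List[Segment] = []
    for channel, samples in signals.items():
        for start in range(0, len(samples), epoch_len):
            segments.append(
                Segment(channel, start, sampling_rate_hz,
                        samples[start:start + epoch_len])
            )
    return segments


if __name__ == "__main__":
    # Toy input: two channels of 1000 samples each (assumed names and values).
    toy = {"EEG-C3": [0.0] * 1000, "ECG": [0.0] * 1000}
    parts = partition_signals(toy, sampling_rate_hz=256, epoch_seconds=2)
    for seg in parts:
        print(seg.channel, seg.start_sample, len(seg.samples))
```

Because every segment carries its own channel name, offset, and sampling rate, segments can be distributed across nodes without reference to the rest of the recording, which is the property that a partition-friendly storage model would need under these assumptions.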