Mudunuri Uma S, Khouja Mohamad, Repetski Stephen, Venkataraman Girish, Che Anney, Luke Brian T, Girard F Pascal, Stephens Robert M
Advanced Biomedical Computing Center, Information Systems Program, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America.
PLoS One. 2013 Dec 2;8(12):e80503. doi: 10.1371/journal.pone.0080503. eCollection 2013.
As biomedical science adopts new technologies that produce unprecedented volumes of noisy, complex biological data, it has become evident that available methods for deriving meaningful information from such data are not keeping pace. To achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured, high-throughput data, such as human variation or expression data from very large cohorts, is especially urgent. In this study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools that we developed, as well as other open source and proprietary tools. Our results suggest that available Big Data technologies can reduce the time and effort needed to run distributed queries over large datasets in practical clinical applications in the life sciences. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation of how various data structures and data models are best mapped to the proper computational framework.