Knowledge and theme discovery across very large biological data sets using distributed queries: a prototype combining unstructured and structured data.

Authors

Mudunuri Uma S, Khouja Mohamad, Repetski Stephen, Venkataraman Girish, Che Anney, Luke Brian T, Girard F Pascal, Stephens Robert M

Affiliation

Advanced Biomedical Computing Center, Information Systems Program, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America.

Publication information

PLoS One. 2013 Dec 2;8(12):e80503. doi: 10.1371/journal.pone.0080503. eCollection 2013.

Abstract

As the discipline of biomedical science continues to apply new technologies capable of producing unprecedented volumes of noisy and complex biological data, it has become evident that available methods for deriving meaningful information from such data are simply not keeping pace. In order to achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured and high-throughput data such as human variation or expression data from very large cohorts, is especially urgent. For our study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools we developed as well as other open source and proprietary tools. Our results suggest that the available technologies within the Big Data domain can reduce the time and effort needed to utilize and apply distributed queries over large datasets in practical clinical applications in the life sciences domain. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation that investigates how various data structures and data models are best mapped to the proper computational framework.
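As a rough illustration of the kind of query the abstract describes, the sketch below joins gene mentions found in free-text abstracts (unstructured) with variant records (structured) using a MapReduce-style map, shuffle, and reduce. It is not the authors' tooling and does not use Hadoop; it runs in a single Python process. All names (`map_abstract`, `map_variant`, `reduce_join`, `run_query`) and the toy records are hypothetical placeholders chosen for this example.

```python
from collections import defaultdict

# Hypothetical toy inputs, invented purely for illustration.
ABSTRACTS = [
    ("pmid-1", "BRCA1 mutations are linked to breast cancer"),
    ("pmid-2", "TP53 and BRCA1 are tumor suppressors"),
]
VARIANTS = [
    ("BRCA1", "variant-a"),
    ("TP53", "variant-b"),
    ("EGFR", "variant-c"),
]
GENES = {"BRCA1", "TP53", "EGFR"}  # assumed gene vocabulary


def map_abstract(doc_id, text):
    """Map step for unstructured text: emit (gene, tagged doc id) pairs."""
    for token in text.split():
        if token in GENES:
            yield token, ("doc", doc_id)


def map_variant(gene, variant_id):
    """Map step for structured records: emit (gene, tagged variant id)."""
    yield gene, ("var", variant_id)


def shuffle(pairs):
    """Group mapped values by key, as the framework's shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(gene, values):
    """Reduce step: keep genes that appear in both data sources."""
    docs = sorted(v for tag, v in values if tag == "doc")
    variants = sorted(v for tag, v in values if tag == "var")
    if docs and variants:
        return gene, {"docs": docs, "variants": variants}
    return None


def run_query(abstracts, variants):
    """Run the map, shuffle, and reduce phases sequentially in one process."""
    pairs = []
    for doc_id, text in abstracts:
        pairs.extend(map_abstract(doc_id, text))
    for gene, variant_id in variants:
        pairs.extend(map_variant(gene, variant_id))
    results = {}
    for gene, values in shuffle(pairs).items():
        joined = reduce_join(gene, values)
        if joined is not None:
            results[joined[0]] = joined[1]
    return results


if __name__ == "__main__":
    for gene, hit in sorted(run_query(ABSTRACTS, VARIANTS).items()):
        print(gene, hit)
```

Tagging each value with its source ("doc" or "var") is a common way to express a reduce-side join; in a real cluster the map and reduce functions would run in parallel across data partitions rather than in a single loop.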


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0a16/3846626/e795283ac8af/pone.0080503.g001.jpg
