Knowledge and theme discovery across very large biological data sets using distributed queries: a prototype combining unstructured and structured data.

Authors

Mudunuri Uma S, Khouja Mohamad, Repetski Stephen, Venkataraman Girish, Che Anney, Luke Brian T, Girard F Pascal, Stephens Robert M

Affiliation

Advanced Biomedical Computing Center, Information Systems Program, SAIC-Frederick, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America.

Publication information

PLoS One. 2013 Dec 2;8(12):e80503. doi: 10.1371/journal.pone.0080503. eCollection 2013.

DOI:10.1371/journal.pone.0080503

PMID:24312478

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3846626/

Abstract

As the discipline of biomedical science continues to apply new technologies capable of producing unprecedented volumes of noisy and complex biological data, it has become evident that available methods for deriving meaningful information from such data are simply not keeping pace. In order to achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. As we move towards personalized medicine, the need to combine unstructured data, such as medical literature, with large amounts of highly structured and high-throughput data such as human variation or expression data from very large cohorts, is especially urgent. For our study, we investigated a likely biomedical query using the Hadoop framework. We ran queries using native MapReduce tools we developed as well as other open source and proprietary tools. Our results suggest that the available technologies within the Big Data domain can reduce the time and effort needed to utilize and apply distributed queries over large datasets in practical clinical applications in the life sciences domain. The methodologies and technologies discussed in this paper set the stage for a more detailed evaluation that investigates how various data structures and data models are best mapped to the proper computational framework.
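The abstract does not spell out the query itself, so the following is only an illustrative sketch of the general MapReduce pattern it refers to: combining unstructured text with structured records by mapping both to a shared key and reducing per key. It runs in memory in plain Python rather than on Hadoop, and every name and record in it (`ABSTRACTS`, `VARIANTS`, `GENES`, the document and variant identifiers) is a hypothetical placeholder, not data or code from the paper.

```python
from collections import defaultdict

# Hypothetical inputs (placeholders, not data from the paper):
# free-text documents and structured variant rows keyed by gene symbol.
ABSTRACTS = [
    ("doc1", "TP53 mutations are frequent in glioma."),
    ("doc2", "EGFR amplification and TP53 loss co-occur."),
]
VARIANTS = [
    ("TP53", "variant_A"),
    ("EGFR", "variant_B"),
    ("BRCA1", "variant_C"),
]
GENES = {"TP53", "EGFR", "BRCA1"}  # assumed dictionary of gene symbols


def map_abstract(doc_id, text):
    """Map step for unstructured text: emit (gene, ("doc", doc_id))."""
    for token in text.replace(".", " ").replace(",", " ").split():
        if token in GENES:
            yield token, ("doc", doc_id)


def map_variant(gene, variant_id):
    """Map step for structured rows: emit (gene, ("var", variant_id))."""
    yield gene, ("var", variant_id)


def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_gene(gene, values):
    """Reduce step: collect the documents and variants seen for one gene."""
    docs = sorted(v for tag, v in values if tag == "doc")
    variants = sorted(v for tag, v in values if tag == "var")
    return gene, docs, variants


def run_query():
    """Return genes that appear in both the text and the variant table."""
    pairs = []
    for doc_id, text in ABSTRACTS:
        pairs.extend(map_abstract(doc_id, text))
    for gene, variant_id in VARIANTS:
        pairs.extend(map_variant(gene, variant_id))
    grouped = shuffle(pairs)
    results = [reduce_gene(g, vals) for g, vals in sorted(grouped.items())]
    return [r for r in results if r[1] and r[2]]


if __name__ == "__main__":
    for gene, docs, variants in run_query():
        print(gene, docs, variants)
```

On a real cluster the shuffle step is handled by the framework and the map and reduce functions run in parallel across nodes; the tagging of values by source (`"doc"` versus `"var"`) is one common way to express such a join, chosen here for brevity.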

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0a16/3846626/e795283ac8af/pone.0080503.g001.jpg
