Towards Structuring Unstructured GenBank Metadata for Enhancing Comparative Biological Studies.

Authors

Chen Elizabeth S, Sarkar Indra Neil

Affiliation

Center for Clinical and Translational Science.

Publication

AMIA Jt Summits Transl Sci Proc. 2011;2011:6-10. Epub 2011 Mar 7.

PMID:22211174

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3248757/

Abstract

Large sequence repositories such as GenBank hold a wealth of metadata that provides contextual information and may enhance search and retrieval of relevant sequences for a range of subsequent analyses. One challenge is the use of free text in these metadata fields, which calls for approaches to extract, structure, and encode the essential information. The goal of the present study was to explore the feasibility of using a combination of existing resources to annotate unstructured GenBank metadata, initially focusing on the "host" and "isolation_source" fields. This paper summarizes early results for 10 host organisms, including a characterization of the associated isolation sources with respect to biomedical ontologies and semantic types. The findings of this preliminary study provide insight into the rich information captured within these unstructured metadata, offer guidance for addressing the challenges and issues encountered, and highlight the potential value for enriching comparative biological studies towards improving human health.
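To make the extraction step concrete, the following is a minimal sketch of how the free-text "host" and "isolation_source" qualifiers might be pulled out of a GenBank flat-file record before any ontology annotation. It is not the authors' pipeline: the sample record, the function names `extract_qualifiers` and `group_sources_by_host`, and the regular expression are all assumptions made for illustration. The sketch also assumes qualifier values contain no embedded double quotes.

```python
import re

# Hypothetical excerpt of a GenBank flat-file "source" feature.
# The values are illustrative only and do not come from a real record.
SAMPLE_RECORD = '''\
FEATURES             Location/Qualifiers
     source          1..1350
                     /organism="uncultured bacterium"
                     /mol_type="genomic DNA"
                     /host="Homo sapiens"
                     /isolation_source="stool sample from healthy
                     adult"
'''

# Matches /name="value"; the value may span several lines.
QUALIFIER_RE = re.compile(r'/(\w+)="([^"]*)"')


def extract_qualifiers(record_text, wanted=("host", "isolation_source")):
    """Return {qualifier: value} for the requested free-text qualifiers."""
    found = {}
    for name, value in QUALIFIER_RE.findall(record_text):
        if name in wanted:
            # Values wrap across lines in the flat file; collapse whitespace.
            found[name] = " ".join(value.split())
    return found


def group_sources_by_host(records):
    """Map each host string to the set of isolation sources reported with it."""
    grouped = {}
    for text in records:
        quals = extract_qualifiers(text)
        if "host" in quals and "isolation_source" in quals:
            grouped.setdefault(quals["host"], set()).add(quals["isolation_source"])
    return grouped


if __name__ == "__main__":
    print(extract_qualifiers(SAMPLE_RECORD))
    print(group_sources_by_host([SAMPLE_RECORD]))
```

The grouped isolation-source strings would then be the input to whichever annotation service maps free text to ontology concepts and semantic types; that step is left out here because the abstract does not specify its interface.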

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/91cd/3248757/e0e904dbd54d/6-tbi_summit_2011f1.jpg
