
MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive.

Affiliations

Department of Computer Sciences.

Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI 53706, USA.

Publication information

Bioinformatics. 2017 Sep 15;33(18):2914-2923. doi: 10.1093/bioinformatics/btx334.

DOI:10.1093/bioinformatics/btx334
PMID:28535296
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5870770/
Abstract

MOTIVATION

The NCBI's Sequence Read Archive (SRA) promises great biological insight if one could analyze the data in the aggregate; however, the data remain largely underutilized, in part, due to the poor structure of the metadata associated with each sample. The rules governing submissions to the SRA do not dictate a standardized set of terms that should be used to describe the biological samples from which the sequencing data are derived. As a result, the metadata include many synonyms, spelling variants and references to outside sources of information. Furthermore, manual annotation of the data remains intractable due to the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across diverse diseases, tissues and cell types present in the SRA.

RESULTS

We present MetaSRA, a database of normalized SRA human sample-specific metadata following a schema inspired by the metadata organization of the ENCODE project. This schema involves mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. We automated these tasks via a novel computational pipeline.
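The three schema components named above (ontology term mappings, a sample-type category, and real-valued properties) can be pictured with a minimal sketch. This is not the MetaSRA pipeline: the synonym table, the `EX:` term IDs, the `SRS000001` accession, the "tissue" category rule, and the names `NormalizedSample` and `normalize` are all hypothetical and exist only for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical synonym table: raw metadata strings -> placeholder term IDs.
# The "EX:" IDs are invented for this sketch and belong to no real ontology.
SYNONYMS = {
    "liver": "EX:0000001",
    "hepatic tissue": "EX:0000001",
    "breast cancer": "EX:0000002",
    "breast carcinoma": "EX:0000002",
}

# Hypothetical set of terms that imply the sample-type category "tissue".
TISSUE_TERMS = {"EX:0000001"}

# Matches values such as "45 years" or "2.5 mg": a number followed by a unit.
NUMERIC = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


@dataclass
class NormalizedSample:
    accession: str
    ontology_terms: set = field(default_factory=set)
    sample_type: str = "unknown"
    real_value_properties: list = field(default_factory=list)


def normalize(accession, raw):
    """Turn free-text key-value metadata into a structured record (toy logic)."""
    sample = NormalizedSample(accession)
    for key, value in raw.items():
        text = value.strip().lower()
        # 1. Map synonyms and case variants to a single term ID.
        if text in SYNONYMS:
            sample.ontology_terms.add(SYNONYMS[text])
            continue
        # 2. Extract (property, number, unit) triples from numeric values.
        match = NUMERIC.match(value)
        if match:
            sample.real_value_properties.append(
                (key.strip().lower(), float(match.group(1)), match.group(2).lower())
            )
    # 3. Assign a sample-type category with a toy rule.
    if sample.ontology_terms & TISSUE_TERMS:
        sample.sample_type = "tissue"
    return sample


record = normalize(
    "SRS000001",
    {"tissue": "Hepatic tissue", "age": "45 years", "disease": "Breast Carcinoma"},
)
print(record)
```

In this sketch, "Hepatic tissue" and "liver" resolve to the same placeholder term, which is the kind of synonym collapsing the motivation section calls for. A real system would replace the dictionary lookup with matching against actual biomedical ontologies.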

AVAILABILITY AND IMPLEMENTATION

The MetaSRA is available at metasra.biostat.wisc.edu via both a searchable web interface and bulk downloads. Software implementing our computational pipeline is available at http://github.com/deweylab/metasra-pipeline.

CONTACT

cdewey@biostat.wisc.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35e/5870770/8510adbe23c4/btx334f6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35e/5870770/39c9bf218d5c/btx334f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35e/5870770/e4fc56251232/btx334f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35e/5870770/705f6e328cba/btx334f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35e/5870770/d1ae59e05206/btx334f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a35e/5870770/69b7073f4819/btx334f5.jpg
