Calculating the quality of public high-throughput sequencing data to obtain a suitable subset for reanalysis from the Sequence Read Archive.

Author information

Ohta Tazro, Nakazato Takeru, Bono Hidemasa

Publication information

Gigascience. 2017 Jun 1;6(6):1-8. doi: 10.1093/gigascience/gix029.

DOI:10.1093/gigascience/gix029

PMID:28449062

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5459929/

Abstract

It is important for public data repositories to promote the reuse of archived data. In the growing field of omics science, however, the increasing number of submissions of high-throughput sequencing (HTSeq) data to public repositories makes it difficult for users to choose a suitable data set from among the large number of search results. Repository users need to be able to set a threshold that reduces the number of results and yields a suitable subset of high-quality data for reanalysis. We calculated the quality of sequencing data archived in a public data repository, the Sequence Read Archive (SRA), by using the quality control software FastQC. We obtained quality values for 1 171 313 experiments, which can be used to evaluate the suitability of data for reuse. We also visualized the data distribution in SRA by integrating the quality information with the metadata of experiments and samples. We provide quality information for all of the archived sequencing data, which enables users to obtain sequencing data of sufficient quality for reanalysis. The calculated quality data are available to the public in various formats. Our data also provide an example of how a third party can enhance the reuse of public data by adding metadata to published research data.
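The threshold-based selection described in the abstract can be illustrated with a minimal sketch. It assumes that per-experiment quality values have already been computed and loaded into memory. The record fields (`mean_base_quality`, `total_reads`), the cutoff values, and the accession strings are hypothetical placeholders chosen for illustration. They are not the authors' data format or recommended thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentQuality:
    accession: str            # placeholder identifier, not a real SRA accession
    mean_base_quality: float  # hypothetical summary of per-base quality scores
    total_reads: int          # hypothetical read count for the experiment


def select_for_reanalysis(records, min_quality=30.0, min_reads=1_000_000):
    """Keep records that meet both thresholds, highest quality first."""
    kept = [
        r for r in records
        if r.mean_base_quality >= min_quality and r.total_reads >= min_reads
    ]
    return sorted(kept, key=lambda r: r.mean_base_quality, reverse=True)


# Illustrative values only; these numbers do not come from the paper.
records = [
    ExperimentQuality("EXP_A", 36.2, 25_000_000),
    ExperimentQuality("EXP_B", 21.5, 40_000_000),
    ExperimentQuality("EXP_C", 33.0, 500_000),
    ExperimentQuality("EXP_D", 31.4, 8_000_000),
]

subset = select_for_reanalysis(records)
print([r.accession for r in subset])
```

Raising or lowering `min_quality` and `min_reads` trades the size of the subset against its quality, which is the kind of adjustment the abstract says repository users need to be able to make.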

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4997/5459929/e5f82f115a91/gix029fig1.jpg
