HTSinfer: inferring metadata from bulk Illumina RNA-Seq libraries.

Authors

Balajti Máté, Kandhari Rohan, Jurič Boris, Zavolan Mihaela, Kanitz Alexander

Affiliations

Biozentrum, University of Basel, Basel 4056, Switzerland.

Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland.

Publication information

Bioinformatics. 2025 Mar 4;41(3). doi: 10.1093/bioinformatics/btaf076.

DOI: 10.1093/bioinformatics/btaf076
PMID: 39969909
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11889452/
Abstract

SUMMARY

The Sequence Read Archive (SRA) is one of the largest and fastest-growing repositories of sequencing data, containing tens of petabytes of sequenced reads. Its data are used by a wide scientific community, often beyond the primary study that generated them. Such analyses rely on accurate metadata concerning the type of experiment and library, as well as the organism from which the sequenced reads were derived. These metadata are typically entered manually by contributors in an error-prone process and are frequently incomplete. In addition, easy-to-use computational tools that verify the consistency and completeness of library metadata, and thereby facilitate data reuse, are largely unavailable. Here, we introduce HTSinfer, a Python-based tool that infers metadata directly and solely from bulk RNA-sequencing data generated on Illumina platforms. HTSinfer leverages genome sequence information and diagnostic genes to rapidly and accurately infer the library source and library type, as well as the relative read orientation, 3' adapter sequence and read length statistics. HTSinfer is written in a modular manner, is published under a permissive free and open-source license, and encourages contributions by the community, enabling easy addition of new functionality, e.g. for the inference of additional metrics or the support of different experiment types or sequencing platforms.

AVAILABILITY AND IMPLEMENTATION

HTSinfer is released under the Apache License 2.0. The latest code is available on GitHub at https://github.com/zavolanlab/htsinfer, while releases are published on Bioconda. A snapshot of the HTSinfer version described in this article was deposited on Zenodo (doi: 10.5281/zenodo.13985958).
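
To make two of the inferred properties named in the summary more concrete (read length statistics and the 3' adapter sequence), the following is a minimal Python sketch. It is not HTSinfer's code or API: the function names, the toy reads, and the short candidate adapter list are assumptions made for illustration, and the adapter check is a plain substring count, not the method used by the tool.

```python
import io
import statistics
from collections import Counter
from typing import Dict, Iterable, Iterator, List, TextIO


def iter_fastq_sequences(handle: TextIO) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record."""
    for index, line in enumerate(handle):
        if index % 4 == 1:  # second line of every record holds the bases
            yield line.strip()


def read_length_stats(sequences: Iterable[str]) -> Dict[str, float]:
    """Summarise read lengths (min, max, mean, median, mode)."""
    lengths = [len(seq) for seq in sequences]
    if not lengths:
        raise ValueError("no reads found")
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "mode": statistics.mode(lengths),
    }


def adapter_fractions(
    sequences: Iterable[str], candidates: Dict[str, str]
) -> Dict[str, float]:
    """Fraction of reads containing each candidate adapter as a substring."""
    reads: List[str] = list(sequences)
    if not reads:
        raise ValueError("no reads found")
    counts: Counter = Counter()
    for seq in reads:
        for name, adapter in candidates.items():
            if adapter in seq:
                counts[name] += 1
    return {name: counts[name] / len(reads) for name in candidates}


# Hypothetical candidate list (assumption): two adapter prefixes commonly
# seen in Illumina libraries; a real tool would use a curated collection.
CANDIDATE_ADAPTERS = {
    "truseq": "AGATCGGAAGAGC",
    "small_rna": "TGGAATTCTCGGGTGCCAAGG",
}

# Toy reads (assumption) written out as FASTQ text with dummy qualities.
TOY_READS = [
    "ACGTACGTAGATCGGAAGAGC",
    "TTGGCCAATT",
    "GGGAGATCGGAAGAGC",
    "ACGTACGTAC",
]
TOY_FASTQ = "".join(
    f"@read{i}\n{seq}\n+\n{'I' * len(seq)}\n"
    for i, seq in enumerate(TOY_READS, start=1)
)

print(read_length_stats(iter_fastq_sequences(io.StringIO(TOY_FASTQ))))
print(adapter_fractions(iter_fastq_sequences(io.StringIO(TOY_FASTQ)),
                        CANDIDATE_ADAPTERS))
```

Values like these can be compared with the metadata a submitter declared for a library; a mismatch suggests that the record deserves a closer look.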

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7021/11889452/1fed351d97d8/btaf076f1.jpg