Balajti Máté, Kandhari Rohan, Jurič Boris, Zavolan Mihaela, Kanitz Alexander
Biozentrum, University of Basel, Basel 4056, Switzerland.
Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland.
Bioinformatics. 2025 Mar 4;41(3). doi: 10.1093/bioinformatics/btaf076.
The Sequence Read Archive is one of the largest and fastest-growing repositories of sequencing data, containing tens of petabytes of sequenced reads. Its data are used by a wide scientific community, often beyond the primary study that generated them. Such analyses rely on accurate metadata concerning the type of experiment and library, as well as the organism from which the sequenced reads were derived. These metadata are typically entered manually by contributors in an error-prone process and are frequently incomplete. In addition, easy-to-use computational tools that verify the consistency and completeness of library metadata, and thereby facilitate data reuse, are largely unavailable. Here, we introduce HTSinfer, a Python-based tool that infers metadata directly and solely from bulk RNA-sequencing data generated on Illumina platforms. HTSinfer leverages genome sequence information and diagnostic genes to rapidly and accurately infer the library source and library type, as well as the relative read orientation, 3' adapter sequence and read length statistics. HTSinfer is written in a modular manner, is published under a permissive free and open-source license, and encourages community contributions, enabling easy addition of new functionality, e.g. for the inference of additional metrics or the support of different experiment types or sequencing platforms.
HTSinfer is released under the Apache License 2.0. The latest code is available on GitHub at https://github.com/zavolanlab/htsinfer, and releases are published on Bioconda. A snapshot of the HTSinfer version described in this article has been deposited at Zenodo (doi: 10.5281/zenodo.13985958).
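To make one of the inferred quantities concrete, the following minimal sketch computes read length statistics from a FASTQ stream. It is an illustration only and does not reproduce HTSinfer's implementation or output format: the function names `read_lengths` and `summarize`, the `max_records` cap and the reported fields are assumptions made for this example, and the input is assumed to be well-formed four-line FASTQ records.

```python
import io
import statistics
from collections import Counter
from itertools import islice
from typing import Dict, Iterable


def read_lengths(lines: Iterable[str], max_records: int = 100_000) -> Counter:
    """Count read lengths in a FASTQ stream (hypothetical helper).

    Assumes strict 4-line records; inspects at most `max_records` reads.
    """
    counts: Counter = Counter()
    # The sequence is the second line of every four-line record.
    sequences = islice(lines, 1, None, 4)
    for seq in islice(sequences, max_records):
        counts[len(seq.strip())] += 1
    return counts


def summarize(counts: Counter) -> Dict[str, float]:
    """Summarize a read length histogram (hypothetical helper)."""
    lengths = list(counts.elements())
    if not lengths:
        raise ValueError("no reads found")
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "mode": counts.most_common(1)[0][0],
    }


if __name__ == "__main__":
    # Toy input with three reads of lengths 4, 6 and 6.
    fastq = (
        "@r1\nACGT\n+\nIIII\n"
        "@r2\nACGTAC\n+\nIIIIII\n"
        "@r3\nTTTTTT\n+\nIIIIII\n"
    )
    print(summarize(read_lengths(io.StringIO(fastq))))
```

Because `read_lengths` accepts any iterable of lines, the same code can be applied to an open file handle, including one returned by `gzip.open(path, "rt")` for compressed libraries. Capping the number of inspected records keeps the run time bounded on large libraries, at the cost of summarizing only the first reads in the file.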