Department of Computational Biology, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University in Poznan, Uniwersytetu Poznanskiego 6, 61-614 Poznan, Poland.
Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad067. Epub 2023 Aug 17.
One of the most effective and useful ways to explore the content of biological databases is to search them with nucleotide or protein sequences as queries. However, especially for nucleic acids, the large volume of data generated by next-generation sequencing (NGS) technologies often makes this approach unavailable. The hierarchical organization of NGS records is designed primarily for browsing or for text-based searches of metadata-related keywords, which limits the efficiency of database exploration.
We developed an automated pipeline that incorporates well-established NGS data-processing tools and procedures to allow easy and effective sampling of NCBI SRA database records. Given a file with query nucleotide sequences, our tool estimates the matching content of SRA accessions by probing only a user-defined fraction of each record's sequences. Based on the selected parameters, it can then perform a full mapping experiment on the records that meet the required criteria. The pipeline is designed to be easy to operate: it offers a fully automatic setup procedure and relies on a fixed set of tested supporting tools. The modular design and the implemented usage modes allow a user to scale analyses up to complex computational infrastructure.
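The sampling idea described above can be illustrated with a minimal Python sketch. It is not the pipeline's implementation: the function names, the in-memory list of reads, and the exact substring test (a stand-in for real read mapping with an aligner) are all assumptions made for illustration only.

```python
import random


def estimate_match_fraction(reads, query, sample_fraction, seed=0):
    """Estimate the share of reads matching `query` from a random subsample.

    Hypothetical helper: exact substring search stands in for read mapping.
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    if not reads:
        return 0.0
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    # Probe only the requested fraction of the record, but at least one read.
    n_sample = max(1, round(len(reads) * sample_fraction))
    sample = rng.sample(reads, n_sample)
    hits = sum(1 for read in sample if query in read)
    return hits / n_sample


def should_run_full_mapping(estimate, min_fraction):
    """Decide whether a record is worth a full mapping run (assumed criterion)."""
    return estimate >= min_fraction


if __name__ == "__main__":
    # Toy record: half of the reads contain the query sequence.
    reads = ["ACGTACGT"] * 50 + ["TTTTTTTT"] * 50
    estimate = estimate_match_fraction(reads, "CGTA", sample_fraction=0.2)
    print(f"estimated matching fraction: {estimate:.2f}")
    print("run full mapping:", should_run_full_mapping(estimate, min_fraction=0.1))
```

The estimate from a small sample is noisy, so in practice the sampled fraction and the acceptance threshold would be chosen together; the values used here are arbitrary.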
We present an easy-to-operate, automated tool that expands the ways in which a user can access and explore the information contained in records deposited in the NCBI SRA database.