Measuring reproducibility of virus metagenomics analyses using bootstrap samples from FASTQ-files.

Affiliations

Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Hannover D-30559, Germany.

Institute for Terrestrial and Aquatic Wildlife Research, University of Veterinary Medicine Hannover, Hannover D-30559, Germany.

Publication information

Bioinformatics. 2021 May 23;37(8):1068-1075. doi: 10.1093/bioinformatics/btaa926.

DOI:10.1093/bioinformatics/btaa926
PMID:33135067
Abstract

MOTIVATION

High-throughput sequencing data can be affected by different technical errors, e.g. from probe preparation or false base calling. As a consequence, reproducibility of experiments can be weakened. In virus metagenomics, technical errors can result in falsely identified viruses in samples from infected hosts. We present a new resampling approach based on bootstrap sampling of sequencing reads from FASTQ-files in order to generate artificial replicates of sequencing runs which can help to judge the robustness of an analysis. In addition, we evaluate a mixture model on the distribution of read counts per virus to identify potentially false positive findings.
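The resampling idea described above can be illustrated with a minimal sketch. It is not the RESEQ implementation: it assumes a plain, uncompressed FASTQ stream with exactly four lines per record, and the helper names `read_fastq`, `bootstrap_reads`, and `write_fastq` are hypothetical. One bootstrap replicate is obtained by drawing as many reads as the original file contains, with replacement.

```python
import io
import random


def read_fastq(handle):
    """Yield (header, sequence, separator, quality) tuples from a FASTQ stream.

    Assumes four lines per record (no wrapped sequences).
    """
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # skip stray blank lines
            continue
        seq = handle.readline()
        sep = handle.readline()
        qual = handle.readline()
        yield tuple(line.rstrip("\n") for line in (header, seq, sep, qual))


def bootstrap_reads(records, rng):
    """Draw len(records) reads with replacement: one artificial replicate."""
    n = len(records)
    return [records[rng.randrange(n)] for _ in range(n)]


def write_fastq(records, handle):
    """Write records back out in four-line FASTQ layout."""
    for rec in records:
        handle.write("\n".join(rec) + "\n")


# Tiny in-memory example; the reads are made up for illustration only.
example = (
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nGGCC\n+\nHHHH\n"
    "@r3\nTTAA\n+\nFFFF\n"
)
records = list(read_fastq(io.StringIO(example)))
rng = random.Random(42)  # fixed seed so the replicate is repeatable
replicate = bootstrap_reads(records, rng)

out = io.StringIO()
write_fastq(replicate, out)
```

Because sampling is with replacement, some reads appear several times in a replicate and others not at all. Running the same downstream virus-detection analysis on many such replicates gives a spread of read counts per virus. Holding all records in memory is a simplification that would not scale to large sequencing runs.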

RESULTS

Evaluating our approach on an artificially generated dataset with known viral sequence content shows that viruses are, in general, detected with high reproducibility in sequencing data, i.e. the original and mean bootstrap read counts were highly correlated. However, the bootstrap read counts can also indicate reduced or increased evidence for the presence of a virus in the biological sample. We also found that the mixture model fits the read counts well and, furthermore, that it provides higher accuracy on the original or on the bootstrap read counts than on the difference between both. The usefulness of our methods is further demonstrated on two freely available real-world datasets from harbor seals.
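The comparison of original and mean bootstrap read counts can be sketched as follows. This is only an illustration under stated assumptions: per-virus read counts are assumed to be already available as dictionaries, the Pearson coefficient is used as the correlation measure (the abstract does not say which measure the authors used), and the virus names and counts are invented toy values, not results from the paper.

```python
import math


def mean_bootstrap_counts(replicate_counts, viruses):
    """Average the per-virus read count over all bootstrap replicates."""
    n = len(replicate_counts)
    return {
        v: sum(rep.get(v, 0) for rep in replicate_counts) / n
        for v in viruses
    }


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts for illustration only.
original = {"virus_A": 120, "virus_B": 15, "virus_C": 3}
replicates = [
    {"virus_A": 118, "virus_B": 17, "virus_C": 2},
    {"virus_A": 125, "virus_B": 12},              # virus_C not found here
    {"virus_A": 121, "virus_B": 16, "virus_C": 4},
]

viruses = sorted(original)
means = mean_bootstrap_counts(replicates, viruses)
r = pearson([original[v] for v in viruses], [means[v] for v in viruses])
```

A virus whose mean bootstrap count falls well below its original count, or that disappears in some replicates, has weaker support than the original run alone suggests.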

AVAILABILITY AND IMPLEMENTATION

We provide a Python tool, called RESEQ, available from https://github.com/babaksaremi/RESEQ, that allows efficient generation of bootstrap reads from an original FASTQ-file.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
