A high-precision genome size estimator based on the histogram correction.

Authors

Liao Xiangyu, Zhu Wufei, Liu Chaoyun

Affiliations

Department of Oncology, Yichang Central People's Hospital, The First College of Clinical Medical Science, China Three Gorges University, Yichang, China.

Department of Endocrinology, Yichang Central People's Hospital, The First College of Clinical Medical Science, China Three Gorges University, Yichang, China.

Publication information

Front Genet. 2024 Aug 22;15:1451730. doi: 10.3389/fgene.2024.1451730. eCollection 2024.

DOI:10.3389/fgene.2024.1451730

PMID:39238787

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11374637/

Abstract

INTRODUCTION

Various characteristics of next-generation sequencing datasets can be extracted through k-mer-based analysis. Among them, genome size (GS) is relatively easy to estimate, yet achieving satisfactory accuracy remains a challenge, especially for heterozygous genomes.

METHODS

In this study, we introduce GSET (Genome Size Estimation Tool), a high-precision genome size estimator based on k-mer histogram correction.
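For readers unfamiliar with the input such estimators work from, the following is a minimal sketch of how a k-mer frequency histogram can be built from reads. It is not GSET's code: the function name `kmer_histogram` is hypothetical, and the sketch ignores reverse complements (canonical k-mers) and memory efficiency, both of which dedicated k-mer counters handle.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Map each multiplicity f to the number of distinct k-mers seen f times.

    Illustrative only: strands are not merged (no canonical k-mers) and
    all counts are held in memory.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases.
            if "N" not in kmer:
                counts[kmer] += 1
    # Histogram over the multiplicities of the distinct k-mers.
    return Counter(counts.values())


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "CGTACG", "ACGTAC"]
    print(sorted(kmer_histogram(toy_reads, k=3).items()))
```

On real data the histogram typically shows a spike of low-multiplicity k-mers caused by sequencing errors, followed by one or more peaks near the sequencing depth.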

RESULTS

We evaluated GSET on both simulated and real datasets. The experimental results show that it estimates genome size more precisely than state-of-the-art tools. Notably, GSET also performs satisfactorily on heterozygous datasets, where other tools struggle to produce usable results.

DISCUSSION

The processing model of GSET diverges from the data-fitting models used by similar tools. Instead, it is derived from empirical data and incorporates a correction term to mitigate the impact of sequencing errors on genome size estimation. GSET is freely available at https://github.com/Xingyu-Liao/GSET.
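The abstract does not specify GSET's correction term, so it cannot be reproduced here. As a point of comparison, the sketch below shows a common textbook heuristic for limiting the effect of sequencing errors: discard k-mers below the first valley of the histogram, then divide the remaining k-mer total by the depth of the main peak. The function name `estimate_genome_size` and the toy histogram are assumptions made for illustration.

```python
def estimate_genome_size(hist):
    """Estimate genome size from a k-mer histogram {multiplicity: distinct k-mers}.

    Generic valley cut-off heuristic, not GSET's empirical model.
    """
    freqs = sorted(hist)
    # The valley is the first multiplicity after which the histogram rises again;
    # k-mers below it are treated as sequencing errors.
    valley = None
    for prev, cur in zip(freqs, freqs[1:]):
        if hist[cur] > hist[prev]:
            valley = prev
            break
    if valley is None:
        raise ValueError("histogram has no valley; cannot separate error k-mers")
    trusted = [f for f in freqs if f >= valley]
    # The main peak approximates the k-mer sequencing depth.
    peak = max(trusted, key=lambda f: hist[f])
    total_kmers = sum(f * hist[f] for f in trusted)
    return total_kmers / peak


if __name__ == "__main__":
    toy_hist = {1: 5000, 2: 800, 3: 100, 4: 50,
                8: 200, 9: 600, 10: 1000, 11: 600, 12: 200}
    print(estimate_genome_size(toy_hist))
```

A hard cut-off like this also discards genuine low-coverage k-mers and does not model heterozygous peaks, which is one reason tools in this area use fitted or corrected models instead.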



