Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT.

Affiliations

Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, California, United States of America.

Bioinformatics & Systems Biology Graduate Program, University of California, San Diego, La Jolla, California, United States of America.

Publication information

PLoS Comput Biol. 2021 Nov 15;17(11):e1009449. doi: 10.1371/journal.pcbi.1009449. eCollection 2021 Nov.

DOI:10.1371/journal.pcbi.1009449

PMID:34780468

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8629397/

Abstract

The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.
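
The ill-conditioning mentioned in the abstract can be illustrated with a small numerical sketch. It assumes a simplified, error-free model in which a k-mer with j genomic copies is observed a Poisson-distributed number of times with mean j·λ, where λ is the k-mer coverage. The spectrum values, the function names, and the coverage values are hypothetical, and this is not RESPECT's actual formulation or code.

```python
import math

import numpy as np


def poisson_pmf(count, mean):
    """Poisson probability of seeing `count` events given `mean` (log-space for stability)."""
    return math.exp(count * math.log(mean) - mean - math.lgamma(count + 1))


def mixing_matrix(lam, max_copies, max_count):
    """P[i-1, j-1] = Pr(a k-mer with j genomic copies is observed i times in the reads)."""
    return np.array([
        [poisson_pmf(i, j * lam) for j in range(1, max_copies + 1)]
        for i in range(1, max_count + 1)
    ])


# Hypothetical repeat spectrum: r[j-1] = number of distinct k-mers with j genomic copies.
r = np.array([900_000.0, 40_000.0, 5_000.0, 1_000.0, 200.0])
copies = np.arange(1, len(r) + 1)
genome_len = float(copies @ r)  # sum_j j * r_j, ignoring the k-1 end correction

# Compare a skim-like k-mer coverage with a deeper one (both values are illustrative).
cond_low = np.linalg.cond(mixing_matrix(0.5, len(r), 80))
cond_high = np.linalg.cond(mixing_matrix(5.0, len(r), 80))

print(f"genome length implied by r: {genome_len:.0f}")
print(f"cond(P) at lambda=0.5: {cond_low:.3g}")
print(f"cond(P) at lambda=5.0: {cond_high:.3g}")
```

Under this toy model, the expected observed count histogram is `P @ r`. At low coverage the columns of `P` overlap heavily, so quite different spectra can produce nearly the same histogram, which is one way to see why additional constraints on the spectrum are useful when inverting the system.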

Similar articles

Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT.

PLoS Comput Biol. 2021 Nov 15;17(11):e1009449. doi: 10.1371/journal.pcbi.1009449. eCollection 2021 Nov.

A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis.

Hum Genomics. 2016 Jul 25;10 Suppl 2(Suppl 2):20. doi: 10.1186/s40246-016-0068-0.

RepLong: de novo repeat identification using long read sequencing data.

Bioinformatics. 2018 Apr 1;34(7):1099-1107. doi: 10.1093/bioinformatics/btx717.

Phylogenetic signal in the eukaryotic tree of life.

Science. 2008 Jul 4;321(5885):121-3. doi: 10.1126/science.1154449.

Repeat-aware modeling and correction of short read errors.

BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S52. doi: 10.1186/1471-2105-12-S1-S52.

Estimating the repeat structure and length of DNA sequences using L-tuples.

Genome Res. 2003 Aug;13(8):1916-22. doi: 10.1101/gr.1251803.

Testing Efficacy of Assembly-Free and Alignment-Free Methods for Species Identification Using Genome Skims, with Patellogastropoda as a Test Case.

Genes (Basel). 2022 Jul 2;13(7):1192. doi: 10.3390/genes13071192.

TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees.

BMC Genomics. 2018 May 8;19(Suppl 5):272. doi: 10.1186/s12864-018-4620-2.

Set-Min Sketch: A Probabilistic Map for Power-Law Distributions with Application to k-Mer Annotation.

J Comput Biol. 2022 Feb;29(2):140-154. doi: 10.1089/cmb.2021.0429. Epub 2022 Jan 18.

A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes.

BMC Genomics. 2008 Oct 31;9:517. doi: 10.1186/1471-2164-9-517.

Cited by

Deciphering the genomic characters of Ptygonotus chinghaiensis, a high-altitude grasshopper endemic to the Qinghai-Tibet Plateau, using low-coverage short-read sequencing.

BMC Genomics. 2025 Jul 19;26(1):675. doi: 10.1186/s12864-025-11824-2.

A first nuclear and mitochondrial genomic portrait of Robinson Crusoe's (Juan Fernández Island) spiny lobster Jasus frontalis (Crustacea: Decapoda: Achelata).

BMC Genomics. 2025 Jul 1;26(1):623. doi: 10.1186/s12864-025-11794-5.

Insight Into the Nuclear and Mitochondrial Genome of the Caribbean King Crab (Crustacea: Brachyura: Mithracidae) to Support Fisheries Management and Conservation Initiatives.

Ecol Evol. 2025 Jun 29;15(7):e71619. doi: 10.1002/ece3.71619. eCollection 2025 Jul.

Specimen Identification Through Multilocus Species Tree Constructed From Single-Copy Orthologs (SCOs): A Case Study in Subgenus .

Ecol Evol. 2025 Apr 24;15(4):e71323. doi: 10.1002/ece3.71323. eCollection 2025 Apr.

k-mer approaches for biodiversity genomics.

Genome Res. 2025 Feb 14;35(2):219-230. doi: 10.1101/gr.279452.124.

Reference genome of Calochortus tolmiei Hook. & Arn. (Liliaceae), a cat's ear mariposa lily.

G3 (Bethesda). 2025 Mar 18;15(3). doi: 10.1093/g3journal/jkaf008.

A Snakemake Toolkit for the Batch Assembly, Annotation and Phylogenetic Analysis of Mitochondrial Genomes and Ribosomal Genes From Genome Skims of Museum Collections.

Mol Ecol Resour. 2025 Jan;25(1):e14036. doi: 10.1111/1755-0998.14036. Epub 2024 Oct 28.

A nuclear genome assembly of an extinct flightless bird, the little bush moa.

Sci Adv. 2024 May 24;10(21):eadj6823. doi: 10.1126/sciadv.adj6823. Epub 2024 May 23.

Analyses of Nuclear Reads Obtained Using Genome Skimming.

Methods Mol Biol. 2024;2744:247-265. doi: 10.1007/978-1-0716-3581-0_16.

Insights into the nuclear and mitochondrial genome of the Lemon shark using low-coverage sequencing: Genome size, repetitive elements, mitochondrial genome, and phylogenetic placement.

Gene. 2024 Feb 5;894. doi: 10.1016/j.gene.2023.147939. Epub 2023 Oct 29.

References

CONSULT: accurate contamination removal using locality-sensitive hashing.

NAR Genom Bioinform. 2021 Aug 5;3(3):lqab071. doi: 10.1093/nargab/lqab071. eCollection 2021 Sep.

Beyond DNA barcoding: The unrealized potential of genome skim data in sample identification.

Mol Ecol. 2020 Jul;29(14):2521-2534. doi: 10.1111/mec.15507. Epub 2020 Jun 29.

SciPy 1.0: fundamental algorithms for scientific computing in Python.

Nat Methods. 2020 Mar;17(3):261-272. doi: 10.1038/s41592-019-0686-2. Epub 2020 Feb 3.

The impact of contaminants on the accuracy of genome skimming and the effectiveness of exclusion read filters.

Mol Ecol Resour. 2020 May;20(3). doi: 10.1111/1755-0998.13135. Epub 2020 Feb 4.

Improved metagenomic analysis with Kraken 2.

Genome Biol. 2019 Nov 28;20(1):257. doi: 10.1186/s13059-019-1891-0.

One thousand plant transcriptomes and the phylogenomics of green plants.

Nature. 2019 Oct;574(7780):679-685. doi: 10.1038/s41586-019-1693-2. Epub 2019 Oct 23.

Decline of the North American avifauna.

Science. 2019 Oct 4;366(6461):120-124. doi: 10.1126/science.aaw1313. Epub 2019 Sep 19.

APPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments.

Syst Biol. 2020 May 1;69(3):566-578. doi: 10.1093/sysbio/syz063.

Skmer: assembly-free and alignment-free sample identification using genome skims.

Genome Biol. 2019 Feb 13;20(1):34. doi: 10.1186/s13059-019-1632-4.

Earth BioGenome Project: Sequencing life for the future of life.

Proc Natl Acad Sci U S A. 2018 Apr 24;115(17):4325-4333. doi: 10.1073/pnas.1720115115.
