Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT.

Affiliations

Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, California, United States of America.

Bioinformatics & Systems Biology Graduate Program, University of California, San Diego, La Jolla, California, United States of America.

Publication information

PLoS Comput Biol. 2021 Nov 15;17(11):e1009449. doi: 10.1371/journal.pcbi.1009449. eCollection 2021 Nov.

Abstract

The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.
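
The ill-conditioning mentioned in the abstract can be illustrated with a minimal sketch. It assumes a simplified, error-free sampling model (an assumption made here for illustration, not RESPECT's implementation): a k-mer that occurs j times in the genome is observed in the reads a Poisson-distributed number of times with mean j·λ, where λ is the k-mer coverage. Under that assumption, recovering the repeat spectrum from the observed k-mer histogram amounts to inverting the matrix built below. The names `poisson_pmf` and `mixing_matrix`, and the chosen values of λ and n, are hypothetical.

```python
import math

import numpy as np


def poisson_pmf(h, mean):
    """Probability of observing exactly h events under a Poisson law with the given mean."""
    return math.exp(-mean) * mean ** h / math.factorial(h)


def mixing_matrix(lam, n):
    """Build the n x n matrix P with P[h-1, j-1] = Poisson(h; j * lam).

    Assumed model (illustrative only): a k-mer with genomic multiplicity j
    is seen h times in the reads with this probability, for h, j = 1..n.
    The expected observed histogram is then P @ r, where r is the repeat
    spectrum (r[j-1] = number of distinct k-mers occurring j times).
    """
    return np.array(
        [[poisson_pmf(h, j * lam) for j in range(1, n + 1)]
         for h in range(1, n + 1)]
    )


# Print the 2-norm condition number of P for a few hypothetical coverages.
for lam in (0.5, 1.0, 2.0):
    P = mixing_matrix(lam, 10)
    print(f"lambda={lam}: cond(P)={np.linalg.cond(P):.3e}")
```

A large condition number means that small sampling noise in the observed histogram can turn into large errors in the recovered spectrum when the system is solved without constraints, which is the kind of limitation the abstract says the constrained optimization is designed to get around.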
