CoCoPyE: feature engineering for learning and prediction of genome quality indices.

Affiliation

Department of Applied Bioinformatics, Institute of Microbiology and Genetics, University of Goettingen, Goldschmidtstr. 1, 37077 Goettingen, Germany.

Publication information

Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae079.

Abstract

BACKGROUND

The exploration of the microbial world has been greatly advanced by the reconstruction of genomes from metagenomic sequence data. However, the rapidly increasing number of metagenome-assembled genomes has also resulted in a wide variation in data quality. It is therefore essential to quantify the achieved completeness and possible contamination of a reconstructed genome before it is used in subsequent analyses. The classical approach to estimating quality indices relies solely on a relatively small number of universal single-copy genes. Recent tools try to extend the genomic coverage of the estimates to increase accuracy.
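To make the classical marker-based idea concrete, the following is a minimal sketch, not CoCoPyE's method and not the algorithm of any specific tool. It assumes that completeness can be approximated as the fraction of expected single-copy markers found at least once, and contamination as the number of surplus marker copies divided by the number of expected markers. The function name `marker_based_quality` and the gene labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def marker_based_quality(observed_markers, expected_markers):
    """Estimate (completeness, contamination) from single-copy marker hits.

    Simplified illustration: both values are fractions of the expected
    marker set, not the output of any published tool.
    """
    expected = set(expected_markers)
    if not expected:
        raise ValueError("expected_markers must not be empty")

    # Count only hits that belong to the expected marker set.
    counts = Counter(m for m in observed_markers if m in expected)

    # A marker contributes to completeness if it occurs at least once.
    present = sum(1 for m in expected if counts[m] >= 1)

    # Every copy beyond the first is treated as a sign of contamination.
    surplus = sum(counts[m] - 1 for m in expected if counts[m] > 1)

    return present / len(expected), surplus / len(expected)


# Hypothetical example: four expected markers, one missing, one duplicated.
expected = ["rpoB", "gyrB", "recA", "rpsC"]
observed = ["rpoB", "gyrB", "gyrB", "recA"]
completeness, contamination = marker_based_quality(observed, expected)
print(completeness, contamination)  # 0.75 0.25
```

Because such estimates rest on only a small set of genes, a single missing or duplicated marker shifts the result noticeably, which is the limitation that motivates extending the genomic coverage of the estimates.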

RESULTS

We developed CoCoPyE, a fast tool based on a novel 2-stage feature extraction and transformation scheme. First, it identifies genomic markers and then refines the marker-based estimates with a machine learning approach. In our simulation studies, CoCoPyE showed a more accurate prediction of quality indices than the existing tools. While the CoCoPyE web server offers an easy way to try out the tool, the freely available Python implementation enables integration into existing genome reconstruction pipelines.

CONCLUSIONS

CoCoPyE provides a new approach to assess the quality of genome data. It complements and improves existing tools and may help researchers to better distinguish between low-quality draft and high-quality genome assemblies in metagenome sequencing projects.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ae36/11503480/a45efc2b6f49/giae079fig1.jpg
