Prevalence and Implications of Contamination in Public Genomic Resources: A Case Study of 43 Reference Arthropod Assemblies.

Author information

UMR 5554, Institut des Sciences de l'Evolution; CNRS, University of Montpellier, IRD, EPHE, Montpellier, France

Publication information

G3 (Bethesda). 2020 Feb 6;10(2):721-730. doi: 10.1534/g3.119.400758.

Abstract

Thanks to huge advances in sequencing technologies, genomic resources are increasingly being generated and shared by the scientific community. The quality of such public resources is therefore of critical importance. Errors due to contamination are particularly worrying; they are widespread, propagate across databases, and can compromise downstream analyses, especially the detection of horizontally-transferred sequences. However, we still lack consistent and comprehensive assessments of contamination prevalence in public genomic data. Here we applied a standardized procedure for foreign sequence annotation to 43 published arthropod genomes from the widely used Ensembl Metazoa database. This method combines information on sequence similarity and synteny to identify contaminant and putative horizontally-transferred sequences in any genome assembly, provided that an adequate reference database is available. We uncovered considerable heterogeneity in quality among arthropod assemblies, some being devoid of contaminant sequences, whereas others included hundreds of contaminant genes. Contaminants far outnumbered horizontally-transferred genes and were a major confounder of their detection, quantification and analysis. We strongly recommend that automated standardized decontamination procedures be systematically embedded into the submission process to genomic databases.
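
The abstract says only that the method combines sequence similarity with synteny; it gives no algorithmic detail. The following is a minimal illustrative sketch of how such a combination might work, not the authors' procedure. It assumes each gene already carries a taxon label from a similarity search, and it treats "shares a contig with host-like genes" as a crude stand-in for synteny. The names `Gene`, `best_hit_taxon`, and `classify_foreign_genes`, and the default host label, are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    contig: str
    best_hit_taxon: str  # hypothetical label taken from a similarity search


def classify_foreign_genes(genes, host_taxon="Arthropoda"):
    """Toy heuristic (an assumption, not the published method).

    A gene whose best hit falls outside the host taxon is called a
    putative horizontal transfer if its contig also carries host-like
    genes, and a contaminant if every gene on its contig is foreign.
    """
    # Group genes by the contig they sit on.
    by_contig = defaultdict(list)
    for gene in genes:
        by_contig[gene.contig].append(gene)

    calls = {}
    for contig_genes in by_contig.values():
        # Crude synteny proxy: does any neighbour look like a host gene?
        has_host = any(g.best_hit_taxon == host_taxon for g in contig_genes)
        for g in contig_genes:
            if g.best_hit_taxon == host_taxon:
                continue  # host-like gene, nothing to classify
            calls[g.gene_id] = "putative_HGT" if has_host else "contaminant"
    return calls


example = [
    Gene("g1", "contig1", "Arthropoda"),
    Gene("g2", "contig1", "Bacteria"),  # foreign gene next to a host gene
    Gene("g3", "contig2", "Bacteria"),  # contig with only foreign genes
    Gene("g4", "contig2", "Bacteria"),
]
print(classify_foreign_genes(example))
```

In this toy input, `g2` is flagged as a putative transfer because it shares a contig with a host-like gene, while `g3` and `g4` sit on an entirely foreign contig and are flagged as contaminants. A real pipeline would need to handle single-gene contigs, weak or ambiguous hits, and gaps in the reference database, none of which this sketch attempts.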

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/56b0/7003083/cae6015eb992/721f1.jpg
