Disentangling cobionts and contamination in long-read genomic data using sequence composition.

Affiliation

Tree of Life, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK.

Publication information

G3 (Bethesda). 2024 Nov 6;14(11). doi: 10.1093/g3journal/jkae187.

DOI:10.1093/g3journal/jkae187
PMID:39148415
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11540323/
Abstract

The recent acceleration in genome sequencing targeting previously unexplored parts of the tree of life presents computational challenges. Samples collected from the wild often contain sequences from several organisms, including the target, its cobionts, and contaminants. Effective methods are therefore needed to separate sequences. Though advances in sequencing technology make this task easier, it remains difficult to taxonomically assign sequences from eukaryotic taxa that are not well represented in databases. Therefore, reference-based methods alone are insufficient. Here, I examine how we can take advantage of differences in sequence composition between organisms to identify symbionts, parasites, and contaminants in samples, with minimal reliance on reference data. To this end, I explore data from the Darwin Tree of Life project, including hundreds of high-quality HiFi read sets from insects. Visualizing two-dimensional representations of read tetranucleotide composition learned by a variational autoencoder can reveal distinct components of a sample. Annotating the embeddings with additional information, such as coding density, estimated coverage, or taxonomic labels allows rapid assessment of the contents of a dataset. The approach scales to millions of sequences, making it possible to explore unassembled read sets, even for large genomes. Combined with interactive visualization tools, it allows a large fraction of cobionts reported by reference-based screening to be identified. Crucially, it also facilitates retrieving genomes for which suitable reference data are absent.
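The tetranucleotide composition mentioned in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the function names are hypothetical, and collapsing each 4-mer with its reverse complement (which gives 136 features instead of 256) is an assumption made here so that a read and its opposite strand receive the same profile. Vectors like these would be the input to a dimensionality-reduction model such as the variational autoencoder the abstract describes.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


# All canonical 4-mers (assumption: strands are collapsed, giving 136 features).
CANONICAL_4MERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})
INDEX = {kmer: i for i, kmer in enumerate(CANONICAL_4MERS)}


def tetranucleotide_profile(read: str) -> list[float]:
    """Return normalized canonical 4-mer frequencies for one read.

    4-mers containing characters other than A, C, G, T (for example N)
    are skipped. A read with no valid 4-mers yields an all-zero vector.
    """
    counts = [0] * len(CANONICAL_4MERS)
    seq = read.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(canonical(seq[i:i + 4]))
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    profile = tetranucleotide_profile("AAAATTTT")
    for kmer, freq in zip(CANONICAL_4MERS, profile):
        if freq > 0:
            print(kmer, round(freq, 2))
```

Because the profile depends only on composition, it needs no reference database, which is the property the abstract relies on for taxa that are poorly represented in reference collections.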

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb0b/11540323/d01d4acc8cd1/jkae187f6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb0b/11540323/e485409c6802/jkae187f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb0b/11540323/a8ac4ffb5fb2/jkae187f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb0b/11540323/292246cb63ac/jkae187f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb0b/11540323/97cd7a440105/jkae187f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb0b/11540323/d893d4f1e4c5/jkae187f5.jpg
