Tree of Life, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK.
G3 (Bethesda). 2024 Nov 6;14(11). doi: 10.1093/g3journal/jkae187.
The recent acceleration in genome sequencing targeting previously unexplored parts of the tree of life presents computational challenges. Samples collected from the wild often contain sequences from several organisms, including the target, its cobionts, and contaminants. Effective methods are therefore needed to separate sequences. Though advances in sequencing technology make this task easier, it remains difficult to taxonomically assign sequences from eukaryotic taxa that are not well represented in databases. Therefore, reference-based methods alone are insufficient. Here, I examine how we can take advantage of differences in sequence composition between organisms to identify symbionts, parasites, and contaminants in samples, with minimal reliance on reference data. To this end, I explore data from the Darwin Tree of Life project, including hundreds of high-quality HiFi read sets from insects. Visualizing two-dimensional representations of read tetranucleotide composition learned by a variational autoencoder can reveal distinct components of a sample. Annotating the embeddings with additional information, such as coding density, estimated coverage, or taxonomic labels, allows rapid assessment of the contents of a dataset. The approach scales to millions of sequences, making it possible to explore unassembled read sets, even for large genomes. Combined with interactive visualization tools, it allows a large fraction of cobionts reported by reference-based screening to be identified. Crucially, it also facilitates retrieving genomes for which suitable reference data are absent.
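To make the notion of read tetranucleotide composition concrete, the following is a minimal Python sketch of how a per-read composition vector could be computed before being passed to a model such as a variational autoencoder. It is an illustration under stated assumptions, not the method as implemented in the paper: the function names are hypothetical, and collapsing each tetramer with its reverse complement (which gives 136 canonical tetramers instead of 256) is an assumed but common choice for reads of unknown strand. The autoencoder and the visualization steps are not shown.

```python
from itertools import product

K = 4  # tetranucleotides
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


# Assumption: strand-independent features, so each tetramer is merged with
# its reverse complement. This leaves 136 canonical tetramers.
CANONICAL_KMERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=K)})
INDEX = {kmer: i for i, kmer in enumerate(CANONICAL_KMERS)}


def tetranucleotide_profile(read):
    """Return canonical tetranucleotide frequencies for one read.

    Windows containing characters other than A, C, G, T (for example N)
    are skipped. A read with no valid window yields an all-zero vector.
    """
    read = read.upper()
    counts = [0] * len(CANONICAL_KMERS)
    for i in range(len(read) - K + 1):
        idx = INDEX.get(canonical(read[i:i + K]))
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    profile = tetranucleotide_profile("ACGTACGTTTGACCA")
    nonzero = {CANONICAL_KMERS[i]: round(f, 3) for i, f in enumerate(profile) if f > 0}
    print(len(profile), nonzero)
```

Because the counts are normalized by the number of valid windows, reads of different lengths yield comparable vectors, and a read and its reverse complement map to the same vector. Stacking one such vector per read gives a matrix with 136 columns that could serve as input to a dimensionality reduction step.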