SourceFinder: a Machine-Learning-Based Tool for Identification of Chromosomal, Plasmid, and Bacteriophage Sequences from Assemblies.

Affiliations

National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark.

Consortium for Advanced Science and Engineering, University of Chicago, Chicago, Illinois, USA.

Publication information

Microbiol Spectr. 2022 Dec 21;10(6):e0264122. doi: 10.1128/spectrum.02641-22. Epub 2022 Nov 15.

Abstract

High-throughput genome sequencing technologies enable the investigation of complex genetic interactions, including the horizontal gene transfer of plasmids and bacteriophages. However, identifying these elements in assembled reads remains challenging because of genome sequence plasticity and the difficulty of assembling complete sequences. In this study, we developed a random forest classifier to identify whether sequences originated from bacterial chromosomes, plasmids, or bacteriophages. The classifier was trained on a diverse collection of 23,211 chromosomal, plasmid, and bacteriophage sequences from hundreds of bacterial species. To adapt the classifier to incomplete sequences, each complete sequence was subsampled into 5,000-nucleotide fragments and further subdivided into k-mers. This three-class classifier succeeded in identifying chromosomes, plasmids, and bacteriophages from the k-mer distributions of complete and partial genome sequences, including simulated metagenomic scaffolds, with a minimum area under the receiver operating characteristic curve (AUC) of 0.939. The classifier, implemented as SourceFinder, is available as an online web service to help the community predict the chromosomal, plasmid, and bacteriophage sources of assembled bacterial sequence data (https://cge.food.dtu.dk/services/SourceFinder/).

Extra-chromosomal genes encoding antimicrobial resistance, metal resistance, and virulence provide selective advantages for bacterial survival under stress conditions and pose serious threats to human and animal health. These accessory genes can impact the composition of microbiomes by providing selective advantages to their hosts. Accurately identifying extra-chromosomal elements in genome sequence data is critical for understanding gene dissemination trajectories and taking preventative measures. Therefore, in this study, we developed a random forest classifier for identifying the source of bacterial chromosomal, plasmid, and bacteriophage sequences.
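
The fragment-and-k-mer step described in the abstract can be illustrated with a minimal Python sketch. It is not the SourceFinder implementation. The abstract gives the 5,000-nucleotide fragment length, but the value of k, the use of non-overlapping windows, and the dropping of a short trailing piece are all assumptions made here for illustration. The function names `fragments`, `kmer_frequencies`, and `featurize` are hypothetical.

```python
from collections import Counter
from itertools import product

FRAGMENT_LEN = 5000  # fragment length stated in the abstract
K = 4                # assumed value; the abstract does not state k


def fragments(seq, size=FRAGMENT_LEN):
    """Yield consecutive, non-overlapping fragments of length `size`.

    Assumption: a trailing piece shorter than `size` is dropped.
    """
    for start in range(0, len(seq) - size + 1, size):
        yield seq[start:start + size]


def kmer_frequencies(fragment, k=K):
    """Return relative frequencies of all 4**k DNA k-mers in a fixed order."""
    fragment = fragment.upper()
    counts = Counter(fragment[i:i + k] for i in range(len(fragment) - k + 1))
    alphabet = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made of A, C, G, T are counted; ambiguous bases are ignored.
    total = sum(counts[kmer] for kmer in alphabet)
    if total == 0:
        return [0.0] * len(alphabet)
    return [counts[kmer] / total for kmer in alphabet]


def featurize(seq, size=FRAGMENT_LEN, k=K):
    """Turn one sequence into a list of per-fragment k-mer frequency vectors."""
    return [kmer_frequencies(frag, k) for frag in fragments(seq, size)]
```

Each vector returned by `featurize` could then serve as one training row, labeled chromosome, plasmid, or bacteriophage, for a random forest such as scikit-learn's `RandomForestClassifier`. Whether SourceFinder builds its features exactly this way is not stated in the abstract.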

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2a86/9769690/798e73d85c51/spectrum.02641-22-f001.jpg