Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China.
Center for Quantitative Biology, Peking University, No. 5 Yiheyuan Road Haidian District, Beijing 100871, PR China.
Microb Genom. 2020 Nov;6(11). doi: 10.1099/mgen.0.000459. Epub 2020 Oct 19.
Plasmids are key elements of horizontal gene transfer in microbial communities. Recently, many experimental and computational methods have been developed to obtain the plasmidomes of microbial communities. Distinguishing transmissible plasmid sequences, which derive from conjugative or at least mobilizable plasmids, from non-transmissible plasmid sequences in the plasmidome is essential for understanding plasmid diversity and how plasmids regulate the microbial community. However, because DNA sequences in the plasmidome are highly fragmented, effective identification methods are lacking. In this work, we used information entropy to assess the randomness of synonymous codon usage across 4424 plasmid genomes. The results showed that, for all amino acids, the choice of synonymous codon is more random in conjugative and mobilizable plasmids than in non-transmissible plasmids, indicating that transmissible plasmids carry sequence signatures distinct from those of non-transmissible plasmids. Inspired by this observation, we developed a novel algorithm named PlasTrans. PlasTrans takes the triplet code sequences and base sequences of plasmid DNA fragments as input and uses a convolutional neural network to extract more complex signatures from the plasmid sequences and identify conjugative and mobilizable DNA fragments. Tests showed that PlasTrans achieved an AUC of 84-91%, even when the fragments contained only hundreds of base pairs. To the best of our knowledge, this is the first quantitative analysis of the difference in sequence signatures between transmissible and non-transmissible plasmids, and PlasTrans is the first tool to perform transferability annotation for DNA fragments in the plasmidome.
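As an illustration of the entropy measure mentioned above, the following minimal sketch computes, for each amino acid, the Shannon entropy of its synonymous codon usage in a coding sequence. The abstract does not specify how the authors pooled genes or normalized the values, so the function name `synonymous_codon_entropy`, the per-sequence pooling, and the use of bits as the unit are assumptions made for illustration only. Higher entropy means the synonymous codons of that amino acid are used more evenly, that is, more randomly.

```python
import math
from collections import Counter

# Standard genetic code, laid out in the conventional T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def synonymous_codon_entropy(cds):
    """Return {amino_acid: entropy in bits} for one in-frame coding sequence.

    Hypothetical helper: the pooling and units are assumptions, not the
    published procedure.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codon_counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group codon counts by the amino acid they encode.
    counts_by_aa = {}
    for codon, n in codon_counts.items():
        aa = CODON_TABLE.get(codon)  # None for codons with ambiguous bases
        if aa is None or aa == "*":  # skip unknown codons and stop codons
            continue
        counts_by_aa.setdefault(aa, []).append(n)

    # Shannon entropy of the codon distribution within each amino acid.
    entropy = {}
    for aa, counts in counts_by_aa.items():
        total = sum(counts)
        entropy[aa] = sum(-(n / total) * math.log2(n / total) for n in counts)
    return entropy
```

For example, a sequence that uses all four alanine codons equally often yields the maximum of 2 bits for alanine, whereas a sequence that uses a single alanine codon yields 0 bits.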
We expect that PlasTrans will be a useful tool for researchers who analyse the properties of novel plasmids in the microbial community and horizontal gene transfer, especially the spread of resistance genes and virulence factors associated with plasmids. PlasTrans is freely available via https://github.com/zhenchengfang/PlasTrans.
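The two inputs described in the abstract, a base sequence and a triplet code sequence, can be pictured with the sketch below. The abstract does not state how PlasTrans encodes them, so the one-hot scheme, the A, C, G, T ordering, the handling of ambiguous bases, and the function names `one_hot_bases` and `one_hot_codons` are all assumptions for illustration and are not taken from the PlasTrans source code.

```python
# Assumed alphabet ordering; the tool's actual ordering is not given.
BASES = "ACGT"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
BASE_INDEX = {base: i for i, base in enumerate(BASES)}
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def one_hot_bases(seq):
    """Encode a DNA fragment as a len(seq) x 4 matrix (list of rows)."""
    rows = []
    for base in seq.upper():
        row = [0] * 4
        idx = BASE_INDEX.get(base)
        if idx is not None:  # ambiguous bases such as N stay all-zero
            row[idx] = 1
        rows.append(row)
    return rows


def one_hot_codons(seq, frame=0):
    """Encode the triplets read in one frame as an n_codons x 64 matrix."""
    seq = seq.upper()
    rows = []
    # Step through complete triplets only, starting at the chosen frame.
    for i in range(frame, len(seq) - 2, 3):
        row = [0] * 64
        idx = CODON_INDEX.get(seq[i:i + 3])
        if idx is not None:  # triplets with ambiguous bases stay all-zero
            row[idx] = 1
        rows.append(row)
    return rows
```

Matrices of this shape are a common input format for one-dimensional convolutional networks; whether PlasTrans uses one reading frame or several is not stated in the abstract.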