Convolutional neural network-based annotation of bacterial type IV secretion system effectors with enhanced accuracy and reduced false discovery.
Author affiliations
College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China.
School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China.
Publication information
Brief Bioinform. 2020 Sep 25;21(5):1825-1836. doi: 10.1093/bib/bbz120.
The bacterial type IV secretion system (T4SS) is reported to be one of the most ubiquitous secretion systems (SSs) in nature and can induce serious conditions by secreting type IV SS effectors (T4SEs) into host cells. Recent studies have mainly focused on annotating new T4SEs from the large volume of available sequencing data, and various computational tools have therefore been developed to accelerate T4SE annotation. However, these tools are reported to depend heavily on the methods selected, and their annotation performance needs to be further enhanced. Herein, a convolutional neural network (CNN) technique was used to annotate T4SEs by integrating multiple protein encoding strategies. First, the annotation accuracies of nine encoding strategies integrated with CNN were assessed and compared with those of popular T4SE annotation tools on an independent benchmark. Second, the false discovery rates of the various models were systematically evaluated by (1) scanning the genome of Legionella pneumophila subsp. ATCC 33152 and (2) predicting real-world non-T4SEs validated in published experiments. Based on these analyses, three encoding strategies were identified as well-performing when integrated with CNN: (a) position-specific scoring matrix (PSSM), (b) protein secondary structure & solvent accessibility (PSSSA) and (c) the one-hot encoding scheme (Onehot). Finally, a novel strategy that collectively considers the three well-performing models (CNN-PSSM, CNN-PSSSA and CNN-Onehot) was proposed, and a new tool (CNN-T4SE, https://idrblab.org/cnnt4se/) was constructed to facilitate T4SE annotation. All in all, this study conducted a comprehensive analysis of the performance of a collection of encoding strategies when integrated with CNN, an analysis that could facilitate the suppression of T4SS in infection and limit the spread of antimicrobial resistance.
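To make the one-hot encoding scheme and the idea of combining three models concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the sequence length fed to the CNN or the rule used to combine CNN-PSSM, CNN-PSSSA and CNN-Onehot, so the names `one_hot_encode` and `majority_vote`, the `max_len=50` default, the 0.5 threshold and the majority-vote rule are all assumptions made for illustration.

```python
import numpy as np

# The 20 standard amino acids, in an arbitrary but fixed column order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len=50):
    """Encode a protein sequence as a (max_len, 20) one-hot matrix.

    max_len is a hypothetical value. Sequences longer than max_len are
    truncated; shorter ones are zero-padded. Residues outside the 20
    standard amino acids are left as all-zero rows.
    """
    matrix = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, residue in enumerate(sequence.upper()[:max_len]):
        idx = AA_INDEX.get(residue)
        if idx is not None:
            matrix[pos, idx] = 1.0
    return matrix


def majority_vote(probabilities, threshold=0.5):
    """Hypothetical combination rule, assumed for illustration only.

    Returns True (predicted T4SE) when more than half of the per-model
    probabilities reach the threshold.
    """
    votes = sum(1 for p in probabilities if p >= threshold)
    return votes * 2 > len(probabilities)


if __name__ == "__main__":
    encoded = one_hot_encode("MKTLLX")
    print(encoded.shape)
    # Made-up scores standing in for CNN-PSSM, CNN-PSSSA and CNN-Onehot outputs.
    print(majority_vote([0.9, 0.6, 0.2]))
```

A matrix of this shape is the kind of input a one-dimensional CNN can consume directly; PSSM and PSSSA encodings would produce matrices with the same row-per-residue layout but different column contents, which require external tools not sketched here.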