NHC Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100176, China.
Australian Institute for Machine Learning, The University of Adelaide, Adelaide, SA 5005, Australia.
Bioinformatics. 2020 Jun 1;36(12):3693-3702. doi: 10.1093/bioinformatics/btaa230.
Identification of virulence factors (VFs) is critical to the elucidation of bacterial pathogenesis and prevention of related infectious diseases. Current computational methods for VF prediction either focus on binary classification or cover only a few VF classes that have sufficient samples. However, thousands of VF classes are present in real-world scenarios, and many of them have only a very limited number of samples available.
We first construct a large VF dataset, covering 3446 VF classes with 160 495 sequences, and then propose deep convolutional neural network models for VF classification. We show that (i) for common VF classes with sufficient samples, our models can achieve state-of-the-art performance with an overall accuracy of 0.9831 and an F1-score of 0.9803; (ii) for uncommon VF classes with limited samples, our models can learn transferable features from auxiliary data and achieve good performance with accuracy ranging from 0.9277 to 0.9512 and F1-score ranging from 0.9168 to 0.9446 when combined with different predefined features, outperforming traditional classifiers by 1-13% in accuracy and by 1-16% in F1-score.
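For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how overall accuracy and a multi-class F1-score can be computed from predicted and true class labels. It is not taken from the VFNet source code. The abstract does not state which F1 averaging scheme was used, so the macro average here is an assumption, and the function names and toy labels are purely illustrative.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        # F1 = 2TP / (2TP + FP + FN); the denominator is non-zero because
        # every class in `classes` occurs in y_true or y_pred.
        scores.append(2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy labels, not data from the paper.
    y_true = ["adhesin", "adhesin", "toxin", "toxin", "secretion"]
    y_pred = ["adhesin", "toxin", "toxin", "toxin", "secretion"]
    print(accuracy(y_true, y_pred))
    print(macro_f1(y_true, y_pred))
```

With many rare classes, a macro average gives each class equal weight regardless of its sample count, which is one reason it is often reported alongside accuracy on imbalanced datasets.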
All of our datasets are made publicly available at http://www.mgc.ac.cn/VFNet/, and the source code of our models is publicly available at https://github.com/zhengdd0422/VFNet.
Supplementary data are available at Bioinformatics online.