Khanduja Akshay, Mohanty Debasisa
National Institute of Immunology, Aruna Asaf Ali Marg, New Delhi 110067, India.
NAR Genom Bioinform. 2025 Jan 7;7(1):lqae186. doi: 10.1093/nargab/lqae186. eCollection 2025 Mar.
Small proteins (≤100 amino acids) play important roles across all life forms, from unicellular bacteria to higher organisms. In this study, we developed SProtFP, a machine learning-based method for the functional annotation of prokaryotic small proteins into selected functional categories. SProtFP uses independent artificial neural networks (ANNs), trained on a combination of physicochemical descriptors, to classify small proteins as antitoxin type 2, bacteriocin, DNA-binding, metal-binding, ribosomal protein, RNA-binding, type 1 toxin or type 2 toxin proteins. We also trained a model to identify small open reading frame (smORF)-encoded antimicrobial peptides (AMPs). Comprehensive benchmarking of SProtFP showed an average area under the receiver operating characteristic curve (ROC-AUC) of 0.92 during 10-fold cross-validation, and ROC-AUCs of 0.94 and 0.93 on held-out balanced and imbalanced test sets, respectively. Using SProtFP to annotate bacterial isolates from the human gut microbiome, we identified thousands of remote homologs of known small protein families and assigned putative functions to uncharacterized proteins. This highlights the utility of SProtFP for large-scale functional annotation of microbiome datasets, especially where sequence homology is low. SProtFP is freely available at http://www.nii.ac.in/sprotfp.html and can be combined with genome annotation tools such as ProsmORF-pred to uncover the functional repertoire of novel small proteins in bacteria.
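The benchmarking figures above are ROC-AUC values reported on both balanced and imbalanced test sets. As a minimal sketch of how that metric can be computed, the following uses the pairwise (Mann-Whitney) formulation: the fraction of positive-negative pairs in which the positive example receives the higher score. This is not SProtFP's code; the function name `roc_auc` and all labels and scores are hypothetical and were invented purely for illustration.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Return ROC-AUC as the probability that a random positive outscores
    a random negative. Ties count as half a win (Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores (assumption: not taken from the paper).
balanced_labels = [1, 1, 1, 0, 0, 0]
balanced_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]

# The same positives plus extra negatives, giving an imbalanced test set.
imbalanced_labels = balanced_labels + [0, 0, 0, 0, 0, 0]
imbalanced_scores = balanced_scores + [0.3, 0.25, 0.15, 0.1, 0.05, 0.02]

auc_balanced = roc_auc(balanced_labels, balanced_scores)
auc_imbalanced = roc_auc(imbalanced_labels, imbalanced_scores)
print(f"balanced: {auc_balanced:.3f}  imbalanced: {auc_imbalanced:.3f}")
```

Because ROC-AUC depends only on how positives rank relative to negatives, it does not depend on the class ratio as such, which is one reason it is commonly reported for both balanced and imbalanced evaluations. The pairwise loop is quadratic in the number of examples and is meant for exposition; a rank-based implementation would be preferable for large test sets.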