National Institute of Immunology, Aruna Asaf Ali Marg, New Delhi 110067, India.
Brief Bioinform. 2023 May 19;24(3). doi: 10.1093/bib/bbad101.
Small open reading frames (smORFs) encoding proteins shorter than 100 amino acids (aa) are known to be important regulators of key cellular processes. However, their computational identification remains a challenge. Based on a comprehensive analysis of known prokaryotic small ORFs, we have developed the ProsmORF-pred resource, which uses a machine learning (ML)-based method to predict smORFs in prokaryotic genome sequences. ProsmORF-pred consists of two ML models: one recognizes initiation sites in the nucleic acid sequences upstream of putative start codons, and the other uses translated amino acid sequences to identify functional protein-like sequences. The nucleotide sequence-based initiation site recognition model has been trained using longer ORFs (>100 aa) in the same genome, while the ML model for identification of protein-like sequences has been trained using annotated smORFs from Escherichia coli. Comprehensive benchmarking of ProsmORF-pred reveals that its performance is comparable to other state-of-the-art approaches on the annotated smORF set derived from 32 prokaryotic genomes. It distinctly outperforms other tools such as PRODIGAL and RANSEPS on newly identified smORFs of 10-30 aa, a length range in which prediction has been a major challenge. Apart from identifying smORFs in genomic sequences, ProsmORF-pred can also aid in functional annotation of the predicted smORFs based on sequence similarity and genomic neighbourhood similarity searches in ProsmORFDB, a well-curated database of known smORFs. ProsmORF-pred, along with its backend database ProsmORFDB, is available as a user-friendly web server (http://www.nii.ac.in/prosmorfpred.html).
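The two models described above take different views of each candidate smORF: the nucleotide window upstream of a putative start codon, and the translated peptide. A minimal sketch of how such candidates might be enumerated is shown below. It is not ProsmORF-pred's code. The names `find_candidates` and `Candidate`, the start codon set (ATG, GTG, TTG), the standard codon table, and the 20-nt upstream window are all assumptions made for illustration; only the 10 to 100 aa length range comes from the abstract.

```python
from itertools import product
from typing import Iterator, NamedTuple

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# organism-specific reassignments).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


class Candidate(NamedTuple):
    strand: str    # "+" or "-"
    start: int     # 0-based start on the scanned strand
    end: int       # exclusive end on the scanned strand, stop codon included
    upstream: str  # nucleotides before the start codon (initiation-site view)
    peptide: str   # translated sequence without the stop (protein-like view)


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_candidates(genome: str, min_aa: int = 10, max_aa: int = 100,
                    upstream_len: int = 20) -> Iterator[Candidate]:
    """Yield every start-to-stop ORF encoding min_aa <= length < max_aa residues."""
    genome = genome.upper()
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for i in range(len(seq) - 2):
            if seq[i:i + 3] not in START_CODONS:
                continue
            peptide = []
            # Walk in frame from the start codon to the first stop codon.
            for j in range(i, len(seq) - 2, 3):
                codon = seq[j:j + 3]
                if codon in STOP_CODONS:
                    break
                peptide.append(CODON_TABLE.get(codon, "X"))
            else:
                continue  # ran off the end of the sequence without a stop
            if min_aa <= len(peptide) < max_aa:
                peptide[0] = "M"  # alternative start codons still initiate with Met
                upstream = seq[max(0, i - upstream_len):i]
                yield Candidate(strand, i, j + 3, upstream, "".join(peptide))
```

In a pipeline of the kind the abstract describes, each candidate's `upstream` string would be passed to an initiation-site classifier and its `peptide` string to a protein-likeness classifier. Those models are left out here because the abstract does not specify their features or architecture.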