School of Life Sciences, and Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan 430079, Hubei, People's Republic of China.
School of Computer Science, and Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan 430079, Hubei, People's Republic of China.
Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae147.
Small open reading frames (smORFs) are known to play diverse roles in essential biological pathways and to affect human health in conditions ranging from diabetes to tumorigenesis. Predicting smORFs in silico is a prerequisite for processing omics data. Here, we propose sOCP, a framework for predicting smORF coding potential that provides functions for constructing a model to predict novel smORFs in a given species. The sOCP model constructed for human is based on in-frame features and the nucleotide bias around the start codon; this small feature subset proved sufficient and avoided the overfitting problems of more complicated models. The model achieved better prediction metrics than previous methods and correlated closely with experimental evidence in a heterogeneous dataset. Applied to Rattus norvegicus, it exhibited satisfactory performance. We then scanned the human genome for smORFs with ATG and non-ATG start codons and generated a database containing about a million novel smORFs with coding potential. Around 72 000 of these smORFs are located in lncRNA regions of the genome. The smORF-encoded peptides may be involved in biological pathways that are rare for canonical proteins, including the glucocorticoid catabolic process and the prokaryotic defense system. Our work provides a model and database for human smORF investigation and a convenient tool for smORF prediction in other species.
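The scanning step described in the abstract can be illustrated with a minimal sketch. This is not the sOCP implementation: the function name `scan_smorfs`, the set of non-ATG start codons (near-cognate CTG, GTG, TTG), the length limits, and the flank width are all assumptions chosen for illustration, since the abstract does not specify them. The sketch scans the three forward frames only and reports every candidate start codon that reaches an in-frame stop, together with the nucleotides around the start codon from which a start-context feature could be derived.

```python
# Illustrative smORF scanner (forward strand only). All names and
# thresholds below are assumptions, not values taken from sOCP.

STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumed start set: ATG plus three near-cognate (non-ATG) codons.
START_CODONS = {"ATG", "CTG", "GTG", "TTG"}


def scan_smorfs(seq, starts=START_CODONS, min_codons=10, max_codons=100, flank=6):
    """Return candidate smORFs as dicts with start, end, codon count and start context.

    A candidate begins at a start codon and ends at the first in-frame stop codon.
    `n_codons` counts the start codon but not the stop codon. Nested candidates
    that share a stop codon are all reported.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] not in starts:
                continue
            # Walk codon by codon until the first in-frame stop or the sequence end.
            j = i + 3
            while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                j += 3
            if j + 3 > len(seq):
                continue  # no in-frame stop codon: not a complete ORF
            n_codons = (j - i) // 3
            if min_codons <= n_codons <= max_codons:
                hits.append({
                    "start": i,                      # 0-based index of the start codon
                    "end": j + 3,                    # exclusive end, stop codon included
                    "frame": frame,
                    "start_codon": seq[i:i + 3],
                    "n_codons": n_codons,
                    # Nucleotides flanking the start codon (truncated at sequence ends).
                    "context": seq[max(0, i - flank):i + 3 + flank],
                })
    return hits


if __name__ == "__main__":
    example = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
    for hit in scan_smorfs(example):
        print(hit)
```

Restricting the search to canonical starts only requires passing `starts={"ATG"}`. A real genome scan would also need to handle the reverse strand and feed the extracted features to a trained classifier, which this sketch leaves out.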