D-sORF: Accurate Ab Initio Classification of Experimentally Detected Small Open Reading Frames (sORFs) Associated with Translational Machinery.

Author information

Perdikopanis Nikos, Giannakakis Antonis, Kavakiotis Ioannis, Hatzigeorgiou Artemis G

Affiliations

Department of Electrical and Computer Engineering, University of Thessaly, 38221 Volos, Greece.

Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, 15784 Athens, Greece.

Publication information

Biology (Basel). 2024 Jul 26;13(8):563. doi: 10.3390/biology13080563.

DOI:10.3390/biology13080563

PMID:39194501

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11351124/

Abstract

Small open reading frames (sORFs; <300 nucleotides or <100 amino acids) are widespread across all genomes, and an increasing variety of them appear to be translated from non-genic regions. Over the past few decades, peptides produced from sORFs have been identified as functional in various organisms, from bacteria to humans. Despite recent advances in next-generation sequencing and proteomics, accurate annotation and classification of sORFs remain a rate-limiting step toward reliable, high-throughput detection of small proteins from non-genic regions. Computational methods based on machine learning cost less than biological experiments and can be used to detect sORFs, laying the groundwork for subsequent experiments. We present D-sORF, a machine-learning framework that integrates the statistical nucleotide context and motif information around the start codon to predict coding sORFs. D-sORF scores directly for coding identity and requires only the underlying genomic sequence. It does not incorporate parameters such as conservation, which, in the case of sORFs, may increase the dispersion of scores within the significantly less conserved non-genic regions. D-sORF achieves 94.74% precision and 92.37% accuracy for small ORFs (using the 99 nt medium-length window). When D-sORF is applied to sORFs associated with ribosomes, its identification of transcripts producing peptides (annotated by their Ensembl IDs) is similar or superior to that of experimental methodologies based on ribosome-sequencing (Ribo-Seq) profiling. In parallel, the recognition of putative negative data, such as intron-containing transcripts that associate with ribosomes, remains remarkably low. This indicates that D-sORF could be applied efficiently to filter out false-positive sORFs in Ribo-Seq data that arise from non-productive ribosomal binding or from noise inherent in these protocols.
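To make the sORF definition and the idea of a start-codon context concrete, the sketch below scans a DNA string for candidate sORFs and extracts the nucleotides flanking each start codon. It is an illustration only and is not the D-sORF implementation: the function names `find_sorfs` and `start_context`, the forward-strand ATG-only scan, the choice to count the stop codon toward the 300 nt limit, and the `flank` size are all assumptions made for this example.

```python
# Illustrative sketch only; not the D-sORF method or its feature set.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def find_sorfs(seq, max_nt=300):
    """Return (start, end) pairs of forward-strand ATG...stop ORFs shorter than max_nt.

    `end` is exclusive and includes the stop codon (an assumption of this sketch).
    Every in-frame ATG is reported, so nested ORFs that share a stop codon
    appear as separate candidates.
    """
    seq = seq.upper()
    sorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the start until the first in-frame stop.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if end - start < max_nt:
                    sorfs.append((start, end))
                break
    return sorfs


def start_context(seq, start, flank=6):
    """Return `flank` nt upstream + start codon + `flank` nt downstream.

    Returns None when the window would run past either end of the sequence.
    `flank=6` is a hypothetical value chosen for the example.
    """
    lo, hi = start - flank, start + 3 + flank
    if lo < 0 or hi > len(seq):
        return None
    return seq[lo:hi].upper()


if __name__ == "__main__":
    toy = "CCGCCACCATGGCTTGGTAACCGG"  # made-up sequence, not real data
    for s, e in find_sorfs(toy):
        print(s, e, toy[s:e], start_context(toy, s))
```

A classifier in the spirit of the abstract would take such context windows, together with statistics of the ORF body, as input features; how D-sORF actually encodes them is not specified here.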
