Kyrouz Alison, Liu Lian, Qin Lixin, Tjaden Brian
Department of Computer Science, Wellesley College, Wellesley, MA 02481, United States.
Bioinformatics. 2025 May 6;41(5). doi: 10.1093/bioinformatics/btaf250.
The most challenging prokaryotic genes to identify often correspond to short ORFs (sORFs) encoding small proteins or to noncoding RNAs. RNA-seq experiments commonly reveal small transcripts that do not correspond to annotated genes and are candidates for novel coding sORFs or small regulatory RNAs, but it can be difficult to assess accurately whether these numerous small transcripts are coding or noncoding. We present Popcorn (PrOkaryotic Prediction of Coding OR Noncoding), a novel machine learning method for determining whether prokaryotic sequences are coding or noncoding. We find that Popcorn is effective in distinguishing coding from noncoding sequences, including coding sORFs and noncoding RNAs.
Freely available for use on the web at https://cs.wellesley.edu/~btjaden/Popcorn. Source code available at https://github.com/btjaden/Popcorn and https://doi.org/10.5281/zenodo.15120075.