A method for evaluating of RNA's coding potential using the interaction effects of open reading frames and high-energy scalograms.

Author information

College of Forestry, Nanjing Forestry University, Longpan, Nanjing, 210037, Jiangsu, China; College of Information Science and Technology, Nanjing Forestry University, Longpan, Nanjing, 210037, Jiangsu, China.

The First Affiliated Hospital of Xi'an Jiaotong University, 277 West Yanta Road, Xi'an, 710061, Shaanxi, China.

Publication information

Comput Biol Med. 2024 Jan;168:107752. doi: 10.1016/j.compbiomed.2023.107752. Epub 2023 Nov 23.

DOI:10.1016/j.compbiomed.2023.107752

PMID:38007977

Abstract

The identification and functional characterization of long non-coding RNAs (lncRNAs) can improve our understanding of transcriptional regulation in both normal development and disease pathology, which calls for methods that distinguish lncRNAs from protein-coding RNAs (pcRNAs) in sequencing data. Many algorithms based on the statistical, structural, physical, and chemical properties of sequences have been developed to evaluate the coding potential of RNA for this purpose. To design general-purpose features that are accurate without relying on hyperparameter tuning and optimization, we derived a series of features from the effects of open reading frames (ORFs) on their mutual interactions and from their interactions with the electrical intensity of sequence sites, with the aim of further improving screening accuracy. A single model built from our designed features meets the strong classifier criteria, with accuracy between 82% and 89%, and the prediction accuracy of the model built after adding auxiliary features equals or exceeds that of some of the best classification tools. Moreover, compared with other methods, our method requires no special hyperparameter tuning and is species-insensitive, so it can be readily applied to a wide range of species. We also find some correlations between the features, which provide a reference for follow-up studies.
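As a rough illustration of what an ORF-derived feature can look like, the sketch below scans the three forward reading frames of a transcript and reports the ORF count, the longest ORF length, and the fraction of the transcript that the longest ORF covers. This is a generic, assumed example and not the feature set described in the abstract: the function names `find_orfs` and `orf_features`, the choice of AUG as the only start codon, and the three summary features are all hypothetical choices made for illustration.

```python
# Minimal sketch of generic ORF features (assumed for illustration;
# these are NOT the interaction features designed in the paper).

START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}


def find_orfs(seq):
    """Return (start, end) pairs for ORFs in the three forward frames.

    `end` is exclusive and includes the stop codon. Within each frame,
    an ORF runs from the first AUG to the next in-frame stop codon.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA or RNA alphabets
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                orfs.append((start, i + 3))
                start = None
    return orfs


def orf_features(seq):
    """Summarize ORFs as a small feature dictionary."""
    orfs = find_orfs(seq)
    if not orfs:
        return {"orf_count": 0, "longest_orf_len": 0,
                "longest_orf_coverage": 0.0}
    longest = max(end - start for start, end in orfs)
    return {
        "orf_count": len(orfs),
        "longest_orf_len": longest,
        "longest_orf_coverage": longest / len(seq),
    }


if __name__ == "__main__":
    print(orf_features("GGAUGAAAUUUUAGCC"))
```

Features of this kind need no tuned hyperparameters, which is the property the abstract emphasizes. A real pipeline would likely also consider the reverse strand and ORFs that lack a stop codon.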
