

CPPVec: an accurate coding potential predictor based on a distributed representation of protein sequence.

Affiliations

School of Computer Science, Hubei University of Technology, Wuhan, China.

School of Computer Science and Technology, Xidian University, Xi'an, China.

Publication information

BMC Genomics. 2023 May 17;24(1):264. doi: 10.1186/s12864-023-09365-7.

Abstract

Long non-coding RNAs (lncRNAs) play a crucial role in numerous biological processes and have received wide attention in recent years. Because the rapid development of high-throughput transcriptome sequencing technologies (RNA-seq) has produced a large amount of RNA data, there is an urgent need for a fast and accurate coding potential predictor. Many computational methods have been proposed to address this issue; they usually exploit information on the open reading frame (ORF), protein sequence, k-mers, evolutionary signatures, or homology. Despite the effectiveness of these approaches, there is still much room for improvement. Indeed, none of these methods exploits the contextual information of the RNA sequence. For example, k-mer features, which count the occurrence frequencies of runs of k contiguous nucleotides (k-mers) across the whole RNA sequence, cannot reflect the local context of each k-mer. To address this shortcoming, we present CPPVec, a novel alignment-free method that, for the first time, exploits the contextual information of the RNA sequence for coding potential prediction. It can be easily implemented with a distributed representation (e.g., doc2vec) of the protein sequence translated from the longest ORF. The experimental findings demonstrate that CPPVec is an accurate coding potential predictor and significantly outperforms existing state-of-the-art methods.
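The preprocessing the abstract describes (take the longest ORF, translate it to protein, then feed the protein to a doc2vec-style model) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `longest_orf`, `translate`, and `protein_words` are hypothetical, and several choices are assumptions the abstract does not state: only the three forward frames are scanned, an ORF must run from ATG to an in-frame stop codon, and the protein is tokenized into overlapping k-length "words".

```python
from itertools import product

# Standard genetic code, built in TCAG order (DNA alphabet; U is mapped to T).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(rna):
    """Return the longest ATG..stop ORF (stop excluded) over the 3 forward frames.

    Assumption: ORFs without an in-frame stop codon are ignored.
    """
    seq = rna.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                orf = seq[start:i]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


def translate(orf):
    """Translate an in-frame nucleotide sequence; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(orf[i:i + 3], "X") for i in range(0, len(orf) - 2, 3)
    )


def protein_words(protein, k=3):
    """Split a protein into overlapping k-length words (hypothetical tokenization)."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]


if __name__ == "__main__":
    orf = longest_orf("CCAUGGCUAAAUGAGG")
    protein = translate(orf)
    print(orf, protein, protein_words(protein, k=2))
```

The resulting word list is the kind of input a doc2vec implementation (for example, gensim's `Doc2Vec` with `TaggedDocument`) expects. Unlike a whole-sequence k-mer frequency count, such a model is trained on each word together with its neighbours, which is the contextual information the abstract refers to.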


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/de45/10193750/10c99140a6cc/12864_2023_9365_Fig1_HTML.jpg
