Zhang Jingjing, Hossain Md Tofazzal, Liu Weiguo, Peng Yin, Pan Yi, Wei Yanjie
University of Chinese Academy of Sciences, Beijing, China.
Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China.
Front Genet. 2022 Feb 14;13:816825. doi: 10.3389/fgene.2022.816825. eCollection 2022.
Functional studies of circRNAs have increased over the past decade because of their important roles as microRNA sponges, in protein coding, and in the initiation and progression of diseases. The study of circRNA function depends on full-length circRNA sequences, and current sequence assembly methods based on short reads face challenges because of the presence of linear transcripts. Long reads produced by long-read sequencing techniques such as Nanopore technology can cover full-length circRNA sequences and can therefore be used to evaluate the correctness and completeness of full-length circRNA sequences assembled from short reads of the same sample. Using long reads from the same samples, one human and one mouse, we comprehensively evaluated several well-known short-read circRNA sequence assembly algorithms, including circseq_cup, CIRI-full, and CircAST. Based on the F1 score, CIRI-full performed better on the human datasets, whereas CircAST performed better on the mouse datasets. In general, each algorithm was developed to handle specific situations. Our results indicate that no single assembly algorithm performed best in all cases; these assembly algorithms should therefore be used together for reliable full-length circRNA sequence reconstruction. After analyzing the results, we introduced a screening protocol that retains, as the final result, only exonic circRNAs whose full-length sequences consist of all exons between the back-splice sites. After screening, CIRI-full performed better on both the human and mouse datasets. The average F1 score of CIRI-full across four circRNA identification algorithms increased from 0.4788 to 0.5069 on the human datasets and from 0.2995 to 0.4223 on the mouse datasets.
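The two quantitative ideas in the abstract, the F1 score and the exon-completeness screen, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `f1_score` and `passes_screen`, the representation of an assembled circRNA as a string key or a list of `(start, end)` exon tuples, and the choice to count a prediction as correct only when it exactly matches a long-read-derived reference. The authors' actual matching criteria may differ.

```python
from typing import Iterable, Set, Tuple

Exon = Tuple[int, int]  # (start, end); hypothetical coordinate convention


def f1_score(predicted: Set[str], reference: Set[str]) -> float:
    """F1 of assembled full-length sequences against long-read references.

    Assumption: a prediction is a true positive only on exact key match.
    """
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)


def passes_screen(
    isoform_exons: Iterable[Exon],
    annotated_exons: Iterable[Exon],
    bsj_start: int,
    bsj_end: int,
) -> bool:
    """Keep an exonic circRNA only if it contains every annotated exon
    lying between its back-splice sites, and nothing else."""
    expected = {
        exon
        for exon in annotated_exons
        if bsj_start <= exon[0] and exon[1] <= bsj_end
    }
    return bool(expected) and set(isoform_exons) == expected


# Toy data, invented for illustration only.
predicted = {"circA", "circB", "circC"}
reference = {"circB", "circC", "circD", "circE"}
score = f1_score(predicted, reference)

annotation = [(100, 200), (300, 400), (500, 600), (700, 800)]
complete = passes_screen([(100, 200), (300, 400), (500, 600)], annotation, 100, 600)
skipped = passes_screen([(100, 200), (500, 600)], annotation, 100, 600)
```

In this toy case two of three predictions are correct and two of four references are recovered, so precision is 2/3 and recall is 1/2. The isoform that skips the middle exon is rejected by the screen, while the one containing all three exons between the back-splice sites is kept.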