Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China.
University of Chinese Academy of Sciences, Beijing 100049, China.
RNA. 2021 Jan;27(1):80-98. doi: 10.1261/rna.074724.120. Epub 2020 Oct 14.
High-throughput RNA sequencing has revealed the complexity of the transcriptome and greatly increased the number of recorded long noncoding RNAs (lncRNAs), which have been reported to participate in a variety of biological processes. Identifying lncRNAs is a key step in lncRNA analysis, and many bioinformatics tools have been developed for this purpose in recent years. While these tools allow lncRNAs to be identified more efficiently and accurately, they may produce inconsistent results, which makes choosing among them difficult. We compared the performance of 41 analysis models based on 14 software packages on different data sets, including high-quality and low-quality data from 33 species. In addition, we explored the computational efficiency, robustness, and joint prediction of the models. As practical guidance, we summarize key points for lncRNA identification under different situations. In this investigation, no single model was superior to the others under all test conditions. The performance of a model depended to a great extent on the source of the transcripts and the quality of the assemblies. As general references, FEELnc_all_cl, CPC, and CPAT_mouse work well in most species, while COME, CNCI, and lncScore are good choices for model organisms. Because these tools are sensitive to factors such as the species involved and the quality of the assembly, researchers must select a tool carefully based on their actual data. Alternatively, our tests suggest that joint prediction can outperform any single model if suitable models are chosen. All scripts and data used in this research can be accessed at http://bioinfo.ihb.ac.cn/elit.
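The abstract notes that joint prediction may outperform any single model when suitable models are chosen, but it does not say how the predictions are combined. As a minimal sketch, assuming a simple vote-counting rule (an assumption for illustration, not necessarily the scheme used in the study), a transcript could be accepted as a lncRNA when at least `min_votes` models call it one. The function name `joint_predict`, the model labels, and the transcript IDs below are all hypothetical.

```python
from collections import Counter


def joint_predict(calls, min_votes):
    """Combine lncRNA calls from several models by vote counting.

    calls     -- dict mapping a model name to the set of transcript IDs
                 that the model classified as lncRNA (hypothetical input format)
    min_votes -- minimum number of models that must agree on a transcript

    Returns the set of transcript IDs called lncRNA by at least
    `min_votes` models.
    """
    if not 1 <= min_votes <= len(calls):
        raise ValueError("min_votes must be between 1 and the number of models")

    votes = Counter()
    for transcript_ids in calls.values():
        # Each model contributes at most one vote per transcript.
        votes.update(set(transcript_ids))

    return {tid for tid, n in votes.items() if n >= min_votes}


# Toy input: placeholder model names and transcript IDs, not real results.
example_calls = {
    "model_a": {"tx1", "tx2", "tx3"},
    "model_b": {"tx1", "tx2"},
    "model_c": {"tx2", "tx4"},
}

union_calls = joint_predict(example_calls, min_votes=1)
majority_calls = joint_predict(example_calls, min_votes=2)
intersection_calls = joint_predict(example_calls, min_votes=3)
```

Setting `min_votes` to 1 gives the union of all models (favoring sensitivity), while setting it to the number of models gives their intersection (favoring specificity). Which threshold and which models are appropriate would depend on the species and assembly quality, as the abstract emphasizes.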