Deep learning tools are top performers in long non-coding RNA prediction.

Author affiliations

Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland.

Institute of Biomedicine, University of Turku, Turku, Finland.

Publication information

Brief Funct Genomics. 2022 May 21;21(3):230-241. doi: 10.1093/bfgp/elab045.

DOI:10.1093/bfgp/elab045

PMID:35136929

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9123429/

Abstract

The increasing amount of transcriptomic data has brought to light vast numbers of potential novel RNA transcripts. Accurately distinguishing novel long non-coding RNAs (lncRNAs) from protein-coding messenger RNAs (mRNAs) has challenged bioinformatic tool developers. Most recently, tools implementing deep learning architectures have been developed for this task, with the potential to discover sequence features and interactions that are not yet part of current knowledge. We compared the performance of deep learning tools with other predictive tools that are currently used in lncRNA coding potential prediction. A total of 15 tools representing the variety of available methods were investigated. In addition to known annotated transcripts, we also evaluated the use of the tools in actual studies with real-life data. The robustness and scalability of the tools' performance were tested using test sets of varying sizes and test sets with different proportions of lncRNAs and mRNAs. In addition, the ease of use of each tested tool was scored. Deep learning tools were top performers in most metrics and labelled transcripts similarly to one another in the real-life dataset. However, the proportion of lncRNAs and mRNAs in the test sets affected the performance of all tools. The top-ranking tools differed in their use of computational resources, so the nature of the study may affect the choice of one well-performing tool over another. Nonetheless, the results suggest favouring the novel deep learning tools over other tools currently in broad use.
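The abstract notes that the proportion of lncRNAs to mRNAs in a test set affected every tool's performance. The sketch below is one way to see why this can happen in general. It is not the authors' evaluation code: the function names, the choice of lncRNA as the positive class, and the sensitivity and specificity values of 0.9 are all hypothetical and chosen only for illustration. It computes the confusion-matrix counts a classifier with fixed per-class behaviour would be expected to produce, then derives metrics from them.

```python
import math


def expected_counts(n_lncrna, n_mrna, sensitivity, specificity):
    """Expected confusion-matrix counts, treating lncRNA as the positive class.

    sensitivity: fraction of lncRNAs labelled lncRNA (hypothetical value).
    specificity: fraction of mRNAs labelled mRNA (hypothetical value).
    """
    tp = sensitivity * n_lncrna
    fn = n_lncrna - tp
    tn = specificity * n_mrna
    fp = n_mrna - tn
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Precision, accuracy and Matthews correlation coefficient (MCC)."""
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return {"precision": precision, "accuracy": accuracy, "mcc": mcc}


if __name__ == "__main__":
    # Same hypothetical classifier, three test-set compositions.
    for n_lnc, n_m in [(1000, 1000), (500, 1000), (100, 1000)]:
        m = metrics(*expected_counts(n_lnc, n_m, 0.9, 0.9))
        print(n_lnc, n_m, {k: round(v, 3) for k, v in m.items()})
```

Under these assumptions, accuracy stays the same across compositions while precision and MCC fall as lncRNAs become rarer, which is one reason a benchmark may report several metrics and several class ratios rather than a single score.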
