Knotify: An Efficient Parallel Platform for RNA Pseudoknot Prediction Using Syntactic Pattern Recognition.

Author information

Andrikos Christos, Makris Evangelos, Kolaitis Angelos, Rassias Georgios, Pavlatos Christos, Tsanakas Panayiotis

Affiliations

School of Electrical and Computer Engineering, National Technical University of Athens, 9 Iroon Polytechniou St., 15780 Athens, Greece.

Hellenic Air Force Academy, Dekelia Air Base, Acharnes, 13671 Athens, Greece.

Publication information

Methods Protoc. 2022 Feb 2;5(1):14. doi: 10.3390/mps5010014.

DOI:10.3390/mps5010014

PMID:35200530

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8876629/

Abstract

Obtaining valuable clues about noncoding RNA (ribonucleic acid) subsequences remains a significant challenge, given that most of the human genome is transcribed into noncoding RNA associated with unknown biological functions. Capturing these clues relies on accurate "base pairing" prediction, also known as "RNA secondary structure prediction". With COVID-19 considered a severe global threat, the single-stranded SARS-CoV-2 virus underscores the importance of establishing an efficient RNA analysis toolkit. This work aims to contribute to that effort by introducing a novel system for predicting RNA secondary structure patterns (i.e., RNA pseudoknots) that leverages syntactic pattern-recognition strategies. Focusing on pseudoknot prediction, we formalized RNA secondary structure prediction as primarily a parsing problem and secondarily an optimization problem. The proposed methodology addresses the prediction of first-order (H-type) pseudoknots. We introduce a context-free grammar (CFG) with enough expressive power to recognize potential pseudoknot patterns. An alternative methodology for detecting possible pseudoknots, based on a brute-force algorithm, is also implemented. Any input sequence may exhibit multiple potential folding patterns, so a strict methodology is required to determine the single biologically realistic one. To resolve this ambiguity efficiently, we employed a novel heuristic built on the widely accepted notion of free-energy minimization, which uses each pattern's context to identify the most prominent pseudoknot pattern. The overall process has polynomial-time complexity, and its parallel implementation improves end performance in proportion to the deployed hardware. The proposed methodology succeeds in predicting the core stems of the RNA pseudoknots in the test dataset, achieving a recall of 76.4%. It achieved an F1-score of 0.774 and a Matthews correlation coefficient (MCC) of 0.543 in discovering all the stems of an RNA sequence, outperforming in this particular task. Measurements on a dataset of 262 RNA sequences showed speedups of 1.31, 3.45, and 7.76 relative to three well-known platforms. The implementation source code is publicly available in the knotify GitHub repository.
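
To make the brute-force idea concrete, the following is a minimal sketch of how candidate core stems of an H-type pseudoknot might be enumerated. It is not the Knotify implementation: the function names (`core_stem_candidates`, `to_dot_bracket`), the inclusion of G-U wobble pairs, and the absence of any loop-length or free-energy filtering are all assumptions made for illustration. An H-type pseudoknot is modeled here simply as two crossing base pairs, positions i < k < j < l with i paired to j and k paired to l.

```python
from itertools import combinations

# Assumption for this sketch: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a: str, b: str) -> bool:
    """Return True if nucleotides a and b can form a base pair."""
    return (a, b) in PAIRS


def core_stem_candidates(seq: str):
    """Yield (i, k, j, l) with i < k < j < l such that seq[i] pairs with
    seq[j] and seq[k] pairs with seq[l], i.e. two crossing base pairs.

    Hypothetical helper: enumerates all index quadruples, so it runs in
    O(n^4) time and applies no energy model or loop-length constraints.
    """
    seq = seq.upper()
    # combinations() yields index tuples in strictly increasing order.
    for i, k, j, l in combinations(range(len(seq)), 4):
        if can_pair(seq[i], seq[j]) and can_pair(seq[k], seq[l]):
            yield (i, k, j, l)


def to_dot_bracket(n: int, i: int, k: int, j: int, l: int) -> str:
    """Render one candidate in extended dot-bracket notation:
    '(' and ')' mark the first pair, '[' and ']' the crossing pair."""
    chars = ["."] * n
    chars[i], chars[j] = "(", ")"
    chars[k], chars[l] = "[", "]"
    return "".join(chars)


if __name__ == "__main__":
    sequence = "AGUC"
    for cand in core_stem_candidates(sequence):
        print(cand, to_dot_bracket(len(sequence), *cand))
```

In a fuller pipeline, each candidate would then be extended into complete stems and ranked, for instance by an estimated free energy, to select a single prediction; that ranking step is deliberately left out of this sketch.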
