Feature extraction approaches for biological sequences: a comparative study of mathematical features.

Affiliations

Department of Computer Science, Bioinformatics Graduate Program (PPGBIOINFO), Federal University of Technology - Paraná, UTFPR, Campus Cornélio Procópio, 86300-000, Brazil.

Institute of Mathematics and Computer Sciences, University of São Paulo - USP, São Carlos, 13566-590, Brazil.

Publication information

Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbab011.

DOI:10.1093/bib/bbab011

PMID:33585910

Abstract

As a consequence of various genomic sequencing projects, an increasing volume of biological sequence data is being produced. Although machine learning algorithms have been successfully applied to a large number of genomic sequence-related problems, the results are strongly affected by the type and number of features extracted. This effect has motivated new algorithms and pipeline proposals, mainly addressing feature extraction, where obtaining significant discriminatory information from a biological dataset is challenging. With this in mind, our work presents a new study of feature extraction approaches based on mathematical features (numerical mapping with Fourier, entropy and complex networks). As a case study, we analyze long non-coding RNA (lncRNA) sequences. We divide this work into three studies. First, we assess our proposal on the problem most often addressed in our review, namely lncRNA versus mRNA; second, we validate the mathematical features on different classification problems that predict the class of lncRNA, e.g. circular RNA sequences; third, we analyze the robustness of the approach in scenarios with imbalanced data. The experimental results demonstrate three main contributions: first, an in-depth study of several mathematical features; second, a new feature extraction pipeline; and third, its high performance and robustness for distinct RNA sequence classification tasks. Availability: https://github.com/Bonidia/FeatureExtraction_BiologicalSequences.
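
To make the idea of "mathematical features" more concrete, the following is a minimal Python sketch of two of the feature families named in the abstract: a numerical mapping followed by a Fourier power spectrum, and a Shannon entropy over k-mer frequencies. It is an illustration only, not the authors' pipeline. The mapping table `INTEGER_MAP`, the function names, the choice of summary statistics and the naive DFT are all assumptions made for this example; the linked repository should be consulted for the actual implementation.

```python
import cmath
import math
from collections import Counter

# Hypothetical integer mapping, chosen for illustration only;
# the scheme used in the paper may differ.
INTEGER_MAP = {"T": 0, "U": 0, "C": 1, "A": 2, "G": 3}


def numerical_mapping(seq):
    """Turn a nucleotide string into a list of numbers (KeyError on unknown symbols)."""
    return [INTEGER_MAP[base] for base in seq.upper()]


def power_spectrum(signal):
    """Naive O(n^2) discrete Fourier transform; returns |X[k]|^2 for each k."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        x_k = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, v in enumerate(signal))
        spectrum.append(abs(x_k) ** 2)
    return spectrum


def fourier_features(seq):
    """Summary statistics of the power spectrum, skipping the DC term (k = 0)."""
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two symbols")
    spec = power_spectrum(numerical_mapping(seq))[1:]
    mean = sum(spec) / len(spec)
    var = sum((p - mean) ** 2 for p in spec) / len(spec)
    return {"mean": mean, "std": math.sqrt(var), "min": min(spec), "max": max(spec)}


def kmer_shannon_entropy(seq, k):
    """Shannon entropy (in bits) of the k-mer frequency distribution of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    example = "ACGUACGGAUCC"  # made-up sequence, not taken from any dataset
    print(fourier_features(example))
    print(kmer_shannon_entropy(example, 2))
```

In such a sketch, each sequence yields a fixed-length feature vector regardless of its length, which is what allows a standard classifier to be trained on sequences of varying size. For long sequences a fast Fourier transform would replace the quadratic-time loop above.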
