Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Paraná, Cornélio Procópio, PR 86300-000, Brazil.
Empresa Brasileira de Pesquisa Agropecuária, Embrapa Café, Brasília, DF 70770-901, Brazil.
Nucleic Acids Res. 2018 Sep 19;46(16):e96. doi: 10.1093/nar/gky462.
With the emergence of next-generation sequencing (NGS) technologies, a large volume of sequence data, in particular from de novo sequencing, has been rapidly produced at relatively low cost. In this context, computational tools are increasingly important to assist in the identification of relevant information to understand the functioning of organisms. This work introduces BASiNET, an alignment-free tool for classifying biological sequences based on feature extraction from complex network measurements. The method first transforms the sequences and represents them as complex networks. It then extracts topological measures and constructs a feature vector that is used to classify the sequences. The method was evaluated in the classification of coding and non-coding RNAs of 13 species and compared to the CNCI, PLEK and CPC2 methods. BASiNET outperformed all compared methods in all adopted organisms and datasets. BASiNET classified sequences in all organisms with high accuracy and low standard deviation, showing that the method is robust and not biased by the organism. The proposed methodology is implemented as open-source software in the R language and is freely available for download at https://cran.r-project.org/package=BASiNET.
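The pipeline outlined in the abstract (sequence to network, network to topological measures, measures to feature vector) can be illustrated with a minimal Python sketch. This is not the BASiNET implementation, which is an R package. The abstract does not specify how sequences are mapped to networks or which measures are used, so the choices here are assumptions made for illustration only: nodes are overlapping words of length 3, edges link consecutive words, edges are filtered by a weight threshold, and the measures are mean, maximum and minimum degree plus average clustering coefficient. All function names and parameters (`sequence_to_network`, `topological_features`, `feature_vector`, `word`, `step`, `thresholds`) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def sequence_to_network(seq, word=3, step=1):
    """Map a sequence to an undirected weighted graph (assumed mapping).

    Nodes are words of length `word`; an edge links two consecutive words,
    and its weight counts how often that adjacency occurs.
    """
    seq = seq.upper()
    words = [seq[i:i + word] for i in range(0, len(seq) - word + 1, step)]
    edges = Counter()
    for a, b in zip(words, words[1:]):
        if a != b:  # ignore self-loops
            edges[frozenset((a, b))] += 1
    return edges


def topological_features(edges, threshold=1):
    """Return [mean degree, max degree, min degree, average clustering]
    for the subgraph of edges whose weight is >= threshold."""
    adj = {}
    for pair, weight in edges.items():
        if weight >= threshold:
            a, b = tuple(pair)
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    if not adj:
        return [0.0, 0.0, 0.0, 0.0]

    def clustering(node):
        # Fraction of neighbour pairs that are themselves connected.
        nbrs = adj[node]
        k = len(nbrs)
        if k < 2:
            return 0.0
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        return 2.0 * links / (k * (k - 1))

    degrees = [len(nbrs) for nbrs in adj.values()]
    avg_clustering = sum(clustering(n) for n in adj) / len(adj)
    return [
        sum(degrees) / len(degrees),
        float(max(degrees)),
        float(min(degrees)),
        avg_clustering,
    ]


def feature_vector(seq, thresholds=(1, 2)):
    """Concatenate the measures obtained at each edge-weight threshold."""
    edges = sequence_to_network(seq)
    vector = []
    for t in thresholds:
        vector.extend(topological_features(edges, t))
    return vector
```

A vector such as `feature_vector("ATGATGATG")` could then be passed, together with a coding or non-coding label, to any standard supervised classifier. The classifier itself is left out because the abstract does not name one.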