Genomic style: yet another deep-learning approach to characterize bacterial genome sequences.
Author information
Yoshimura Yuka, Hamada Akifumi, Augey Yohann, Akiyama Manato, Sakakibara Yasubumi
Affiliation
Department of Biosciences and Informatics, Keio University, Yokohama 223-8522, Japan.
Publication information
Bioinform Adv. 2021 Dec 1;1(1):vbab039. doi: 10.1093/bioadv/vbab039. eCollection 2021.
MOTIVATION
Biological sequence classification is the most fundamental task in bioinformatics analysis. For example, in metagenome analysis, binning is a typical type of DNA sequence classification. In order to classify sequences, it is necessary to define sequence features. The k-mer frequency, base composition and alignment-based metrics are commonly used. On the other hand, in the field of image recognition using machine learning, image classification approaches are broadly divided into those based on shape and those based on style. A style matrix was introduced as a method of expressing the style of an image (e.g. color usage and texture).
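To make the k-mer frequency feature mentioned above concrete, the following is a minimal sketch of how such a vector might be computed for one sequence. The function name `kmer_frequencies`, the default `k=4` and the choice to skip k-mers containing non-ACGT characters are illustrative assumptions, not details taken from this work or from any particular binning tool.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalised k-mer frequencies over the ACGT alphabet.

    Hypothetical helper for illustration only. Windows that contain
    characters other than A, C, G or T (e.g. N) are ignored.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all 4**k possible k-mers in a fixed order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


freqs = kmer_frequencies("ACGTACGTAC", k=2)
# freqs["AC"] is 3/9: "AC" occurs in 3 of the 9 overlapping 2-mers.
```

Binning tools that rely on this feature typically compare such vectors between contigs; the comparison step is omitted here.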
RESULTS
We propose a novel sequence feature, called genomic style, inspired by image classification approaches, for classifying and clustering DNA sequences. As with the style of images, a DNA sequence is considered to have a genomic style unique to its bacterial species, and the style matrix concept is applied to the DNA sequence. Our main aim is to introduce the genomic style as yet another basic sequence feature for the metagenome binning problem, in place of the most commonly used sequence feature, k-mer frequency. Performance evaluations showed that our method using a style matrix has the potential for accurate binning when compared with state-of-the-art binning tools based on k-mer frequency.
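As a rough illustration of the style matrix idea, the sketch below computes a Gram matrix of convolutional feature maps for a one-hot encoded DNA sequence, which is how a style matrix is usually defined in image style transfer. Everything specific here is an assumption: the helper names (`one_hot`, `feature_maps`, `style_matrix`), the single convolution layer, the filter count and width, and the random untrained filters stand in for whatever trained network the method actually uses, which the abstract does not describe.

```python
import numpy as np


def one_hot(seq):
    """Encode a DNA string as a 4 x L array (rows: A, C, G, T)."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((4, len(seq)))
    for pos, base in enumerate(seq.upper()):
        if base in index:  # other characters stay all-zero
            x[index[base], pos] = 1.0
    return x


def feature_maps(x, filters):
    """Valid 1-D cross-correlation followed by ReLU.

    filters has shape (n_filters, 4, width); the result has shape
    (n_filters, L - width + 1).
    """
    n_filters, _, width = filters.shape
    n_pos = x.shape[1] - width + 1
    out = np.empty((n_filters, n_pos))
    for p in range(n_pos):
        window = x[:, p:p + width]
        out[:, p] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)


def style_matrix(features):
    """Gram matrix of the feature maps, averaged over positions."""
    return features @ features.T / features.shape[1]


rng = np.random.default_rng(0)
filters = rng.normal(size=(8, 4, 5))  # hypothetical untrained filters
gram = style_matrix(feature_maps(one_hot("ACGTTGCAACGTAGCTAGCT"), filters))
# gram.shape is (8, 8): one entry per pair of filters.
```

Because the Gram matrix averages over sequence positions, its size depends only on the number of filters, so sequences of different lengths map to features of the same dimension.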
AVAILABILITY AND IMPLEMENTATION
The source code for the implementation of this genomic style method, along with the dataset for the performance evaluation, is available from https://github.com/friendflower94/binning-style.
SUPPLEMENTARY INFORMATION
Supplementary data are available online.