An artificial intelligence approach fit for tRNA gene studies in the era of big sequence data.

Author information

Iwasaki Yuki, Abe Takashi, Wada Kennosuke, Wada Yoshiko, Ikemura Toshimichi

Affiliations

Department of Bioscience, Nagahama Institute of Bio-Science and Technology.

Department of Information Engineering, Faculty of Engineering, Niigata University.

Publication information

Genes Genet Syst. 2017 Sep 12;92(1):43-54. doi: 10.1266/ggs.16-00068. Epub 2017 Mar 24.

DOI:10.1266/ggs.16-00068

PMID:28344190

Abstract

Unsupervised data mining capable of extracting a wide range of knowledge from big data without prior knowledge or particular models is a timely application in the era of big sequence data accumulation in genome research. By handling oligonucleotide compositions as high-dimensional data, we have previously modified the conventional self-organizing map (SOM) for genome informatics and established BLSOM, which can analyze more than ten million sequences simultaneously. Here, we develop BLSOM specialized for tRNA genes (tDNAs) that can cluster (self-organize) more than one million microbial tDNAs according to their cognate amino acid solely depending on tetra- and pentanucleotide compositions. This unsupervised clustering can reveal combinatorial oligonucleotide motifs that are responsible for the amino acid-dependent clustering, as well as other functionally and structurally important consensus motifs, which have been evolutionarily conserved. BLSOM is also useful for identifying tDNAs as phylogenetic markers for special phylotypes. When we constructed BLSOM with 'species-unknown' tDNAs from metagenomic sequences plus 'species-known' microbial tDNAs, a large portion of metagenomic tDNAs self-organized with species-known tDNAs, yielding information on microbial communities in environmental samples. BLSOM can also enhance accuracy in the tDNA database obtained from big sequence data. This unsupervised data mining should become important for studying numerous functionally unclear RNAs obtained from a wide range of organisms.
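
The abstract describes treating oligonucleotide compositions as high-dimensional data. As a rough illustration of that input representation only, the sketch below turns a DNA sequence into a vector of normalized tetranucleotide (or pentanucleotide) frequencies. The function name `kmer_composition`, the handling of ambiguous bases, and the normalization are assumptions made for this example; they are not taken from the paper, and the sketch does not implement BLSOM itself.

```python
from itertools import product


def kmer_composition(seq, k=4):
    """Return normalized k-mer frequencies of a DNA sequence as a dict.

    Hypothetical helper for illustration; not the authors' code.
    """
    seq = seq.upper()
    # All 4**k possible words over the DNA alphabet, in a fixed order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k along the sequence (overlapping words).
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # assumption: skip windows with ambiguous bases (e.g. N)
            counts[word] += 1
            total += 1
    if total == 0:
        return {w: 0.0 for w in kmers}
    # Divide by the number of counted windows so the frequencies sum to 1.
    return {w: c / total for w, c in counts.items()}


if __name__ == "__main__":
    vec = kmer_composition("ACGTACGT", k=4)
    print(len(vec), vec["ACGT"])
```

With k=4 each sequence maps to a 256-dimensional vector, and with k=5 to a 1024-dimensional one; vectors of this kind could then be fed to a self-organizing map or any other clustering method.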
