AI for the collective analysis of a massive number of genome sequences: various examples from the small genome of pandemic SARS-CoV-2 to the human genome.

Author information

Ikemura Toshimichi, Iwasaki Yuki, Wada Kennosuke, Wada Yoshiko, Abe Takashi

Affiliations

Faculty of Bioscience, Nagahama Institute of Bio-Science and Technology.

Department of Information Engineering, Faculty of Engineering, Niigata University.

Publication information

Genes Genet Syst. 2021 Dec 16;96(4):165-176. doi: 10.1266/ggs.21-00025. Epub 2021 Sep 27.

DOI:10.1266/ggs.21-00025

PMID:34565757

Abstract

In genetics and related fields, huge amounts of data, such as genome sequences, are accumulating, and artificial intelligence (AI) suited to big data analysis has become increasingly important. Unsupervised AI that can reveal novel knowledge from big data without prior knowledge or particular models is highly desirable for genome sequence analysis, particularly for obtaining unexpected insights. We have developed a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions that can reveal various novel genome characteristics. Here, we explain data mining with the BLSOM, an unsupervised AI. As a first specific target, we selected SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) because a large number of viral genome sequences have accumulated through worldwide efforts. We analyzed more than 0.6 million sequences collected primarily in the first year of the pandemic. BLSOMs for short oligonucleotides (e.g., 4- to 6-mers) separated the sequences into known clades, while longer oligonucleotides further increased the separation ability and revealed subgrouping within known clades. Most 15-mers occur only once in the genome; thus, 15-mers that appeared after the epidemic started could be connected to mutations, and the BLSOM for 15-mers revealed the mutations that contributed to separation into known clades and their subgroups. After introducing the detailed methodological strategies, we explain BLSOMs for various topics, such as the tetranucleotide BLSOM for over 5 million 5-kb fragment sequences derived from almost all currently available microorganisms and its use in metagenome studies. We also explain BLSOMs for various eukaryotes, including fishes, frogs and Drosophila species, which showed a high separation ability among closely related species. In the human genome, we found enrichment of transcription factor-binding sequences in centromeric and pericentromeric heterochromatin regions. The tDNAs (tRNA genes) could be separated according to their corresponding amino acid.
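
The two ideas the abstract relies on, oligonucleotide composition vectors fed to a batch-updated self-organizing map and 15-mers that are absent from a reference genome, can be illustrated with a short sketch. This is not the authors' BLSOM implementation: the function names (`kmer_composition`, `batch_som`, `novel_kmers`), the grid size, the random initialization and the neighborhood schedule are all assumptions chosen for brevity, and the published method may differ in each of these.

```python
from itertools import product

import numpy as np


def kmer_composition(seq, k=4):
    """Return the normalized k-mer frequency vector (length 4**k) of a DNA sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # windows with ambiguous bases (e.g. N) are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def batch_som(data, rows=4, cols=4, epochs=20, sigma0=2.0, seed=0):
    """Minimal batch SOM; returns node weights of shape (rows * cols, n_features).

    Assumption: nodes are initialized from random data samples and the
    neighborhood width shrinks linearly. These are illustrative choices only.
    """
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    grid_d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    weights = data[rng.integers(0, len(data), size=rows * cols)].copy()
    for epoch in range(epochs):
        sigma = sigma0 * (1.0 - epoch / epochs) + 0.5
        # Assign every input to its best-matching unit using the current weights.
        d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
        bmu = d2.argmin(axis=1)
        # Gaussian neighborhood between each node and each input's BMU.
        h = np.exp(-grid_d2[:, bmu] / (2.0 * sigma ** 2))  # shape (nodes, samples)
        # Batch update: all inputs contribute at once, so input order is irrelevant.
        weights = (h @ data) / h.sum(axis=1, keepdims=True)
    return weights


def novel_kmers(reference, genome, k=15):
    """Return the k-mers present in `genome` but absent from `reference`."""
    ref = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    return {genome[i:i + k] for i in range(len(genome) - k + 1)} - ref
```

In this sketch, each genome or genome fragment would be turned into one row of `data` with `kmer_composition`, and `batch_som` would then place similar compositions on nearby grid nodes. `novel_kmers` reflects the observation that when nearly every 15-mer is unique in the genome, a single substitution away from the sequence ends produces up to `k` new k-mers that point back to the mutated position.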
