Abe Takashi, Ikarashi Ryo, Mizoguchi Masaya, Otake Masashi, Ikemura Toshimichi
Department of Information Engineering, Faculty of Engineering, Niigata University.
Department of Bioscience, Nagahama Institute of Bio-Science and Technology.
Genes Genet Syst. 2020 Apr 22;95(1):11-19. doi: 10.1266/ggs.19-00041. Epub 2020 Mar 12.
Extensive decoding of genomic and metagenomic sequence data has produced a growing number of genes whose functions cannot be predicted by sequence similarity searches, and such genes are of little use to science or industry. Current genome and metagenome sequencing depends largely on high-throughput, low-cost methods. When sequencing the genome of a single species, high-density sequencing can reduce sequencing errors. For metagenome sequences, however, high-density sequencing does not necessarily improve sequence quality, because the sample is likely to contain multiple unknown genomes, including those of closely related species. A function prediction method that is robust against sequence errors is therefore increasingly needed. Here, we present a method for predicting protein gene function that does not depend on sequence similarity searches. Using an unsupervised machine learning method called BLSOM (batch-learning self-organizing map) applied to short oligopeptide frequencies, we previously developed a sequence alignment-free method that clusters bacterial protein genes according to clusters of orthologous groups of proteins (COGs), without using COG information during machine learning. That method allows function-unknown proteins to cluster with function-known proteins based solely on similarity in oligopeptide frequency, but it required high-performance supercomputers (HPCs). Building on the knowledge obtained with HPCs, we have now developed a strategy that correlates function-unknown proteins with COG categories using only oligopeptide frequency distances (OPDs) and that can be run on PC-level computers. The OPD strategy is suitable for predicting the functions of proteins with low sequence similarity, and we apply it here to predict the functions of a large number of gene candidates discovered by metagenome sequencing.
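The idea of relating a function-unknown protein to a COG category through oligopeptide frequency distance can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the peptide length (dipeptides, `k=2`), the Euclidean metric, the nearest-reference assignment rule, the function names, and the toy sequences and labels.

```python
from collections import Counter
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def oligopeptide_frequencies(seq, k=2):
    """Return the normalized frequency vector of all k-residue peptides in seq.

    Assumption: k=2 (dipeptides) is chosen only to keep the example small.
    Peptides containing non-standard residues are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    total = sum(counts[key] for key in keys)
    if total == 0:
        return [0.0] * len(keys)
    return [counts[key] / total for key in keys]


def opd(freq_a, freq_b):
    """Distance between two frequency vectors.

    Assumption: Euclidean distance stands in for the paper's OPD definition.
    """
    return sqrt(sum((x - y) ** 2 for x, y in zip(freq_a, freq_b)))


def nearest_category(query_seq, references, k=2):
    """Return the label of the reference protein closest to the query.

    `references` is a list of (label, sequence) pairs. The nearest-neighbor
    rule is a hypothetical simplification, not the authors' procedure.
    """
    query = oligopeptide_frequencies(query_seq, k)
    label, _ = min(
        references,
        key=lambda item: opd(query, oligopeptide_frequencies(item[1], k)),
    )
    return label


# Toy reference set: made-up sequences with made-up category labels.
toy_references = [
    ("J", "MKKLLKKLLKK"),
    ("C", "GGSGGSGGSGG"),
]
predicted = nearest_category("MKKLLKKL", toy_references)
```

No alignment is computed at any point: each protein is reduced to a fixed-length vector (400 entries for dipeptides), so the comparison stays cheap enough for an ordinary PC, and a few erroneous residues change only a few vector entries.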