A strategy for predicting gene functions from genome and metagenome sequences on the basis of oligopeptide frequency distance.

Authors

Abe Takashi, Ikarashi Ryo, Mizoguchi Masaya, Otake Masashi, Ikemura Toshimichi

Affiliations

Department of Information Engineering, Faculty of Engineering, Niigata University.

Department of Bioscience, Nagahama Institute of Bio-Science and Technology.

Publication information

Genes Genet Syst. 2020 Apr 22;95(1):11-19. doi: 10.1266/ggs.19-00041. Epub 2020 Mar 12.

DOI: 10.1266/ggs.19-00041
PMID: 32161228
Abstract

As a result of the extensive decoding of a massive amount of genomic and metagenomic sequence data, a large number of genes whose functions cannot be predicted by sequence similarity searches are accumulating, and such genes are of little use to science or industry. Current genome and metagenome sequencing largely depends on high-throughput and low-cost methods. In the case of genome sequencing for a single species, high-density sequencing can reduce sequencing errors. For metagenome sequences, however, high-density sequencing does not necessarily increase the sequence quality because multiple and unknown genomes, including those of closely related species, are likely to exist in the sample. Therefore, a function prediction method that is robust against sequence errors is increasingly needed. Here, we present a method for predicting protein gene function that does not depend on sequence similarity searches. Using an unsupervised machine learning method called BLSOM (batch-learning self-organizing map) for short oligopeptide frequencies, we previously developed a sequence alignment-free method for clustering bacterial protein genes according to clusters of orthologous groups of proteins (COGs), without using information from COGs during machine learning. This allows function-unknown proteins to cluster with function-known proteins, based solely on similarity with respect to oligopeptide frequency, although the method required high-performance supercomputers (HPCs). Based on a wide range of knowledge obtained with HPCs, we have now developed a strategy to correlate function-unknown proteins with COG categories, using only oligopeptide frequency distances (OPDs), which can be conducted with PC-level computers. The OPD strategy is suitable for predicting the functions of proteins with low sequence similarity and is applied here to predict the functions of a large number of gene candidates discovered using metagenome sequencing.
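
The general idea of assigning a function-unknown protein to a COG category by oligopeptide frequency distance can be illustrated with a minimal sketch. The abstract does not state the oligopeptide length, the distance metric, or the assignment rule, so the choices below (dipeptides, Euclidean distance, single nearest neighbor) are assumptions made for illustration only, and the function names and toy sequences are hypothetical rather than taken from the paper.

```python
from collections import Counter
from itertools import product
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def oligopeptide_frequencies(seq, k=2):
    """Normalized frequency of every possible k-residue peptide in seq.

    k=2 (dipeptides, 400 dimensions) is an assumption, not the paper's setting.
    K-mers containing non-standard residues are ignored.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def opd(vec_a, vec_b):
    """Euclidean distance between two frequency vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))


def nearest_category(query_seq, references, k=2):
    """Return the category label of the reference closest to the query.

    references: list of (category_label, protein_sequence) pairs.
    Nearest-neighbor assignment is an illustrative rule only.
    """
    query_vec = oligopeptide_frequencies(query_seq, k)
    label, _ = min(
        references,
        key=lambda ref: opd(query_vec, oligopeptide_frequencies(ref[1], k)),
    )
    return label


# Toy, made-up sequences labeled with COG category letters.
toy_references = [
    ("J", "MKKLLKKLLKKLL"),
    ("C", "GAGAGAGAGAGAG"),
]
predicted = nearest_category("MKKLLKKALKKLL", toy_references)
```

Because the comparison uses only composition vectors, no alignment is computed, and a single substituted residue changes only a few vector components. This is the property that makes a composition-based distance tolerant of isolated sequence errors.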

Similar articles

1
A strategy for predicting gene functions from genome and metagenome sequences on the basis of oligopeptide frequency distance.
Genes Genet Syst. 2020 Apr 22;95(1):11-19. doi: 10.1266/ggs.19-00041. Epub 2020 Mar 12.
2
A novel bioinformatics strategy for function prediction of poorly-characterized protein genes obtained from metagenome analyses.
DNA Res. 2009 Oct;16(5):287-97. doi: 10.1093/dnares/dsp018. Epub 2009 Oct 3.
3
MetaEuk-sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics.
Microbiome. 2020 Apr 3;8(1):48. doi: 10.1186/s40168-020-00808-x.
4
From Gene Annotation to Function Prediction for Metagenomics.
Methods Mol Biol. 2017;1611:27-34. doi: 10.1007/978-1-4939-7015-5_3.
5
[Analysis, identification and correction of some errors of model refseqs appeared in NCBI Human Gene Database by in silico cloning and experimental verification of novel human genes].
Yi Chuan Xue Bao. 2004 May;31(5):431-43.
6
A Novel Bioinformatics Strategy to Analyze Microbial Big Sequence Data for Efficient Knowledge Discovery: Batch-Learning Self-Organizing Map (BLSOM).
Microorganisms. 2013 Nov 20;1(1):137-157. doi: 10.3390/microorganisms1010137.
7
A novel bioinformatics strategy for searching industrially useful genome resources from metagenomic sequence libraries.
Genes Genet Syst. 2011;86(1):53-66. doi: 10.1266/ggs.86.53.
8
Evaluating metagenomics tools for genome binning with real metagenomic datasets and CAMI datasets.
BMC Bioinformatics. 2020 Jul 28;21(1):334. doi: 10.1186/s12859-020-03667-3.
9
Fast and simple protein-alignment-guided assembly of orthologous gene families from microbiome sequencing reads.
Microbiome. 2017 Jan 25;5(1):11. doi: 10.1186/s40168-017-0233-2.
10
Large-scale metagenomic sequence clustering on map-reduce clusters.
J Bioinform Comput Biol. 2013 Feb;11(1):1340001. doi: 10.1142/S0219720013400015. Epub 2012 Dec 25.