High-quality sequence clustering guided by network topology and multiple alignment likelihood.

Affiliation

Laboratoire Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRA, UMR5558, Villeurbanne, France.

Publication information

Bioinformatics. 2012 Apr 15;28(8):1078-85. doi: 10.1093/bioinformatics/bts098. Epub 2012 Feb 25.

DOI: 10.1093/bioinformatics/bts098
PMID: 22368255

Abstract

MOTIVATION

Proteins can be naturally classified into families of homologous sequences that derive from a common ancestor. The comparison of homologous sequences and the analysis of their phylogenetic relationships provide useful information regarding the function and evolution of genes. One important difficulty of clustering methods is to distinguish highly divergent homologous sequences from sequences that only share partial homology due to evolution by protein domain rearrangements. Existing clustering methods require parameters that have to be set a priori. Given the variability in the evolution pattern among proteins, these parameters cannot be optimal for all gene families.

RESULTS

We propose a strategy that aims at clustering sequences homologous over their entire length, and that takes into account the pattern of substitution specific to each gene family. Sequences are first all compared with each other and clustered into pre-families, based on pairwise similarity criteria, with permissive parameters to optimize sensitivity. Pre-families are then divided into homogeneous clusters, based on the topology of the similarity network. Finally, clusters are progressively merged into families, for which we compute multiple alignments, and we use a model selection technique to find the optimal tradeoff between the number of families and multiple alignment likelihood. To evaluate this method, called HiFiX, we analyzed simulated sequences and manually curated datasets. These tests showed that HiFiX is the only method robust to both sequence divergence and domain rearrangements. HiFiX is fast enough to be used on very large datasets.
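
The first and last stages described above can be illustrated with a minimal Python sketch. It is not the HiFiX implementation: the function names, the identity and coverage thresholds, and the linear per-family penalty are all assumptions made for illustration, since the abstract does not specify the similarity criteria or the model selection criterion. Pre-families are approximated as connected components of a thresholded similarity network, and the final choice is approximated by a penalized log-likelihood score. The intermediate topology-based splitting step is omitted.

```python
from collections import defaultdict


def pre_families(pairs, min_identity=0.25, min_coverage=0.5):
    """Group sequences into connected components of a similarity network.

    `pairs` is an iterable of (seq_a, seq_b, identity, coverage) tuples.
    The thresholds are hypothetical "permissive" values, not HiFiX defaults.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving; registers unseen sequences.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in pairs:
        root_a, root_b = find(a), find(b)
        linked = identity >= min_identity and coverage >= min_coverage
        if linked and root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for seq in list(parent):
        groups[find(seq)].add(seq)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


def best_partition(candidates, penalty_per_family):
    """Pick the (n_families, log_likelihood) pair with the best penalized score.

    The linear penalty is an assumed stand-in for the model selection
    technique mentioned in the abstract.
    """
    return max(candidates, key=lambda c: c[1] - penalty_per_family * c[0])


# Toy similarity network: c-d fails on identity, d-e fails on coverage.
toy_pairs = [
    ("a", "b", 0.90, 0.95),
    ("b", "c", 0.30, 0.80),
    ("c", "d", 0.10, 0.90),
    ("d", "e", 0.60, 0.20),
    ("e", "f", 0.50, 0.70),
]
families = pre_families(toy_pairs)

# Hypothetical candidate partitions: (number of families, total log-likelihood).
chosen = best_partition([(1, -1000.0), (2, -900.0), (3, -880.0)], penalty_per_family=50.0)
```

Requiring both identity and coverage on each edge reflects the aim of clustering sequences that are homologous over their entire length. A single permissive threshold of this kind is exactly what the later, family-specific stages are meant to refine.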

AVAILABILITY AND IMPLEMENTATION

The Python software HiFiX is freely available at http://lbbe.univ-lyon1.fr/hifix.
