Blast sampling for structural and functional analyses.

Authors

Friedrich Anne, Ripp Raymond, Garnier Nicolas, Bettler Emmanuel, Deléage Gilbert, Poch Olivier, Moulinier Luc

Affiliation

Laboratoire de Bioinformatique et Génomique Intégratives, Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch, France.

Publication

BMC Bioinformatics. 2007 Feb 23;8:62. doi: 10.1186/1471-2105-8-62.

Abstract

BACKGROUND

The post-genomic era is characterised by a torrent of biological information flooding the public databases. As a direct consequence, similarity searches starting with a single query sequence frequently lead to the identification of hundreds, or even thousands of potential homologues. The huge volume of data renders the subsequent structural, functional and evolutionary analyses very difficult. It is therefore essential to develop new strategies for efficient sampling of this large sequence space, in order to reduce the number of sequences to be processed. At the same time, it is important to retain the most pertinent sequences for structural and functional studies.

RESULTS

An exhaustive analysis on a large-scale test set (284 protein families) was performed to compare the efficiency of four different sampling methods aimed at selecting the most pertinent sequences. These four methods sample the proteins detected by BlastP searches and can be divided into two categories: two customisable methods, in which the user defines either the maximal number or the percentage of sequences to be selected, and two automatic methods, in which the number of sequences selected is determined by the program. We focused our analysis on the potential information content of the sampled sets of sequences, using multiple alignment of complete sequences as the main validation tool. The study considered two criteria: the total number of sequences detected by BlastP and their associated E-values. The subsequent analyses investigated, as a function of these criteria, the influence of the sampling methods on the E-value distributions, the sequence coverage, the final multiple alignment quality and the active site characterisation at various residue conservation thresholds.
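To make the two customisable modes concrete, the following Python sketch shows what "maximal number" and "percentage" sampling of BlastP hits could look like. It is an illustration only, not the authors' implementation: the abstract does not state how hits are chosen, so the evenly spaced selection over the E-value-sorted list is an assumption, and the names `Hit`, `evenly_spaced`, `sample_by_max_count` and `sample_by_percentage` are hypothetical.

```python
import math
from typing import List, Tuple

# Hypothetical record type: (sequence identifier, BlastP E-value).
Hit = Tuple[str, float]


def evenly_spaced(hits: List[Hit], n: int) -> List[Hit]:
    """Pick n hits spread evenly over the E-value-sorted list.

    ASSUMPTION: even spacing is an illustrative selection rule chosen
    here; the abstract does not describe the actual rule.
    """
    ranked = sorted(hits, key=lambda h: h[1])  # best (lowest) E-value first
    if n >= len(ranked):
        return ranked
    if n <= 0:
        return []
    if n == 1:
        return [ranked[0]]
    last = len(ranked) - 1
    # Integer arithmetic keeps the indices distinct and includes both ends.
    return [ranked[(i * last) // (n - 1)] for i in range(n)]


def sample_by_max_count(hits: List[Hit], max_n: int) -> List[Hit]:
    """Customisable mode 1: the user sets the maximal number of sequences."""
    return evenly_spaced(hits, max_n)


def sample_by_percentage(hits: List[Hit], percent: float) -> List[Hit]:
    """Customisable mode 2: the user sets the percentage of sequences."""
    n = max(1, math.ceil(len(hits) * percent / 100))
    return evenly_spaced(hits, n)
```

For ten hits with E-values from 1e-9 to 1, `sample_by_max_count(hits, 4)` keeps four hits that span the whole E-value range, and `sample_by_percentage(hits, 30)` keeps three. The two automatic methods are not sketched because the abstract gives no detail on how the program determines the sample size.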

CONCLUSION

The comparative analysis of the four sampling methods allows us to propose a suitable sampling strategy that significantly reduces the number of homologous sequences required for alignment, while at the same time maintaining the relevant information concerning the active site residues.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1bad/1819393/d2989ad14966/1471-2105-8-62-1.jpg
